foliautils

Command-line utilities for working with the Format for Linguistic Annotation (FoLiA).

Provided tools & services

FoLiA-2text

Convert FoLiA documents into plain text
Type
  • Command-line Application
Executable name
FoLiA-2text
Input data
Type
TextDigitalDocument
Encoding Format
application/folia+xml
Output data
Type
TextDigitalDocument
Encoding Format
text/plain
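
As a rough illustration of what converting FoLiA to plain text involves, the sketch below pulls word-level text out of a small inline FoLiA fragment using only the Python standard library. It does not reflect how FoLiA-2text is implemented. The sample document and the `words_to_text` helper are assumptions made for this example, and real documents can carry text at several structural levels and in several text classes.

```python
import xml.etree.ElementTree as ET

# FoLiA XML namespace.
FOLIA_NS = "http://ilk.uvt.nl/folia"
NS = {"f": FOLIA_NS}

# Hypothetical, heavily trimmed FoLiA fragment used only for this sketch.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <p xml:id="example.p.1">
      <s xml:id="example.p.1.s.1">
        <w xml:id="example.p.1.s.1.w.1"><t>Hello</t></w>
        <w xml:id="example.p.1.s.1.w.2"><t>world</t></w>
      </s>
    </p>
  </text>
</FoLiA>"""


def words_to_text(xml_string):
    """Join the text content of every <w> element with single spaces."""
    root = ET.fromstring(xml_string)
    tokens = []
    for word in root.iter(f"{{{FOLIA_NS}}}w"):
        t = word.find("f:t", NS)  # direct <t> child of the word
        if t is not None and t.text:
            tokens.append(t.text.strip())
    return " ".join(tokens)


print(words_to_text(SAMPLE))
```

Restricting the walk to `<w>` elements avoids emitting the same text twice when a document also stores text on sentences or paragraphs.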

FoLiA-alto

Convert ALTO DIDL files into a series of FoLiA documents
Type
  • Command-line Application
Executable name
FoLiA-alto

FoLiA-clean

Produce a cleaned-up version of a FoLiA file, or of a whole directory of FoLiA files, by removing specified annotation types and text classes
Type
  • Command-line Application
Executable name
FoLiA-clean
Input data
Type
TextDigitalDocument
Encoding Format
application/folia+xml
Output data
Type
TextDigitalDocument
Encoding Format
application/folia+xml

FoLiA-collect

Collect n-gram statistics from tsv files produced by FoLiA-stats, aggregating results.
Type
  • Command-line Application
Executable name
FoLiA-collect
Input data
Type
Dataset
Encoding Format
text/tab-separated-values
Output data
Type
Dataset
Encoding Format
text/tab-separated-values
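
The aggregation step can be pictured as summing counts for identical n-grams across several files. The sketch below is only that idea in miniature: the two-column layout (n-gram, then count) is an assumption for illustration and may not match the columns FoLiA-stats actually writes, and `aggregate_counts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def aggregate_counts(tsv_streams):
    """Sum the counts for identical n-grams across several TSV streams."""
    totals = Counter()
    for stream in tsv_streams:
        reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            totals[row[0]] += int(row[1])
    return totals


# Two hypothetical inputs standing in for per-corpus statistics files.
part_a = io.StringIO("the cat\t3\nthe dog\t1\n")
part_b = io.StringIO("the cat\t2\na bird\t5\n")

totals = aggregate_counts([part_a, part_b])
for ngram, count in sorted(totals.items()):
    print(f"{ngram}\t{count}")
```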

FoLiA-correct

Correct FoLiA documents using correction candidates generated by TICCL-rank (from ticcltools)
Type
  • Command-line Application
Executable name
FoLiA-correct
Input data
Type
TextDigitalDocument
Encoding Format
application/folia+xml
Output data
Type
TextDigitalDocument
Encoding Format
application/folia+xml

FoLiA-hocr

Convert hOCR (as produced by Tesseract) to FoLiA
Type
  • Command-line Application
Executable name
FoLiA-hocr
Input data
Type
TextDigitalDocument
Encoding Format
text/html
Output data
Type
TextDigitalDocument
Encoding Format
application/folia+xml

FoLiA-idf

Count words in a series of FoLiA documents and compute IDF statistics, which are written to a TSV file
Type
  • Command-line Application
Executable name
FoLiA-idf
Input data
Type
TextDigitalDocument
Encoding Format
application/folia+xml
Output data
Type
Dataset
Encoding Format
text/tab-separated-values
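
For readers unfamiliar with IDF (inverse document frequency), the sketch below computes the textbook variant, log(N / df), over a toy collection. Whether FoLiA-idf uses this exact formula, smoothing, or logarithm base is not stated here, so treat the function as an assumption-laden illustration and not a description of the tool.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Return {word: log(N / df)}, where df is the number of documents containing the word."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Hypothetical tokenised documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "slept"],
]
idf = inverse_document_frequency(docs)
for word in sorted(idf):
    print(f"{word}\t{idf[word]:.4f}")
```

A word that occurs in every document, such as "the" above, gets an IDF of zero, while rarer words score higher.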

FoLiA-langcat

Language Identification using textcat.
Type
  • Command-line Application
Executable name
FoLiA-langcat
Input data
Type
TextDigitalDocument
Encoding Format
application/folia+xml
Output data
Type
TextDigitalDocument
Encoding Format
application/folia+xml

FoLiA-page

Convert PAGE XML to FoLiA
Type
  • Command-line Application
Executable name
FoLiA-page
Input data
Type
TextDigitalDocument
Encoding Format
application/page+xml
Output data
Type
TextDigitalDocument
Encoding Format
application/folia+xml

FoLiA-pm

Convert Political Mashup XML to FoLiA
Type
  • Command-line Application
Executable name
FoLiA-pm
Output data
Type
TextDigitalDocument
Encoding Format
application/folia+xml

FoLiA-stats

Gather n-gram statistics over a series of FoLiA documents
Type
  • Command-line Application
Executable name
FoLiA-stats
Input data
Type
TextDigitalDocument
Encoding Format
application/folia+xml
Output data
Type
Dataset
Encoding Format
text/tab-separated-values
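
To make "n-gram statistics" concrete, the sketch below counts contiguous n-grams in a token list and prints them as tab-separated lines. It is a generic illustration with a hypothetical `count_ngrams` helper. It does not reproduce the options or the output columns of FoLiA-stats.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


# Hypothetical token sequence.
tokens = "to be or not to be".split()
bigrams = count_ngrams(tokens, 2)
for gram, freq in bigrams.most_common():
    print(f"{gram}\t{freq}")
```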

FoLiA-txt

Convert plain text to FoLiA. The output will contain only <p> and <str> nodes. See ucto or rst2folia (FoLiA-tools) for alternatives.
Type
  • Command-line Application
Executable name
FoLiA-txt
Input data
Type
TextDigitalDocument
Encoding Format
text/plain
Output data
Type
TextDigitalDocument
Encoding Format
application/folia+xml

FoLiA-wordtranslate

Simple word-by-word translator on the basis of a dictionary and/or rewrite rules
Type
  • Command-line Application
Executable name
FoLiA-wordtranslate
Input data
Type
TextDigitalDocument
Encoding Format
application/folia+xml
Output data
Type
TextDigitalDocument
Encoding Format
application/folia+xml
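
The general idea of combining a dictionary with rewrite rules can be sketched as follows: look a word up first, and fall back to pattern-based rewriting when there is no entry. The lexicon, the rules, and the function names are all hypothetical. The dictionary and rule file formats that FoLiA-wordtranslate expects are not described here.

```python
import re

# Hypothetical lexicon and rewrite rules, invented for this example.
DICTIONARY = {"ende": "en"}
RULES = [
    (re.compile(r"ae"), "aa"),
    (re.compile(r"ck$"), "k"),
]


def translate_word(word):
    """Prefer a dictionary entry; otherwise apply every rewrite rule in order."""
    if word in DICTIONARY:
        return DICTIONARY[word]
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word


def translate(tokens):
    """Translate a token list one word at a time, ignoring context."""
    return [translate_word(token) for token in tokens]


print(translate(["ende", "jaer", "boeck"]))
```

Words that match neither the dictionary nor any rule pass through unchanged.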

Tool suite: FoLiA

The following closely related tools are in a tool suite together with foliautils:

  • Command-line Application
  • 9 - Proven: Technology complete and proven in practice by real users.
  • Active: The project has reached a stable, usable state and is being actively developed.

FoLiA tools 2.5.9

  •   KNAW Humanities Cluster & CLST, Radboud University
FoLiA-tools contains various Python-based command line tools for working with FoLiA XML (Format for Linguistic Annotation)
  • Annotating
  • https://w3id.org/nwo-research-fields#ComputationalLinguisticsandPhilology
  • Textual and linguistic corpora
  • annotation
  • computational linguistics
  • folia
  • nlp
  • search
  • BSD
  • Linux
  • macOS
  • Python
Created: 2011-01-14
Modified: 2025-05-08
  • Software Library
  • 9 - Proven: Technology complete and proven in practice by real users.
  • Active: The project has reached a stable, usable state and is being actively developed.

FoLiApy 2.5.12

  •   KNAW Humanities Cluster & CLST, Radboud University
An extensive library for processing FoLiA documents. FoLiA stands for Format for Linguistic Annotation and is a very rich XML-based format used by various Natural Language Processing tools.
  • Annotating
  • https://w3id.org/nwo-research-fields#ComputationalLinguisticsandPhilology
  • Textual and linguistic corpora
  • annotation
  • computational linguistics
  • folia
  • format
  • nlp
  • xml
  • BSD
  • Linux
  • macOS
  • Python
Created: 2010-05-27
Modified: 2024-10-11
  • Software Library
  • Inactive: The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.

folia 0.0.6

  •   KNAW Humanities Cluster & CLST, Radboud University
High-performance library for handling the FoLiA XML format (Format for Linguistic Annotation)
  • science, text-processing
  • annotation
  • linguistics
  • nlp
  • text-processing
  • xml
Created: 2019-06-08
Modified: 2020-11-16
  • Command-line Application
  • Software Library
  • Active: The project has reached a stable, usable state and is being actively developed.

libfolia 2.21.1

This is a C++ library for working with the Format for Linguistic Annotation (FoLiA).
  • folia
  • linguistic annotation
  • natural language processing
  • nlp
  • xml
  • POSIX
  • Web Application
  • 8 - Complete: Technology complete and qualified, released for all end-users in scholarly environments.
  • Inactive: The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.

piereling 0.4

  •   KNAW Humanities Cluster & CLST, Radboud University
Piereling is a webservice and web-application to convert between a variety of document formats, mostly from and to FoLiA XML. It is intended for NLP pipelines.
  • Internet > WWW/HTTP > WSGI > Application
  • Text Processing > Linguistic
  • webservice nlp computational_linguistics rest folia conversion
  • BSD
  • Linux
  • macOS
  • Python
Created: 2019-10-18
Modified: 2023-11-01

References

Citation

Please use one of the above reference publications to cite the software. If you want to cite the software directly, you can use the following citation generated from the metadata:

Logs & Reviews

Name
Automatic software metadata validation report for foliautils 0.23
Author
  • codemetapy validator using software.ttl
Date
2025-06-25 04:17:12
Review
Please consult the CLARIAH Software Metadata Requirements at https://github.com/CLARIAH/clariah-plus/blob/main/requirements/software-metadata-requirements.md for an in-depth explanation of any found problems

Validation of foliautils 0.23 was successful (score=3/5), but there are some warnings which should be addressed:

1. Warning: Documentation *SHOULD* be expressed (This is missing in the metadata)
2. Info: The funder *SHOULD* be acknowledged (This is missing in the metadata)
3. Info: The technology readiness level *SHOULD* be expressed (This is missing in the metadata)
Rating
★ ★ ★ ☆ ☆
(log file starts at Wed Jun 25 04:17:06 UTC 2025)

[harvester info] --> Processing foliautils (https://github.com/languagemachines/foliautils) [Wed Jun 25 04:17:06 UTC 2025]

[harvester info] Git updating cached clone of https://github.com/languagemachines/foliautils...

[harvester info] Found release v0.23

[harvester info] Using 'v0.23'

[harvester info] Git reference: v0.23

[harvester info] Scanning directory /tmp/codemeta-harvester.cache/foliautils for harvestable resources...

[harvester info] found codemeta.json for foliautils (md5sum 74a53e2b5996399c452d74b8594d4daf); **NOTE: this is considered authoritative and most other detection methods will be skipped now!**

[harvester info] Inferring repostatus information from git activity (used only as a fallback if not explicitly provided)...

[harvester info] Inferred repostatus https://www.repostatus.org/#active

[harvester info] Looking for repostatus information in README.md in master branch...

[harvester info] Found repostatus (master branch) https://www.repostatus.org/#active

[harvester info] Looking for repostatus information in README in master branch...

[harvester info] Parsing MAINTAINERS from master branch...

[harvester info] Setting group FoLiA

[harvester info] Reconciliating: codemetapy  --baseuri https://tools.clariah.nl --baseuri https://tools.clariah.nl --includecontext --addcontext https://w3id.org/nwo-research-fields --addcontext https://w3id.org/research-technology-readiness-levels --addcontextgraph https://vocabs.dariah.eu/rest/v1/tadirah/data?format=text/turtle --trl --identifier "foliautils" --codeRepository "https://github.com/languagemachines/foliautils" --validate /etc/software.ttl --released --enrich --textv "Please consult the CLARIAH Software Metadata Requirements at https://github.com/CLARIAH/clariah-plus/blob/main/requirements/software-metadata-requirements.md for an in-depth explanation of any found problems" -O /tmp/out/foliautils.codemeta.json /tmp/codemeta-harvester.cache//tmp/99-repostatus.foliautils.codemeta.json /tmp/codemeta-harvester.cache//tmp/10-jsonld.foliautils.codemeta.json /tmp/codemeta-harvester.cache//tmp/05-repostatus.foliautils.codemeta.json /tmp/codemeta-harvester.cache//tmp/05-maintainers.foliautils.codemeta.json /tmp/codemeta-harvester.cache//tmp/04-applicationSuite.foliautils.codemeta.json 

-- begin log --

Passed 5 files/sources but specified 0 input types! Automatically guessing types...

Detected input types: [('/tmp/codemeta-harvester.cache//tmp/99-repostatus.foliautils.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/10-jsonld.foliautils.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/05-repostatus.foliautils.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/05-maintainers.foliautils.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/04-applicationSuite.foliautils.codemeta.json', 'json')]

Adding to contextgraph: /tmp/turtle

Initial URI automatically generated, may be overriden later: https://tools.clariah.nl/foliautils

Processing source #1 of 5

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/99-repostatus.foliautils.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.clariah.nl/foliautils

[CODEMETA COMPOSITION (https://tools.clariah.nl/foliautils)] processed 1 new triples, total is now 2

Processing source #2 of 5

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/10-jsonld.foliautils.codemeta.json

    Injected (possibly temporary) URI https://tools.clariah.nl/foliautils

[CODEMETA 2 TO 3] Updating contIntegration -> continuousIntegration

[CODEMETA 2 TO 3] Updating targetProduct -> isSourceCodeOf

[CODEMETA 2 TO 3] Updating targetProduct -> isSourceCodeOf

[CODEMETA 2 TO 3] Updating targetProduct -> isSourceCodeOf

[CODEMETA 2 TO 3] Updating targetProduct -> isSourceCodeOf

[CODEMETA 2 TO 3] Updating targetProduct -> isSourceCodeOf

[CODEMETA 2 TO 3] Updating targetProduct -> isSourceCodeOf

[CODEMETA 2 TO 3] Updating targetProduct -> isSourceCodeOf

[CODEMETA 2 TO 3] Updating targetProduct -> isSourceCodeOf

[CODEMETA 2 TO 3] Updating targetProduct -> isSourceCodeOf

[CODEMETA 2 TO 3] Updating targetProduct -> isSourceCodeOf

[CODEMETA 2 TO 3] Updating targetProduct -> isSourceCodeOf

[CODEMETA 2 TO 3] Updating targetProduct -> isSourceCodeOf

[CODEMETA 2 TO 3] Updating targetProduct -> isSourceCodeOf

[CODEMETA CORRECTION (foliautils)] automatically converting spdx license URI from https:// to http:///

[CODEMETA COMPOSITION (foliautils)] processed 198 new triples, total is now 198

Processing source #3 of 5

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/05-repostatus.foliautils.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.clariah.nl/foliautils

[CODEMETA COMPOSITION (foliautils)] processed 1 new triples, total is now 198

Processing source #4 of 5

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/05-maintainers.foliautils.codemeta.json

    Found main resource with URI https://tools.clariah.nl/maintainers/snapshot

    Injected (possibly temporary) URI https://tools.clariah.nl/foliautils

[CODEMETA COMPOSITION (foliautils)] processed 14 new triples, total is now 211

Processing source #5 of 5

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/04-applicationSuite.foliautils.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.clariah.nl/foliautils

[CODEMETA COMPOSITION (foliautils)] processed 1 new triples, total is now 212

Remapping URI to (possibly) new identifier and version component: https://tools.clariah.nl/foliautils -> https://tools.clariah.nl/foliautils/0.23

[CODEMETA VALIDATION (foliautils)] done

[CODEMETA ENRICHMENT (foliautils)] adding author https://tools.clariah.nl/stub/H0300ba4d310e7cb6 as contributor

[CODEMETA ENRICHMENT (foliautils)] adding author https://orcid.org/0000-0002-1046-0006 as contributor

[CODEMETA ENRICHMENT (foliautils)] adding affiliation(s) of first author as producer

VALIDATION https://tools.clariah.nl/foliautils/0.23 #1: Warning: Documentation *SHOULD* be expressed (This is missing in the metadata)

VALIDATION https://tools.clariah.nl/foliautils/0.23 #2: Info: The funder *SHOULD* be acknowledged (This is missing in the metadata)

VALIDATION https://tools.clariah.nl/foliautils/0.23 #3: Info: The technology readiness level *SHOULD* be expressed (This is missing in the metadata)

-- end log --

[harvester info] Output written to /tmp/out/foliautils.codemeta.json

[harvester info] <-- Finished processing foliautils (https://github.com/languagemachines/foliautils) [Wed Jun 25 04:17:12 UTC 2025]
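
The validation findings in a log like the one above follow a regular `VALIDATION <uri> #<n>: <level>: <message>` shape, so they can be pulled out with a small script. The sketch below is one way to do that. The regular expression is inferred only from the lines shown in this log, and `parse_validation` is a hypothetical helper, not part of codemetapy or the harvester.

```python
import re

# Pattern inferred from lines such as:
# VALIDATION https://tools.clariah.nl/foliautils/0.23 #1: Warning: Documentation ...
VALIDATION_RE = re.compile(
    r"^VALIDATION (?P<uri>\S+) #(?P<num>\d+): (?P<level>\w+): (?P<message>.*)$"
)


def parse_validation(log_text):
    """Return (number, level, message) tuples for every VALIDATION line."""
    findings = []
    for line in log_text.splitlines():
        match = VALIDATION_RE.match(line.strip())
        if match:
            findings.append((int(match["num"]), match["level"], match["message"]))
    return findings


# Excerpt copied from the log above.
LOG = """\
[CODEMETA VALIDATION (foliautils)] done
VALIDATION https://tools.clariah.nl/foliautils/0.23 #1: Warning: Documentation *SHOULD* be expressed (This is missing in the metadata)
VALIDATION https://tools.clariah.nl/foliautils/0.23 #2: Info: The funder *SHOULD* be acknowledged (This is missing in the metadata)
-- end log --
"""

for num, level, message in parse_validation(LOG):
    print(num, level, message)
```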

        

Metadata Properties

Version
0.23 (release notes)
Interface types
  • Command-line Application
Software website
Source code repository
https://github.com/languagemachines/foliautils
Keywords
  • folia
  • linguistic annotation
  • natural language processing
  • nlp
  • xml
Development Status
  • Active: The project has reached a stable, usable state and is being actively developed.
Issue Tracker (Support)
https://github.com/LanguageMachines/foliautils/issues
Documentation
License
Author(s)
Maintainer(s)
Contributor(s)
Producer
Programming Language
  • C++
Continuous Integration Tests
None
Operating System
  • POSIX
Software dependencies
  • ticcutils
  • libxml2
  • icu
  • libfolia
Metadata validation
★ ★ ★ ☆ ☆