ucto

Ucto tokenizes text files: it separates words from punctuation and splits text into sentences. This is one of the first tasks in almost any Natural Language Processing application. Ucto also offers several other basic preprocessing steps, such as case conversion, that you can use to prepare your text for further processing such as indexing, part-of-speech tagging, or machine translation.
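To make the three steps above concrete (separating words from punctuation, splitting sentences, and changing case), here is a minimal Python sketch. It does not use Ucto and does not reproduce Ucto's rules: the pattern `TOKEN_PATTERN`, the set `SENTENCE_END`, and the function `tokenize` are hypothetical names invented for this illustration, and the rules are deliberately naive.

```python
import re

# Hypothetical, simplified rules for illustration only. Ucto's own rules are
# language-specific and considerably more extensive than this.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def tokenize(text, lowercase=False):
    """Return a list of sentences, each a list of word and punctuation tokens."""
    sentences, current = [], []
    for match in TOKEN_PATTERN.finditer(text):
        token = match.group()
        current.append(token.lower() if lowercase else token)
        # Naive sentence splitting: close the sentence at ., ! or ?
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack final punctuation
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in tokenize("Hello, world! Isn't this simple?", lowercase=True):
        print(" ".join(sentence))
```

A rule this simple breaks sentences after abbreviations such as "Dr." and ignores quotes and URLs, which is the kind of case a dedicated tokenizer is meant to handle.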

Provided tools & services

libucto

Ucto Library with API for C++
Type
  • Software Library
Executable name
libucto

ucto

Command-line interface to the tokenizer
Type
  • Command-line Application
Executable name
ucto
Input data
Type                 Encoding Format        Language
TextDigitalDocument  text/plain             Dutch
TextDigitalDocument  application/folia+xml  Dutch
TextDigitalDocument  text/plain             English
TextDigitalDocument  application/folia.xml  English
TextDigitalDocument  text/plain             French
TextDigitalDocument  application/folia+xml  French
TextDigitalDocument  text/plain             Frisian
TextDigitalDocument  application/folia.xml  Frisian
TextDigitalDocument  text/plain             German
TextDigitalDocument  application/folia.xml  German
TextDigitalDocument  text/plain             Italian
TextDigitalDocument  application/folia.xml  Italian
TextDigitalDocument  text/plain             Portuguese
TextDigitalDocument  application/folia.xml  Portuguese
TextDigitalDocument  text/plain             Russian
TextDigitalDocument  application/folia.xml  Russian
TextDigitalDocument  text/plain             Spanish
TextDigitalDocument  application/folia+xml  Spanish
TextDigitalDocument  text/plain             Swedish
TextDigitalDocument  application/folia.xml  Swedish
TextDigitalDocument  text/plain             Turkish
TextDigitalDocument  application/folia.xml  Turkish
Output data
Type                 Encoding Format        Language
TextDigitalDocument  text/plain             Dutch
TextDigitalDocument  application/folia+xml  Dutch
TextDigitalDocument  text/plain             English
TextDigitalDocument  application/folia.xml  English
TextDigitalDocument  text/plain             French
TextDigitalDocument  application/folia+xml  French
TextDigitalDocument  text/plain             Frisian
TextDigitalDocument  application/folia.xml  Frisian
TextDigitalDocument  text/plain             German
TextDigitalDocument  application/folia.xml  German
TextDigitalDocument  text/plain             Italian
TextDigitalDocument  application/folia.xml  Italian
TextDigitalDocument  text/plain             Portuguese
TextDigitalDocument  application/folia.xml  Portuguese
TextDigitalDocument  text/plain             Russian
TextDigitalDocument  application/folia.xml  Russian
TextDigitalDocument  text/plain             Spanish
TextDigitalDocument  application/folia+xml  Spanish
TextDigitalDocument  text/plain             Swedish
TextDigitalDocument  application/folia.xml  Swedish
TextDigitalDocument  text/plain             Turkish
TextDigitalDocument  application/folia.xml  Turkish
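
The plain-text inputs and outputs listed above correspond to running the command-line tool on a file in a given language. The sketch below only assembles such a command line in Python; it is not taken from this page. The flags `-L` (language code), `-l` (lowercase) and `-n` (one sentence per line) are assumptions written from memory of Ucto's command-line help and should be checked against `ucto -h` for the installed version. The helper `build_ucto_command` and the file names are hypothetical.

```python
import os
import shutil
import subprocess


def build_ucto_command(language, infile, outfile,
                       lowercase=False, sentence_per_line=False):
    """Assemble an argument list for the ucto command-line tool.

    Flag names are assumptions; verify them with `ucto -h`.
    """
    cmd = ["ucto", "-L", language]   # assumed: -L selects the language
    if lowercase:
        cmd.append("-l")             # assumed: -l converts to lowercase
    if sentence_per_line:
        cmd.append("-n")             # assumed: -n puts each sentence on its own line
    cmd += [infile, outfile]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    cmd = build_ucto_command("eng", "input.txt", "output.txt",
                             sentence_per_line=True)
    print(" ".join(cmd))
    # Run only if ucto is installed and the input file actually exists.
    if shutil.which("ucto") and os.path.exists("input.txt"):
        subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the example usable on machines where Ucto is not installed.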

Tool suite: Ucto

The following closely related tools are in a tool suite together with ucto:

  • Web Application
  • 8 - Complete: Technology complete and qualified, released for all end-users in scholarly environments.
  • Active: The project has reached a stable, usable state and is being actively developed.

Ucto-Webservice 2.5.2

  •   KNAW Humanities Cluster & CLST, Radboud University
Ucto is a rule-based tokeniser for multiple languages. This is the webservice for it, for both humans and machines.
  • Annotating
  • Linguistics
  • Tagging
  • Textual and content analysis
  • clam webservice rest nlp computational_linguistics rest
  • BSD
  • Linux
  • macOS
  • Python
Created: 2022-04-08
Modified: 2024-03-14
  • 8 - Complete: Technology complete and qualified, released for all end-users in scholarly environments.
  • Active: The project has reached a stable, usable state and is being actively developed.

python-ucto 0.6.8

  •   KNAW Humanities Cluster & CLST, Radboud University
This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression based, extensible, and advanced tokeniser written in C++ (https://languagemachines.github.io/ucto).
  • Text Processing > Linguistic
  • tokenizer tokenization tokeniser tokenisation nlp computational_linguistics ucto
  • BSD
  • Cython
  • Linux
  • macOS
  • Python
Created: 2014-05-21
Modified: 2024-09-12

Citation

You can cite this software using the following citation generated from its metadata:

Logs & Reviews

Name
Automatic software metadata validation report for ucto 0.34
Author
  • codemetapy validator using software.ttl
Date
2024-10-12 03:20:20
Review
Please consult the CLARIAH Software Metadata Requirements at https://github.com/CLARIAH/clariah-plus/blob/main/requirements/software-metadata-requirements.md for an in-depth explanation of any found problems

Validation of ucto 0.34 was successful (score=4/5), but there are some remarks which you may or may not want to address:

1. Info: Reference publications *SHOULD* be expressed, if any (This is missing in the metadata)
Rating
★ ★ ★ ★ ☆
(log file starts at Sat Oct 12 03:20:15 UTC 2024)

[harvester info] --> Processing ucto (https://github.com/LanguageMachines/ucto) [Sat Oct 12 03:20:15 UTC 2024]

[harvester info] Git updating cached clone of https://github.com/LanguageMachines/ucto...

[harvester info] Found release v0.34

[harvester info] Using 'v0.34'

[harvester info] Git reference: v0.34

[harvester info] Scanning directory /tmp/codemeta-harvester.cache/ucto for harvestable resources...

[harvester info] found codemeta.json for ucto (md5sum bf921474f922e7944914f5fc3ad6db2d); **NOTE: this is considered authoritative and most other detection methods will be skipped now!**

[harvester info] Inferring repostatus information from git activity (used only as a fallback if not explicitly provided)...

[harvester info] Inferred repostatus https://www.repostatus.org/#active

[harvester info] Looking for repostatus information in README.md in master branch...

[harvester info] Found repostatus (master branch) https://www.repostatus.org/#active

[harvester info] Looking for repostatus information in README in master branch...

[harvester info] Parsing MAINTAINERS from master branch...

[harvester info] Setting group Ucto

[harvester info] Reconciliating: codemetapy  --baseuri https://tools.clariah.nl --baseuri https://tools.clariah.nl --includecontext --addcontext https://w3id.org/nwo-research-fields --addcontext https://w3id.org/research-technology-readiness-levels --addcontextgraph https://vocabs.dariah.eu/rest/v1/tadirah/data?format=text/turtle --trl --identifier "ucto" --codeRepository "https://github.com/LanguageMachines/ucto" --validate /etc/software.ttl --released --enrich --textv "Please consult the CLARIAH Software Metadata Requirements at https://github.com/CLARIAH/clariah-plus/blob/main/requirements/software-metadata-requirements.md for an in-depth explanation of any found problems" -O /tmp/out/ucto.codemeta.json /tmp/codemeta-harvester.cache//tmp/99-repostatus.ucto.codemeta.json /tmp/codemeta-harvester.cache//tmp/10-jsonld.ucto.codemeta.json /tmp/codemeta-harvester.cache//tmp/05-repostatus.ucto.codemeta.json /tmp/codemeta-harvester.cache//tmp/05-maintainers.ucto.codemeta.json /tmp/codemeta-harvester.cache//tmp/04-applicationSuite.ucto.codemeta.json 

-- begin log --

Passed 5 files/sources but specified 0 input types! Automatically guessing types...

Detected input types: [('/tmp/codemeta-harvester.cache//tmp/99-repostatus.ucto.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/10-jsonld.ucto.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/05-repostatus.ucto.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/05-maintainers.ucto.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/04-applicationSuite.ucto.codemeta.json', 'json')]

Adding to contextgraph: /tmp/turtle

Initial URI automatically generated, may be overriden later: https://tools.clariah.nl/ucto

Processing source #1 of 5

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/99-repostatus.ucto.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.clariah.nl/ucto

[CODEMETA COMPOSITION (https://tools.clariah.nl/ucto)] processed 1 new triples, total is now 2

Processing source #2 of 5

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/10-jsonld.ucto.codemeta.json

    Injected (possibly temporary) URI https://tools.clariah.nl/ucto

[CODEMETA CORRECTION (ucto)] automatically converting spdx license URI from https:// to http:///

[CODEMETA COMPOSITION (ucto)] processed 268 new triples, total is now 268

Processing source #3 of 5

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/05-repostatus.ucto.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.clariah.nl/ucto

[CODEMETA COMPOSITION (ucto)] processed 1 new triples, total is now 268

Processing source #4 of 5

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/05-maintainers.ucto.codemeta.json

    Found main resource with URI https://tools.clariah.nl/maintainers/snapshot

    Injected (possibly temporary) URI https://tools.clariah.nl/ucto

[CODEMETA COMPOSITION (ucto)] processed 14 new triples, total is now 281

Processing source #5 of 5

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/04-applicationSuite.ucto.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.clariah.nl/ucto

[CODEMETA COMPOSITION (ucto)] processed 1 new triples, total is now 282

Remapping URI to (possibly) new identifier and version component: https://tools.clariah.nl/ucto -> https://tools.clariah.nl/ucto/0.34

[CODEMETA VALIDATION (ucto)] done

[CODEMETA ENRICHMENT (ucto)] adding author https://orcid.org/0000-0002-1046-0006 as contributor

[CODEMETA ENRICHMENT (ucto)] adding author https://tools.clariah.nl/stub/H5f7a1517b194ab6b as contributor

VALIDATION https://tools.clariah.nl/ucto/0.34 #1: Info: Reference publications *SHOULD* be expressed, if any (This is missing in the metadata)

-- end log --

[harvester info] Output written to /tmp/out/ucto.codemeta.json

[harvester info] <-- Finished processing ucto (https://github.com/LanguageMachines/ucto) [Sat Oct 12 03:20:21 UTC 2024]

Metadata Properties

Version
0.34 (release notes)
Interface types
  • Command-line Application
  • Software Library
Software website
Source code repository
https://github.com/LanguageMachines/ucto
Category
  • Annotating
  • Linguistics
  • Tagging
  • Textual and content analysis
Keywords
  • natural language processing
  • nlp
  • tokenization
  • tokenizer
Development Status
  • 9 - Proven: Technology complete and proven in practice by real users.
  • Active: The project has reached a stable, usable state and is being actively developed.
Issue Tracker (Support)
https://github.com/LanguageMachines/ucto/issues
Documentation
License
Author(s)
Maintainer(s)
Contributor(s)
Producer
Programming Language
  • C++
Continuous Integration Tests
https://github.com/LanguageMachines/ucto/actions/workflows/ucto.yml
Operating System
  • BSD
  • Linux
  • macOS
Software dependencies
  • libfolia
  • libxml2
  • ticcutils
  • icu
Metadata validation
★ ★ ★ ★ ☆
Created
2011-03-27
Last modified
2023-02-22 12:17:06 +0100 (last commit on the main branch)