Colibri Core

Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, of either fixed or dynamic size) in a quick and memory-efficient way.
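
To make the notion of a skipgram concrete, the following plain-Python sketch derives fixed-size skipgrams from a single n-gram by replacing interior tokens with a gap marker. It does not use the Colibri Core library; the function name `skipgrams` and the `{*}` marker are choices made for this illustration only.

```python
from itertools import combinations

GAP = "{*}"  # illustrative gap marker, one per skipped token


def skipgrams(ngram):
    """Yield each variant of ngram with one or more interior tokens gapped.

    The first and last token are always kept, so a pattern never starts
    or ends with a gap.
    """
    interior = range(1, len(ngram) - 1)
    for k in range(1, len(interior) + 1):
        for masked in combinations(interior, k):
            yield tuple(GAP if i in masked else tok
                        for i, tok in enumerate(ngram))


fourgram = ("to", "be", "or", "not")
variants = list(skipgrams(fourgram))
for pattern in variants:
    print(" ".join(pattern))
```

A 4-gram has two interior positions, so this yields three skipgrams: two with a single gap and one with both interior tokens gapped.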

Provided tools & services

colibri-classdecode

Decodes a binary encoded corpus and a class file to a plain text corpus
Type
  • Command-line Application
Executable name
colibri-classdecode

colibri-classencode

Encodes a plain text corpus to a binary encoded corpus and a class file
Type
  • Command-line Application
Executable name
colibri-classencode

colibri-cooc

Computes co-occurrence statistics (absolute co-occurrence or pointwise mutual information) between patterns in a corpus
Type
  • Command-line Application
Executable name
colibri-cooc
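
The two statistics named for colibri-cooc can be illustrated with a small plain-Python sketch. It assumes that two patterns co-occur when they appear in the same sentence, restricts patterns to single words, and estimates probabilities as fractions of sentences; the tool's own counting unit and normalisation may differ, and `cooccurrence_pmi` is a hypothetical helper, not part of Colibri Core.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(sentences):
    """Return (joint counts, PMI scores) for word pairs sharing a sentence."""
    n = len(sentences)
    single = Counter()  # number of sentences containing each word
    joint = Counter()   # number of sentences containing both words of a pair
    for sentence in sentences:
        types = sorted(set(sentence))  # count each word once per sentence
        single.update(types)
        joint.update(combinations(types, 2))
    pmi = {}
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = single[a] / n
        p_b = single[b] / n
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return joint, pmi


sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["a", "dog", "ran"],
]
joint, pmi = cooccurrence_pmi(sentences)
for pair in sorted(pmi, key=pmi.get, reverse=True)[:3]:
    print(pair, joint[pair], round(pmi[pair], 3))
```

Pairs are stored in alphabetical order, so `("cat", "the")` and `("the", "cat")` refer to the same entry. A PMI of zero means the pair co-occurs exactly as often as independence would predict.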

colibri-coverage

Computes the coverage of a training/background corpus on a particular test/foreground corpus, i.e. how many of the patterns in the test corpus were found during training, how many tokens are covered, and how this is distributed. This is a high-level convenience script over underlying tools.
Type
  • Command-line Application
Executable name
colibri-coverage
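
As a rough illustration of the two coverage figures described for colibri-coverage, the sketch below computes them in plain Python for n-grams up to a maximum length. It assumes that a test token counts as covered when it falls inside at least one n-gram that also occurs in the training corpus; this definition and the helper names `ngram_set` and `coverage` are assumptions for the example, not the tool's documented behaviour.

```python
def ngram_set(tokens, max_n):
    """Collect all distinct n-grams of length 1..max_n as tuples."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def coverage(train_tokens, test_tokens, max_n=2):
    """Return (pattern coverage, token coverage) of train on test."""
    known = ngram_set(train_tokens, max_n)
    test_patterns = ngram_set(test_tokens, max_n)
    found = test_patterns & known

    # Mark every test position that lies inside a known n-gram occurrence.
    covered = [False] * len(test_tokens)
    for n in range(1, max_n + 1):
        for i in range(len(test_tokens) - n + 1):
            if tuple(test_tokens[i:i + n]) in known:
                for j in range(i, i + n):
                    covered[j] = True

    pattern_cov = len(found) / len(test_patterns) if test_patterns else 0.0
    token_cov = sum(covered) / len(test_tokens) if test_tokens else 0.0
    return pattern_cov, token_cov


train = "the cat sat on the mat".split()
test = "the dog sat on the rug".split()
pattern_cov, token_cov = coverage(train, test, max_n=2)
print(f"pattern coverage: {pattern_cov:.2f}, token coverage: {token_cov:.2f}")
```

In this toy case 5 of the 10 distinct test patterns were seen in training, and 4 of the 6 test tokens are covered ("dog" and "rug" are not).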

colibri-findpatterns

Finds patterns in corpus data based on a pre-supplied list of patterns (one per line). This is a high-level convenience script over underlying tools.
Type
  • Command-line Application
Executable name
colibri-findpatterns

colibri-freqlist

Extract n-grams (and optionally skipgrams) with their counts from one or more plain-text corpus files. This is a high-level convenience script over underlying tools.
Type
  • Command-line Application
Executable name
colibri-freqlist

colibri-histogram

Computes a histogram of n-gram occurrences (and optionally skipgrams) in the corpus. This is a high-level convenience script over underlying tools.
Type
  • Command-line Application
Executable name
colibri-histogram

colibri-loglikelihood

Compares the frequency of patterns between two or more plain-text corpus files by computing log likelihood, following the methodology of Rayson and Garside (2000), "Comparing corpora using frequency profiling", in Proceedings of the Workshop on Comparing Corpora, held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics (ACL 2000), 1-8 October 2000, Hong Kong, pp. 1-6 (http://www.comp.lancs.ac.uk/~paul/publications/rg_acl2000.pdf). This is a high-level convenience script over underlying tools.
Type
  • Command-line Application
Executable name
colibri-loglikelihood
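
The log-likelihood statistic of Rayson and Garside (2000) can be sketched in a few lines of plain Python. For a pattern with observed frequency O_i in corpus i of size N_i, the expected frequency is E_i = N_i * (sum of O) / (sum of N), and LL = 2 * sum of O_i * ln(O_i / E_i). The function below is an illustrative implementation of that formula only; it is not taken from Colibri Core, and `log_likelihood` is a name chosen for this example.

```python
import math


def log_likelihood(freqs, sizes):
    """Log-likelihood of one pattern across two or more corpora.

    freqs: observed frequency of the pattern in each corpus
    sizes: total number of tokens in each corpus
    """
    total_freq = sum(freqs)
    total_size = sum(sizes)
    if total_size <= 0:
        raise ValueError("corpus sizes must sum to a positive number")
    ll = 0.0
    for observed, size in zip(freqs, sizes):
        expected = size * total_freq / total_size
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Same relative frequency in both corpora: no evidence of a difference.
print(log_likelihood([10, 20], [1000, 2000]))
# Four times as frequent in the first of two equally sized corpora.
print(log_likelihood([20, 5], [1000, 1000]))
```

Higher values indicate a larger difference in relative frequency between the corpora; the statistic itself does not say in which corpus the pattern is over-represented, so the relative frequencies are normally reported alongside it.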

colibri-ngrams

Extract n-grams of a particular size by moving a sliding window over the corpus. This is a high-level convenience script over underlying tools.
Type
  • Command-line Application
Executable name
colibri-ngrams
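
The sliding-window idea behind n-gram extraction can be shown with a minimal plain-Python sketch. It works on an in-memory token list rather than on Colibri Core's binary corpus encoding, and the function name `ngrams` is chosen for this illustration.

```python
from typing import Iterator, List, Tuple


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every contiguous window of n tokens, moving left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


tokens = "to be or not to be".split()
bigrams = list(ngrams(tokens, 2))
for gram in bigrams:
    print(" ".join(gram))
```

A sequence of k tokens yields k - n + 1 windows, so the six tokens above give five bigrams, with "to be" appearing twice.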

colibri-ngramstats

Computes a summary report on the count of n-grams (and optionally skipgrams) in the corpus. This is a high-level convenience script over underlying tools.
Type
  • Command-line Application
Executable name
colibri-ngramstats

colibri-patternmodeller

Extract, model and compare recurring patterns (n-grams, skipgrams, flexgrams) and their frequencies in text corpus data. This is the main tool of Colibri Core.
Type
  • Command-line Application
Executable name
colibri-patternmodeller

colibri-queryngrams

Interactive command-line tool to query n-grams with their counts from one or more plain-text corpus files. This is a high-level convenience script over underlying tools.
Type
  • Command-line Application
Executable name
colibri-queryngrams

colibri-reverseindex

Computes and prints the reverse index of the corpus: for each token position in the corpus, all patterns that start at that position are shown. This is a high-level convenience script over underlying tools.
Type
  • Command-line Application
Executable name
colibri-reverseindex
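
A reverse index in the sense described for colibri-reverseindex maps each token position to the patterns that begin there. The plain-Python sketch below builds such a mapping for a given set of n-gram patterns; the data layout and the function name `reverse_index` are assumptions for illustration and do not reflect Colibri Core's internal representation.

```python
def reverse_index(tokens, patterns, max_n):
    """Map each token position to the known patterns starting at it."""
    index = {}
    for i in range(len(tokens)):
        starting_here = []
        for n in range(1, max_n + 1):
            candidate = tuple(tokens[i:i + n])
            # Near the end of the corpus the slice may be shorter than n.
            if len(candidate) == n and candidate in patterns:
                starting_here.append(candidate)
        index[i] = starting_here
    return index


tokens = "to be or not to be".split()
patterns = {("to",), ("be",), ("to", "be"), ("or", "not", "to")}
index = reverse_index(tokens, patterns, max_n=3)
for position, found in index.items():
    print(position, [" ".join(p) for p in found])
```

Positions where no known pattern starts (here position 3, "not") map to an empty list.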

References

Citation

Please use one of the above reference publications to cite the software. If you want to cite the software directly, you can use the following citation generated from the metadata:

Logs & Reviews

Name
Automatic software metadata validation report for Colibri Core 2.5.9
Author
  • codemetapy validator using software.ttl
Date
2024-10-12 03:03:53
Review
Please consult the CLARIAH Software Metadata Requirements at https://github.com/CLARIAH/clariah-plus/blob/main/requirements/software-metadata-requirements.md for an in-depth explanation of any found problems

Validation of Colibri Core 2.5.9 was successful (score=3/5), but there are some warnings which should be addressed:

1. Warning: Documentation *SHOULD* be expressed (The metadata does express this currently, but something is wrong in the way it is expressed. Is the type/class valid?)
2. Info: The funder *SHOULD* be acknowledged (This is missing in the metadata)
3. Info: The technology readiness level *SHOULD* be expressed (This is missing in the metadata)
Rating
★ ★ ★ ☆ ☆
(log file starts at Sat Oct 12 03:03:52 UTC 2024)

[harvester info] --> Processing colibri-core (https://github.com/proycon/colibri-core) [Sat Oct 12 03:03:52 UTC 2024]

[harvester info] Git updating cached clone of https://github.com/proycon/colibri-core...

[harvester info] Found release v2.5.9

[harvester info] Using 'v2.5.9'

[harvester info] Git reference: v2.5.9

[harvester info] Scanning directory /tmp/codemeta-harvester.cache/colibri-core for harvestable resources...

[harvester info] found codemeta.json for colibri-core (md5sum 7f83585f08e66696689283ac15123c69); **NOTE: this is considered authoritative and most other detection methods will be skipped now!**

[harvester info] Inferring repostatus information from git activity (used only as a fallback if not explicitly provided)...

[harvester info] Inferred repostatus https://www.repostatus.org/#inactive

[harvester info] Looking for repostatus information in README.md in master branch...

[harvester info] Found repostatus (master branch) https://www.repostatus.org/#active

[harvester info] Reconciliating: codemetapy  --baseuri https://tools.clariah.nl --baseuri https://tools.clariah.nl --includecontext --addcontext https://w3id.org/nwo-research-fields --addcontext https://w3id.org/research-technology-readiness-levels --addcontextgraph https://vocabs.dariah.eu/rest/v1/tadirah/data?format=text/turtle --trl --identifier "colibri-core" --codeRepository "https://github.com/proycon/colibri-core" --validate /etc/software.ttl --released --enrich --textv "Please consult the CLARIAH Software Metadata Requirements at https://github.com/CLARIAH/clariah-plus/blob/main/requirements/software-metadata-requirements.md for an in-depth explanation of any found problems" -O /tmp/out/colibri-core.codemeta.json /tmp/codemeta-harvester.cache//tmp/99-repostatus.colibri-core.codemeta.json /tmp/codemeta-harvester.cache//tmp/10-jsonld.colibri-core.codemeta.json /tmp/codemeta-harvester.cache//tmp/05-repostatus.colibri-core.codemeta.json 

-- begin log --

Passed 3 files/sources but specified 0 input types! Automatically guessing types...

Detected input types: [('/tmp/codemeta-harvester.cache//tmp/99-repostatus.colibri-core.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/10-jsonld.colibri-core.codemeta.json', 'json'), ('/tmp/codemeta-harvester.cache//tmp/05-repostatus.colibri-core.codemeta.json', 'json')]

Adding to contextgraph: /tmp/turtle

Initial URI automatically generated, may be overriden later: https://tools.clariah.nl/colibri-core

Processing source #1 of 3

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/99-repostatus.colibri-core.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.clariah.nl/colibri-core

[CODEMETA COMPOSITION (https://tools.clariah.nl/colibri-core)] processed 1 new triples, total is now 2

Processing source #2 of 3

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/10-jsonld.colibri-core.codemeta.json

    Injected (possibly temporary) URI https://tools.clariah.nl/colibri-core

[CODEMETA COMPOSITION (colibricore)] overriding old https://codemeta.github.io/terms/developmentStatus (https://www.repostatus.org/#inactive -> https://www.repostatus.org/#active)

[CODEMETA CORRECTION (colibricore)] automatically converting spdx license URI from https:// to http:///

[CODEMETA COMPOSITION (colibricore)] processed 129 new triples, total is now 129

Processing source #3 of 3

Parsing json-ld file from /tmp/codemeta-harvester.cache//tmp/05-repostatus.colibri-core.codemeta.json

    NOTE: Not a valid JSON-LD document, @context missing! Attempting to inject automatically...

    Injected (possibly temporary) URI https://tools.clariah.nl/colibri-core

[CODEMETA COMPOSITION (colibricore)] processed 1 new triples, total is now 129

Remapping URI to (possibly) new identifier and version component: https://tools.clariah.nl/colibri-core -> https://tools.clariah.nl/colibri-core/2.5.9

[CODEMETA VALIDATION (colibri-core)] done

[CODEMETA ENRICHMENT (colibri-core)] adding author https://orcid.org/0000-0002-1046-0006 as contributor

[CODEMETA ENRICHMENT (colibri-core)] considering first author as maintainer

[CODEMETA ENRICHMENT (colibri-core)] adding affiliation(s) of first author as producer

VALIDATION https://tools.clariah.nl/colibri-core/2.5.9 #1: Warning: Documentation *SHOULD* be expressed (The metadata does express this currently, but something is wrong in the way it is expressed. Is the type/class valid?)

VALIDATION https://tools.clariah.nl/colibri-core/2.5.9 #2: Info: The funder *SHOULD* be acknowledged (This is missing in the metadata)

VALIDATION https://tools.clariah.nl/colibri-core/2.5.9 #3: Info: The technology readiness level *SHOULD* be expressed (This is missing in the metadata)

-- end log --

[harvester info] Output written to /tmp/out/colibri-core.codemeta.json

[harvester info] <-- Finished processing colibri-core (https://github.com/proycon/colibri-core) [Sat Oct 12 03:03:54 UTC 2024]

        

Metadata Properties

Version
2.5.9 (release notes)
Interface types
  • Command-line Application
Software website
Source code repository
https://github.com/proycon/colibri-core
Keywords
  • language modelling
  • natural language processing
  • ngrams
  • nlp
  • pattern recognition
  • skipgrams
Development Status
  • Active: The project has reached a stable, usable state and is being actively developed.
Issue Tracker (Support)
https://github.com/proycon/colibri-core/issues
Documentation
License
Author(s)
Maintainer(s)
Contributor(s)
Producer
Programming Language
  • C++
  • Cython
Continuous Integration Tests
https://travis-ci.org/proycon/colibri-core
Operating System
  • BSD
  • Linux
  • macOS
Metadata validation
★ ★ ★ ☆ ☆
Created
2013-09-15