
Coptic NLP pipeline Part 2

With the creation of the Coptic NLP (Natural Language Processing) pipeline by Amir Zeldes, it is now possible to run all our NLP tools simultaneously without downloading and running them individually. The web application tokenizes bound groups into words and normalizes the spelling of words and diacritics. It also tags part of speech, lemmatizes, and tags the language of origin of borrowed (foreign) words. The interface is XML tolerant (it preserves tags in the input), and the output is tagged in SGML. One of the options encodes the line breaks within a word or sentence, which is useful for encoding manuscripts. However, remember to double-check results, because the interface is still in beta.

As an example, the screenshot below is a snippet from I See Your Eagerness from manuscript MONB.GL29.



Notice that it contains an XML tag encoding a letter as “large ekthetic”. “Large ekthetic” designates the alpha as a large character set in the left margin of the manuscript’s column of text. This tag will be preserved in the output.


The results are shown above. Bound groups are displayed along with their part-of-speech tags, abbreviated as “pos”. The snippet from I See Your Eagerness has also been lemmatized, shown as “lemma”. Near the bottom of the screenshot, the language of origin of the borrowed (foreign) words in the snippet has been identified as “Greek”. These tags correspond to the annotation layers you see in our multi-layer search and visualization tool, ANNIS.
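To give a rough idea of how downstream software might pick up annotations like “pos”, “lemma”, and the language tag, here is a minimal sketch that pulls attribute/value pairs out of SGML-style open tags. The tag names, attribute names, and the romanized placeholder values are illustrative assumptions, not the service's exact output format.

```python
import re

# Hypothetical SGML-style output lines; tag/attribute names and the
# romanized values are placeholders for illustration only.
sgml = """<norm norm="swtm">
<pos pos="V">
<lemma lemma="swtm">
<lang lang="Greek">
"""

def extract_attrs(text):
    """Collect attribute name/value pairs from SGML-style open tags."""
    return dict(re.findall(r'<\w+ (\w+)="([^"]*)"', text))

attrs = extract_attrs(sgml)
print(attrs["pos"], attrs["lang"])  # → V Greek
```

A real consumer would of course use a proper parser and the pipeline's documented tag inventory; this only shows the general shape of attribute-tagged output.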

We hope the NLP service serves you well.


New Coptic NLP pipeline

The entire Coptic Natural Language Processing tool chain has been difficult for some users to get running: it involves several command line tools, and special attention must be paid to coordinating word division expectations between the tools (normalization, tagging, language of origin detection). To make this process simpler, we now offer a Web interface that lets you paste in Coptic text and run all the tools on the input automatically, without installing anything. You can find the interface here:


The pipeline is XML tolerant (it preserves tags in the input), and there is also a machine-actionable API version that allows external software to use these resources. Please let the Scriptorium team know if you are using the pipeline and/or run into any problems.
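For external software, a call to such an API would typically be an HTTP POST carrying the Coptic text as a form field. The sketch below only prepares such a request; the endpoint URL and the form-field name are assumptions for illustration, so consult the service's own documentation for the real ones.

```python
from urllib import parse, request

# Hypothetical endpoint and field name; check the Coptic Scriptorium
# site for the actual API address and parameters.
API_URL = "https://example.org/coptic-nlp/api"

def build_request(coptic_text):
    """Prepare (but do not send) a POST request with the input text."""
    data = parse.urlencode({"data": coptic_text}).encode("utf-8")
    return request.Request(API_URL, data=data, method="POST")

req = build_request("ⲁϥⲥⲱⲧⲙ")
# request.urlopen(req) would submit the text and return tagged output.
```

Separating request construction from sending, as above, also makes the client easy to test without network access.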


Happy processing!

Entire Sahidica New Testament now available

The entire Sahidica New Testament (machine-annotated) is now available. It has been tokenized and tagged for part of speech entirely automatically, using our tools. There has been no manual editing or correction. Visit our corpora for more information, or just jump in and search it in ANNIS.


(Originally posted in March 2015 at http://copticscriptorium.org/)

Release of the updated tokenizer

The tokenizer has been updated! Version 3.0 is now on GitHub. It introduces a training data component that learns from our annotators’ most common tokenization and correction practices. The tokenizer breaks Coptic text segmented as bound groups into morphemes for analysis and annotation.
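To illustrate the general idea of splitting a bound group into morphemes, here is a toy greedy longest-match segmenter over a tiny romanized lexicon. Both the lexicon and the input are invented placeholders, and the real tokenizer, which is trained on annotators' decisions, works very differently; this only conveys what "breaking a bound group into morphemes" means.

```python
# Hypothetical romanized morpheme inventory, for illustration only.
LEXICON = {"a", "f", "sotm"}

def segment(bound_group):
    """Split a bound group by greedy longest match against the lexicon."""
    morphs, i = [], 0
    while i < len(bound_group):
        for j in range(len(bound_group), i, -1):
            if bound_group[i:j] in LEXICON:
                morphs.append(bound_group[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as a token of its own.
            morphs.append(bound_group[i])
            i += 1
    return morphs

print(segment("afsotm"))  # → ['a', 'f', 'sotm']
```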

(Originally posted on copticscriptorium.org on 5/22/15.)
