We are pleased to announce release 5.0.0 of Coptic Scriptorium! Our corpora now include over 1,288,229 tokens of searchable, linguistically analyzed text from dozens of ancient Coptic works.
This release also marks the introduction of Bohairic Coptic data to our corpus holdings. The repository now contains Bohairic Bible materials covering Mark 1-16 and 1 Cor. 1-16, with manually reviewed segmentation for the entire corpus and manual tagging and treebanking for chapters 1-5 of each book. Segmentation and tagging were reviewed in collaboration with Nicholas Wagner, and treebanking was done in collaboration with Nina Speranskaja. As a result of this work, we are in the process of compiling new NLP tools and guidelines specifically for Bohairic.
In addition, the release includes corrections and updates to existing corpora as well as the addition of several new Sahidic works and documents:
A. Sections of five works by Shenoute of Atripe:
B. New documents were added to existing works:
C. Newly added translation spans for Pistis Sophia, aligned by Randy Komforty
These join the newly treebanked and tagged Bohairic data, which can be found here:
We are very grateful to all of our collaborators and contributors, without whom this project could not function. We welcome Nicholas Wagner to the team and warmly thank Randy Komforty for his work on Pistis Sophia and Nina Speranskaja for her treebanking work.
As with all our releases, raw machine-readable data for all corpora is available in a variety of popular formats in this GitHub repository, including morphological and syntactic analysis as well as named entity recognition and entity linking (currently only for Sahidic): https://github.com/CopticScriptorium/corpora
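For readers who want to work with the raw data programmatically, here is a minimal Python sketch of reading treebank-style annotations. It assumes that one of the available formats is CoNLL-U, the ten-column tab-separated layout commonly used for dependency treebanks; please check the repository for the formats actually provided. The sample text, the placeholder word forms, and the function names `read_conllu` and `upos_counts` are invented for illustration and are not taken from our corpora or tools.

```python
from collections import Counter

# Hypothetical sample in CoNLL-U layout (10 tab-separated columns per token).
# The forms "w1", "w2", ... are placeholders, not real corpus data.
SAMPLE = (
    "# sent_id = example-1\n"
    "1\tw1\tl1\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tw2\tl2\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = example-2\n"
    "1\tw3\tl3\tVERB\t_\t_\t0\troot\t_\t_\n"
)

# Column names as defined by the CoNLL-U format.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Sentence-level comments (metadata) are skipped in this sketch.
            continue
        else:
            cols = line.split("\t")
            if len(cols) == len(FIELDS):
                current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def upos_counts(sentences):
    """Count part-of-speech tags across all tokens."""
    return Counter(tok["upos"] for sent in sentences for tok in sent)


if __name__ == "__main__":
    sents = read_conllu(SAMPLE)
    print(len(sents), "sentences;", dict(upos_counts(sents)))
```

To try this on real data, replace `SAMPLE` with the contents of a file from a local copy of the repository; the file layout there is not described in this post, so paths are left to the reader.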
You can also search for complex linguistic annotations in the data using our ANNIS server – please see our tutorial here to get started with some query tips and a helpful cheat sheet: https://copticscriptorium.org/ANNIS_tutorial