corpora – Coptic SCRIPTORIUM Blog

Tag: corpora (page 1 of 5)

Summer 2026 release

Please join us in celebrating the release of the latest version 6.3.0 of the Coptic Scriptorium corpora! We now have a total of over 2,380,000 words. This release provides new and updated data:

New releases:
- Bohairic Book of Jonah (manually annotated)
- Life of John Chrysostom
- Life of Hilaria
- Life of Marina
- Literary Fragments
- Pseudo-Cyril of Alexandria
Updated and expanded data and translations:
- Apophthegmata Patrum
- John of Constantinople Discourse
- Life of Phib
- Proclus Homilies
- Pseudo-Chrysostom
- 1 Corinthians (manually annotated Sahidica data)
- Works by Shenoute of Atripe:
  - Abraham our Father
  - In the Night
  - God Says Through Those Who Are His

As always, we are grateful to our collaborators and contributors who have made this release possible, particularly Caroline T. Schroeder, Amir Zeldes, Nicholas Wagner, and Paul Dilley as well as Nina Speranskaja, Rebecca Krawiec, Christine Luckritz Marquis, Stephan Claassen, Philippe Zaher, Randy Komforty, and Safaa Mahfouz. We also want to thank Hany Takla and the St. Shenouda the Archimandrite Coptic Society for their collaborations and support. Additionally, we thank our donors for contributions that made much of the work on this release possible. Please consider supporting Coptic Scriptorium as we navigate the new funding environment in the USA.

The raw machine-readable data for all corpora—including morphological and syntactic annotations, as well as named entity recognition—are available as usual in our GitHub repository. Data can be downloaded in a variety of popular formats to suit your research needs.

You can read and browse entire documents in an online portal. Our corpora are also linked in entries on the Coptic Dictionary Online. For searching, including advanced linguistic queries, you can explore the data using our ANNIS server. To help you get started, check out our tutorial with query tips and a convenient cheat sheet. Currently, the Arabic translations are available only in ANNIS.

Latest Release of Coptic Corpora

December 12, 2025 / ctschroeder / 0 Comments

We are pleased to announce the release of version 6.2.0 of the Coptic Scriptorium data! Our corpora now total 2,375,875 words. This release provides significant new annotated data in both the Bohairic and Sahidic dialects:

New parts of works in Bohairic, including:
- The Life of Shenoute, Parts 2 & 3 (part 1 was released in September)
- The Lausiac History, Parts 2 & 3 (part 1 was released in September)
New corpora:
- The Gospel of Thomas, edited from the manuscript by Paul Dilley
- The Sahidic book of Jonah (with manual edits and corrections to NLP annotations by Stephan Claassen; the automatically processed Jonah is in the Coptic OT corpus)
New documents in the following existing corpora:
- Apophthegmata Patrum
- Shenoute’s work known as Acephalous Work 22
More Arabic and English translations for documents previously published

We are grateful to our collaborators and contributors who have made this release possible, particularly Caroline T. Schroeder, Amir Zeldes, Nicholas Wagner, and Paul Dilley as well as Nina Speranskaja, Rebecca Krawiec, Christine Luckritz Marquis, Stephan Claassen, Philippe Zaher, and Safaa Mahfouz. We also want to thank Hany Takla and the St. Shenouda the Archimandrite Coptic Society for their collaborations and support. Additionally, we thank our donors for contributions that made much of the work on this release possible. Please consider supporting Coptic Scriptorium as we navigate the new funding environment in the USA.

As with all our releases, the raw machine-readable data for all corpora—including morphological and syntactic annotations, as well as named entity recognition—are available in our GitHub repository. Data can be downloaded in a variety of popular formats to suit your research needs.

You can read and browse entire documents in an online portal. Our corpora are also linked in entries on the Coptic Dictionary Online.

For searching, including advanced linguistic queries, you can explore the data using our ANNIS server. To help you get started, check out our tutorial with query tips and a convenient cheat sheet. Currently, the Arabic translations are available only in ANNIS.

New Grant from the St Shenouda Society

February 27, 2025 / ctschroeder / 0 Comments

The Coptic Scriptorium team is deeply grateful to the St Shenouda the Archimandrite Coptic Society for its recent grant to our project. This generous gift was made possible by donations from many members of the Society. We especially thank Hany Takla, President of the Society, for his ongoing leadership and collaboration with our project.

The funds will go to digitizing and annotating more Bohairic Coptic literature. Team member Dr. Nicholas Wagner will be able to devote more hours to Bohairic.

Thank you again to the St. Shenouda Society!

New Corpora Release 6.0.0

December 6, 2024 / Nick Wagner / 0 Comments

Searching for Greek loanwords in Bohairic Habakkuk

We are pleased to announce the release of version 6.0.0 of Coptic Scriptorium! Our corpus has been dramatically expanded in this release, now exceeding 2.2 million tokens of searchable, linguistically annotated Coptic texts. Among the highlights of this update is the exponential growth of our Bohairic corpus, now comprising approximately 750,000 words and featuring translated texts such as the Bohairic Bible (Old and New Testament), as well as original works such as the Life of Isaac. This milestone brings substantial enhancements to our collections, including modern editions processed with Optical Character Recognition (OCR) technology alongside both new and updated Coptic texts.

New OCR Material and Automatic Tagging

This release includes the addition of OCR-based editions. For the first time, fully automated tagging has been applied to a selection of OCR datasets:

Version 6.0.0 also includes several newly curated corpora, reflecting a diversity of dialects, genres, and textual traditions:

More selections now with parallel Arabic translations:
- Apophthegmata Patrum (AP)
Pseudo-Theophilus:
- On Repentance and Continence
Mercurius:
- Martyrdom
- Miracles Part 1 and Part 2
- Encomium
Additions to Shenoute of Atripe’s Acephalous Work 22 (A22):
- YB 83-96
Bohairic texts:
- Old Testament (automatic processing)
- New Testament (automatic processing)
- Life of Isaac (with manual corrections)
Bohairic Bible selection manually segmented and tagged:

Collaborative Efforts and Future Directions

We are grateful to our collaborators and contributors who have made this release possible, particularly Caroline T. Schroeder and Amir Zeldes, as well as Randy Komforty, Lydia Bremer-McCollum, Lawrence Rafferty, Nina Speranskaja, and Nicholas Wagner. We also want to thank the National Endowment for the Humanities for their ongoing support. The integration of OCR materials and the expansion of our Bohairic collection reflect ongoing efforts to enhance accessibility and analytical tools for Coptic studies. These advances also pave the way for further development of NLP tools for our users.

Accessing the Data

For advanced linguistic queries, you can explore the data using our ANNIS server. To help you get started, check out our tutorial with query tips and a convenient cheat sheet.

We invite you to explore this latest release and we look forward to your feedback!

New Corpora Release 4.4.0

September 30, 2022 / Amir Zeldes / 0 Comments

Searching for Greek words in Shenoute’s *So Concerning the Little Place*

We are pleased to announce release 4.4.0 of Coptic Scriptorium! Our data now includes over 1,267,000 tokens of searchable, linguistically analyzed Coptic data from dozens of ancient Coptic works (an increase of almost 100,000 tokens from the previous release). We are very grateful to all of our collaborators and contributors, without whom this project could not function.

This release corrects a large number of consistency errors identified in our existing data, and also adds some new documents:

Sections of three works by Shenoute of Atripe:
New documents added to existing works:
- Acephalous Work 22
- Apophthegmata Patrum
The remaining books 2-4, as well as the postscript of Pistis Sophia, which are now added to the previously released book 1 in our online interfaces
Newly treebanked data with syntactic gold standard annotations for the Life of John the Kalybites, part 1

We would like to thank the Marcion Project for making the underlying digitized text of Pistis Sophia available, and all of the annotators for their hard work. Tamara Siuda, Rebecca Krawiec, Philippe Zaher, and Lance Martin contributed, in addition to Amir and Carrie. As our current DHAG grant ends, we would like to give special thanks to Lance, who has been working as our DH specialist on the project since 2019, for doing an amazing job of keeping track of all the data and the various tasks he’s been in charge of over the past three years!

As with all releases, raw machine-readable data for all corpora, including morphological and syntactic analysis as well as named entity recognition and entity linking, can be found on our GitHub repository in a variety of popular formats:

https://github.com/copticscriptorium/corpora
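As a rough illustration of working with downloaded files, the sketch below reads CoNLL-U, the tab-separated format used by Universal Dependencies treebanks. It assumes CoNLL-U is among the formats you downloaded. The function name `read_conllu` and the three-token English sample are invented for illustration and are not corpus data.

```python
# Column names defined by the CoNLL-U format (ten tab-separated fields).
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}")
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample, NOT taken from the corpora.
rows = [
    ["1", "the", "the", "DET", "_", "_", "2", "det", "_", "_"],
    ["2", "monk", "monk", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
    ["3", "prays", "pray", "VERB", "_", "_", "0", "root", "_", "_"],
]
sample = "# sent_id = demo-1\n" + "\n".join("\t".join(r) for r in rows) + "\n\n"

sentences = read_conllu(sample)
for token in sentences[0]:
    print(token["form"], token["upos"], token["deprel"])
```

To read a real file, pass its contents to `read_conllu` after opening it with `encoding="utf-8"`.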

You can also search for complex linguistic annotations in the data using our ANNIS server – please see our new tutorial here to get started with some query tips and a helpful cheat sheet:

https://copticscriptorium.org/ANNIS_tutorial

We hope this release will be useful and look forward to the next one as always!

New Corpora Release 4.3.0

May 3, 2022 / Amir Zeldes / 0 Comments

The opening lines of Pistis Sophia

It is our pleasure to announce release 4.3.0 of Coptic Scriptorium corpora, which currently cover over 1,175,000 tokens of searchable, linguistically analyzed Coptic data from dozens of ancient Coptic works. New in this release:

The History of Eustathius and Theopiste (hagiography, annotations by Lance Martin)
Pistis Sophia, book 1 (Gnosticism, annotations by Lance Martin, Tamara Siuda, Caroline T. Schroeder and Amir Zeldes)
Life of Pisentius, part 3 (hagiography, annotations by Tamara Siuda, Lance Martin, Caroline T. Schroeder)

Corrections and additional annotations:

Pilot work adding partial Arabic translations (work by Philippe Zaher)
- Apophthegmata Patrum
- Abraham our Father by Shenoute
Improvements and error corrections to a variety of works (including Because of You Too O Prince of Evil, Dormition of John, Book of Ruth and Homilies of Proclus)

The newly released material encompasses over 57,000 tokens of semi-automatically annotated data. We would like to give special thanks to the Marcion Project for making much of the underlying digitized text available, and the annotators whose hard work has made this release possible. As with all releases, raw machine-readable data for all corpora, including morphological and syntactic analysis as well as named entity recognition and entity linking, can be found on our GitHub repository in a variety of popular formats:

https://github.com/copticscriptorium/corpora

We hope this release will be useful and look forward to the next one!

New Corpora Release 4.2.0

September 30, 2021 / Amir Zeldes / 0 Comments

**Automatic linguistic analysis and Entity Linking from I Samuel 25**

It is our pleasure to announce the latest data release from Coptic Scriptorium, version 4.2.0. This release contains both new Coptic material and additions to older datasets, and it expands our entity annotations and named-entity linking to all of our data, including the semi-automatically annotated Old Testament. This also means automatic updates to all of our interfaces, such as the recently added example usage functionality in the Coptic Dictionary Online, which is linked to the corpora.

The new material, including more digitized data courtesy of the Marcion project as well as manually digitized and corrected OCR data from out-of-print editions, includes:

Encomium of Pseudo-Celestinus on Victor (annotations by Mitchell Abrams and Lance Martin)
Encomium of Pseudo-Flavianus on Demetrius, Archbishop of Alexandria (annotations by Mitchell Abrams, Lance Martin and Amir Zeldes)
Added works by Shenoute of Atripe:
- In the Night (Canons 9, annotations by Lance Martin, Caroline T. Schroeder and Amir Zeldes)
- Because of You Too O Prince of Evil (Discourses 4, annotations by Tamara Siuda, Lance Martin and Caroline T. Schroeder)
Expansions and improvements of existing corpora:
- More Apophthegmata Patrum (work by Christine Luckritz Marquis, So Miyagawa, Caroline T. Schroeder and Amir Zeldes)
- Further material from Shenoute’s works:
  - God Says Through Those Who Are His (including parallel witnesses and new material, data courtesy of David Brakke, annotations by Rebecca Krawiec, Lance Martin, Dana Robinson, Caroline T. Schroeder)
  - Acephalous Work 22 (data courtesy of David Brakke, annotations by Elizabeth Davidson, Rebecca Krawiec, Elizabeth Platte, Caroline T. Schroeder, Amir Zeldes)
- More syntactically annotated gold treebanked data in the Coptic Treebank
- Completely re-annotated Old Testament corpus, based on the base text courtesy of the Digital Edition of the Coptic Old Testament (CoptOT) project – with improved segmentation and parsing, now complete with semi-automatic entity recognition and linking to Wikipedia entries for people and places

With this new release, the semi-automatically annotated data (excluding automatically processed Bible materials) in the project covers close to 300,000 words of Sahidic Coptic annotated for entities.

This release represents a tremendous amount of work over the past few months by the Coptic Scriptorium team. We would also like to thank individual contributors (whom you can always find in the ‘annotation’ metadata for each document), and specifically So Miyagawa for help with Coptic OCR models, as well as the Marcion and CoptOT projects for sharing their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources and new kinds of annotations and tools. Please let us know if you have any feedback!

Winter 2021 Corpora Release 4.1.0

April 2, 2021 / Lance Martin / 0 Comments

Amulet to protect place and animals. O. Crum ST 18 (KYP T344), side-by-side with its rendition on Coptic Scriptorium. Image source.

We are pleased to announce the latest release of data from Coptic Scriptorium, version 4.1.0. The new release adds new Coptic texts and annotation additions, underscored by the application of named and non-named entity annotation to our New Testament corpus.

In total, we released approximately 40,000 tokens of manually edited text in 17 documents from new works, as well as adding material to already existing works. The new material, including more digitized data courtesy of the Marcion project, the Kyprianos Magical Text Database, and other scholars, includes:

Life of John the Kalybites, parts 1 and 2 (annotations by Lance Martin, Tamara Siuda, and Caroline T. Schroeder)
Mysteries of John the Evangelist, parts 1 and 2 (Mitchell Abrams, Lance Martin, Tamara Siuda, Caroline T. Schroeder)
Pseudo-Ephrem, The Asketicon of Apa Ephrem, parts 1 and 2 (Lance Martin and Caroline T. Schroeder)
Pseudo-Timothy of Alexandria Discourses, Discourse on Abbaton, parts 1 and 2 (Elizabeth Davidson, Lance Martin, Caroline T. Schroeder, and Amir Zeldes)

We are especially excited to announce the first release of several magical papyri and an ostracon on the Coptic Scriptorium platform in collaboration with the Kyprianos team at the University of Würzburg:

Magical Texts (Korshi Dosoo, Edward O. D. Love, Markéta Preininger, Lance Martin, Caroline T. Schroeder, and Amir Zeldes)

Expansions and Improvements of existing corpora:

Apa Johannes Canons (Diliana Atanassova, Caroline T. Schroeder, Lance Martin, and Amir Zeldes)
Apophthegmata Patrum (Marina Ghaly, Christine Luckritz Marquis, Caroline T. Schroeder)

We have extended our semi-automatic entity annotation coverage to encompass our New Testament material (over 248,000 tokens). Entity annotations, like our other annotations, were added to these specific corpora automatically and include:

The classification of all non-pronominal references to people, places and other entities into 10 entity categories
Entity linking:
- Linking of all named entities which have corresponding Wikipedia articles to their respective Wikipedia entries, including geo-location information where available

This addition complements the existing named and non-named entity annotations of our entire collections of Coptic corpora.

We would also like to thank individual contributors (which you can always find in the ‘annotation’ metadata for each document), each of whom put in a colossal amount of work, and the Marcion and Kyprianos projects who shared their data with us, as well as the National Endowment for the Humanities for supporting us. We are continuing to create more data and tools. Please let us know if you have any feedback!

Summer 2020 Corpora Release 4.0.0

August 31, 2020 / Amir Zeldes / 0 Comments

Place name index on data.copticscriptorium.org

It is our great pleasure to announce the latest release of data from Coptic Scriptorium, version 4.0.0. This release contains both new Coptic material and extensive additions to our suite of tools and annotations, focusing on the addition of support for entity annotation and named-entity linking across our new and old datasets. The new material, including more digitized data courtesy of the Marcion project and other scholars, includes:

John of Constantinople, on Penitence and Abstinence (annotations by Mitchell Abrams, Lance Martin and Amir Zeldes)
Pseudo-Chrysostom: (Elizabeth Davidson, Mitchell Abrams, Lance Martin, Amir Zeldes)
- On the Canaanite Woman
- On Susanna
Pseudo-Basil of Caesarea, on the End of the World and the Temple of Solomon (Lance and Amir)
Life of Pisentius, parts 1-2 (Lance and Amir)
Expansions and improvements of existing corpora:
- More Apophthegmata Patrum (Hayley Curtis, Elizabeth Davidson, Duncan Feiges, Elizabeth Platte, Caroline T. Schroeder, Amir Zeldes)
- Further material from Shenoute’s works:
  - God Says Through Those Who Are His (including parallel witnesses and new material, data courtesy of David Brakke, annotations by Rebecca Krawiec, Lance Martin, Dana Robinson, Caroline T. Schroeder)
  - Some Kinds of People Sift Dirt (data courtesy of David Brakke, annotations by Christine Luckritz Marquis, Caroline T. Schroeder, Amir Zeldes)

With this new release, the semi-automatically annotated data (excluding automatically processed Bible materials) in the project covers close to 260,000 words of Sahidic Coptic annotated for entities, including 50,000 words of gold-standard treebanked data with manual syntactic analyses.

In addition to new texts, new tools and analyses have been added to the project:

Complete entity annotation, classifying all non-pronominal references to people, places and other entities into 10 entity categories
Entity linking:
- Linking of all named entities which have corresponding Wikipedia articles to their respective Wikipedia entries, including geo-location information where available
- A browseable index of people and places mentioned in the texts, also linked to Wikipedia and Google Maps and including both real and fictional entities
Search and visualization:
- Search by entity type and named entity in ANNIS
- New configurable analytic visualization which displays nested entity types, highlights named entities and links them to Wikipedia
Natural Language Processing
- Automatic entity recognition is now available (work by Amir Zeldes, Lance Martin and Sichang Tu)
- A new neural parser adapted for Coptic with higher accuracy syntactic analyses, which are deployed in ANNIS (work by Luke Gessler)

The new configurable Analytic Visualization with toggleable entity types and links

This release represents a tremendous amount of work over the past few months by the entire Coptic Scriptorium team. We would also like to thank individual contributors (which you can always find in the ‘annotation’ metadata for each document) and the Marcion and PAThs projects who shared their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources and new kinds of annotations and tools. Please let us know if you have any feedback!

Entities in the Coptic Treebank

June 1, 2020 / Amir Zeldes / 0 Comments

With the release of Version 2.6 of Universal Dependencies, our focus has shifted to handling Named and Non-Named Entity Recognition (NER/NNER) in Coptic data. As a result of intensive work by the Coptic Scriptorium team in the past few months, the development branch of the Treebank now contains complete entity spans and types for the entire data in the Treebank, which can be accessed here. Special thanks are due to Lance Martin, Liz Davidson and Mitchell Abrams for all their efforts!

What’s included?

All data from the Coptic treebank (78 documents, approx. 46,000 words)
All spans of text referring to a named or unnamed entity, such as “Emperor Diocletian”, “the old woman” or “his cell”.
Nested entities contained in other entities, such as [the kingdom of [the Emperor Diocletian]]
Entity types, divided into the following 10 classes: (English examples are provided in brackets)
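The bracket notation used for nested entities above can be read as a set of token spans, where an inner span lies entirely inside an outer one. The following is a minimal sketch of that reading, not the project's own tooling; the function name `parse_nested_spans` and the choice of token offsets are assumptions made for illustration.

```python
def parse_nested_spans(text):
    """Return (tokens, spans) for bracket-annotated text.

    Each span is a (start, end) pair of token offsets, end exclusive.
    Spans are listed in the order their closing bracket is seen, so
    inner entities come before the entities that contain them.
    """
    tokens = []
    spans = []
    open_starts = []  # stack of token offsets where a '[' was seen
    # Pad brackets with spaces so they split off as separate items.
    for item in text.replace("[", " [ ").replace("]", " ] ").split():
        if item == "[":
            open_starts.append(len(tokens))
        elif item == "]":
            if not open_starts:
                raise ValueError("unbalanced ']'")
            spans.append((open_starts.pop(), len(tokens)))
        else:
            tokens.append(item)
    if open_starts:
        raise ValueError("unbalanced '['")
    return tokens, spans


tokens, spans = parse_nested_spans("[the kingdom of [the Emperor Diocletian]]")
for start, end in spans:
    print(" ".join(tokens[start:end]))
# prints:
# the Emperor Diocletian
# the kingdom of the Emperor Diocletian
```

Storing spans as offsets rather than strings keeps the nesting explicit: one span contains another exactly when its offsets enclose the other's.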

What do we plan to do with this?

Entity annotations are a gateway to exposing and linking semantic content information from collections of documents. Having such annotations for all of our Coptic data will allow search by entity types (and ultimately names), enable analysis and comparison of texts based on the quantity, proportion and dispersion of entity types, facilitate identification of textual reuse disregarding either the entities involved or the ways in which they are phrased, and much more.
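To make the idea of comparing texts by the proportion of entity types concrete, here is a small sketch. It assumes each document has already been reduced to a flat list of entity type labels. The labels `person` and `place` and both sample documents are hypothetical stand-ins, not the project's actual label set or data.

```python
from collections import Counter


def entity_type_proportions(entity_types):
    """Map each entity type label to its share of all entities in a document."""
    counts = Counter(entity_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hypothetical documents, each a list of entity type labels.
doc_a = ["person", "person", "place", "person"]
doc_b = ["place", "place", "person", "place"]

props_a = entity_type_proportions(doc_a)
props_b = entity_type_proportions(doc_b)
print(props_a)  # {'person': 0.75, 'place': 0.25}
print(props_b)  # {'place': 0.75, 'person': 0.25}
```

Proportions rather than raw counts are used so that documents of different lengths can be compared directly.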

Over the course of the summer, our next goals fall into three packages:

Natural Language Processing (NLP): Develop high-accuracy automatic entity recognition tools for Coptic based on this data, and make them freely available.
Corpora: Enrich all of our available data with automatic entity annotations, which can be corrected and improved iteratively in the future.
Entity linking: Leverage the inventory of named entities identified in the data to carry out named entity linking with resources such as Wikipedia and other DH project identifiers. This will allow users to find all mentions of a specific person or place, regardless of how they are referred to.

Since the tools and annotations are based only on Coptic textual input and subsequent automatic NLP, we envision including search and visualization of entity data for all of our corpora, including ones for which we do not have a translation. This means that data whose content could not be easily deciphered without extensive reading of the original Coptic text will become much more easily discoverable, by exploring entities in which researchers are interested.

Stay tuned for more updates on Coptic entities!
