Category: Release Notes

New Corpora Release 4.4.0

Searching for Greek words in Shenoute’s So Concerning the Little Place

We are pleased to announce release 4.4.0 of Coptic Scriptorium! Our data now includes over 1,267,000 tokens of searchable, linguistically analyzed Coptic data from dozens of ancient Coptic works (an increase of almost 100,000 tokens from the previous release). We are very grateful to all of our collaborators and contributors, without whom this project could not function.

This release corrects a large number of consistency errors identified in our existing data, and also adds some new documents:

We would like to thank the Marcion Project for making the underlying digitized text of Pistis Sophia available, and all of the annotators for their hard work. Tamara Siuda, Rebecca Krawiec, Philippe Zaher, and Lance Martin contributed, in addition to Amir and Carrie. As our current DHAG grant ends, we would like to give special thanks to Lance, who has been working as our DH specialist on the project since 2019, for doing an amazing job of keeping track of all the data and the various tasks he’s been in charge of over the past three years!

As with all releases, raw machine-readable data for all corpora, including morphological and syntactic analysis as well as named entity recognition and entity linking, can be found in our GitHub repository in a variety of popular formats:

https://github.com/copticscriptorium/corpora

You can also search for complex linguistic annotations in the data using our ANNIS server – please see our new tutorial here to get started with some query tips and a helpful cheat sheet:

https://copticscriptorium.org/ANNIS_tutorial

We hope this release will be useful and look forward to the next one as always!

New Corpora Release 4.3.0

The opening lines of Pistis Sophia

It is our pleasure to announce release 4.3.0 of Coptic Scriptorium corpora, which currently cover over 1,175,000 tokens of searchable, linguistically analyzed Coptic data from dozens of ancient Coptic works. New in this release:

Corrections and additional annotations:

  • Pilot work adding partial Arabic translations (work by Philippe Zaher)
  • Improvements and error corrections to a variety of works (including Because of You Too O Prince of Evil, Dormition of John, Book of Ruth and Homilies of Proclus)

The newly released material encompasses over 57,000 tokens of semi-automatically annotated data. We would like to give special thanks to the Marcion Project for making much of the underlying digitized text available, and to the annotators whose hard work has made this release possible. As with all releases, raw machine-readable data for all corpora, including morphological and syntactic analysis as well as named entity recognition and entity linking, can be found in our GitHub repository in a variety of popular formats:

https://github.com/copticscriptorium/corpora

We hope this release will be useful and look forward to the next one!

Winter 2021 Corpora Release 4.1.0

Amulet to protect place and animals. O. Crum ST 18 (KYP T344), side-by-side with its rendition on Coptic Scriptorium.

We are pleased to announce the latest release of data from Coptic Scriptorium, version 4.1.0. The new release adds new Coptic texts and new annotations, most notably the application of named and non-named entity annotation to our New Testament corpus.

In total, we released approximately 40,000 tokens of manually edited text in 17 documents from new works, as well as adding material to already existing works. The new material, including more digitized data courtesy of the Marcion project, the Kyprianos Magical Text Database, and other scholars, includes:

We are especially excited to announce the first release of several magical papyri and an ostracon on the Coptic Scriptorium platform in collaboration with the Kyprianos team at the University of Würzburg:

  • Magical Texts (Korshi Dosoo, Edward O. D. Love, Markéta Preininger, Lance Martin, Caroline T. Schroeder, and Amir Zeldes)

Expansions and Improvements of existing corpora:

We have extended our semi-automatic entity annotation coverage to encompass our New Testament material (over 248,000 tokens). Entity annotations, like our other annotations, were added to these specific corpora automatically and include:

  • The classification of all non-pronominal references to people, places and other entities into 10 entity categories
  • Entity linking:
    • Linking of all named entities which have corresponding Wikipedia articles to their respective Wikipedia entries, including geo-location information where available

This addition complements the existing named and non-named entity annotations of our entire collections of Coptic corpora.

We would also like to thank individual contributors (which you can always find in the ‘annotation’ metadata for each document), each of whom put in a colossal amount of work, and the Marcion and Kyprianos projects who shared their data with us, as well as the National Endowment for the Humanities for supporting us. We are continuing to create more data and tools. Please let us know if you have any feedback!

Comprehensive Coptic Lexicon v1.2

The “Thesaurus Linguae Aegyptiae” project (“Strukturen und Transformationen des Wortschatzes der ägyptischen Sprache”, BBAW), the “Database and Dictionary of Greek Loanwords in Coptic” (DDGLC, Freie Universität Berlin), and “Coptic Scriptorium: Digital Research in Coptic Language and Literature” are pleased to announce the latest release of the “Comprehensive Coptic Lexicon”: Version 1.2. The raw data can be downloaded from: 

  • D. Burns, F. Feder, K. John, M. Kupreyev, et al. 2020-07-24. Comprehensive Coptic Lexicon: Including Loanwords from Ancient Greek, Berlin: Freie Universität Berlin, http://dx.doi.org/10.17169/refubium-27566

The processed data has been published by the 

The major new features include: 

  • Standardized use of parentheses “( )” in word forms.
  • Optimized data structure (e.g., the <sense/> element now contains a unique ID, facilitating the ongoing work on linking CCL to databases of semantic relations such as Coptic WordNet).
  • Correction of orthographic, grammatical and semantic information of the existing entries and addition of new entries.
  • Linking to Perseus Greek morphology tool via the Greek head words. DDGLC lemma IDs are now displayed in the entry view of Coptic Dictionary Online.
  • Improved usability of the section of Greek loanwords due to the exclusion or change of a number of senses.
  • Link to attestation search for nouns filtered by entity-type (e.g., search for ⲟⲩⲟⲛ standing for a person, an animal, or an inanimate object) in Coptic Scriptorium.
  • Phrase network visualization of most common word sequences containing nouns, verbs and prepositions.

For the full description of Version 1.2, please refer to the “Release Notes”: https://refubium.fu-berlin.de/bitstream/handle/fub188/27813/Comprehensive_Coptic_Lexicon_v1.2_Release_Notes.pdf?sequence=5&isAllowed=y


The Comprehensive Coptic Lexicon V 1.2 now contains 11263 entries and 31847 forms of Egyptian-Coptic and Greek-Coptic datasets. TLA, DDGLC and Coptic Scriptorium invite you to take a look at the new data and would welcome your feedback.

Summer 2020 Corpora Release 4.0.0

Place name index on data.copticscriptorium.org

It is our great pleasure to announce the latest release of data from Coptic Scriptorium, version 4.0.0. This release contains both new Coptic material and extensive additions to our suite of tools and annotations, focusing on the addition of support for entity annotation and named-entity linking across our new and old datasets. The new material, including more digitized data courtesy of the Marcion project and other scholars, includes:

With this new release, the semi-automatically annotated data (excluding automatically processed Bible materials) in the project covers close to 260,000 words of Sahidic Coptic annotated for entities, including 50,000 words of gold-standard treebanked data with manual syntactic analyses.

In addition to new texts, new tools and analyses have been added to the project:

  • Complete entity annotation, classifying all non-pronominal references to people, places and other entities into 10 entity categories
  • Entity linking:
    • Linking of all named entities which have corresponding Wikipedia articles to their respective Wikipedia entries, including geo-location information where available
    • A browseable index of people and places mentioned in the texts, also linked to Wikipedia and Google Maps and including both real and fictional entities
  • Search and visualization:
    • Search by entity type and named entity in ANNIS
    • New configurable analytic visualization which displays nested entity types, highlights named entities and links them to Wikipedia
  • Natural Language Processing
    • Automatic entity recognition is now available (work by Amir Zeldes, Lance Martin and Sichang Tu)
    • A new neural parser adapted for Coptic with higher accuracy syntactic analyses, which are deployed in ANNIS (work by Luke Gessler)

The new configurable Analytic Visualization with toggleable entity types and links

This release represents a tremendous amount of work over the past few months by the entire Coptic Scriptorium team. We would also like to thank individual contributors (which you can always find in the ‘annotation’ metadata for each document) and the Marcion and PAThs projects who shared their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources and new kinds of annotations and tools. Please let us know if you have any feedback!

Entities in the Coptic Treebank


With the release of Version 2.6 of Universal Dependencies, our focus has shifted to handling Named and Non-Named Entity Recognition (NER/NNER) in Coptic data. As a result of intensive work by the Coptic Scriptorium team in the past few months, the development branch of the Treebank now contains complete entity spans and types for the entire data in the Treebank, which can be accessed here. Special thanks are due to Lance Martin, Liz Davidson and Mitchell Abrams for all their efforts!

What’s included?

  • All data from the Coptic treebank (78 documents, approx. 46,000 words)
  • All spans of text referring to a named or unnamed entity, such as “Emperor Diocletian”, “the old woman” or “his cell”.
  • Nested entities contained in other entities, such as [the kingdom of [the Emperor Diocletian]]
  • Entity types, divided into the following 10 classes: (English examples are provided in brackets)


What do we plan to do with this?

Entity annotations are a gateway to exposing and linking semantic content information from collections of documents. Having such annotations for all of our Coptic data will allow search by entity types (and ultimately names), enable analysis and comparison of texts based on the quantity, proportion and dispersion of entity types, facilitate identification of textual reuse disregarding either the entities involved or the ways in which they are phrased, and much more.
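As a rough illustration of what nested entity spans look like as data, the sketch below turns the bracket notation used above into token offsets with a nesting depth. The function name `parse_bracketed` and the (start, end, depth) representation are assumptions made for this example; they are not the format used in the Treebank files, and entity types are left out.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, nesting depth)
Span = Tuple[int, int, int]


def parse_bracketed(text: str) -> Tuple[List[str], List[Span]]:
    """Turn a bracketed string into tokens plus (start, end, depth) spans."""
    tokens: List[str] = []
    spans: List[Span] = []
    open_starts: List[int] = []  # stack of token offsets of open brackets
    # Pad brackets with spaces so they split off as separate items.
    for item in text.replace("[", " [ ").replace("]", " ] ").split():
        if item == "[":
            open_starts.append(len(tokens))
        elif item == "]":
            if not open_starts:
                raise ValueError("unbalanced closing bracket")
            start = open_starts.pop()
            # After the pop, the stack size equals this span's depth.
            spans.append((start, len(tokens), len(open_starts)))
        else:
            tokens.append(item)
    if open_starts:
        raise ValueError("unbalanced opening bracket")
    return tokens, sorted(spans)


tokens, spans = parse_bracketed("[the kingdom of [the Emperor Diocletian]]")
for start, end, depth in spans:
    # Indent nested entities under the entity that contains them.
    print("  " * depth + " ".join(tokens[start:end]))
```

With spans stored this way, counting entities per document or finding all entities nested inside another one becomes a matter of comparing offsets.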

Over the course of the summer, our next goals fall into three packages:

  1. Natural Language Processing (NLP): Develop high-accuracy automatic entity recognition tools for Coptic based on this data, and make them freely available.
  2. Corpora: Enrich all of our available data with automatic entity annotations, which can be corrected and improved iteratively in the future.
  3. Entity linking: Leverage the inventory of named entities identified in the data to carry out named entity linking with resources such as Wikipedia and other DH project identifiers. This will allow users to find all mentions of a specific person or place, regardless of how they are referred to.

Since the tools and annotations are based only on Coptic textual input and subsequent automatic NLP, we envision including search and visualization of entity data for all of our corpora, including ones for which we do not have a translation. This means that data whose content could not be easily deciphered without extensive reading of the original Coptic text will become much more easily discoverable, by exploring entities in which researchers are interested.

Stay tuned for more updates on Coptic entities!

Universal Dependencies 2.6 released!


Check out the new Universal Dependencies (UD) release V2.6! This is the twelfth release of the annotated treebanks at http://universaldependencies.org/.  The project now covers syntactically annotated corpora in 92 languages, including Coptic. The size of the Coptic Treebank is now around 43,000 words, and growing. For the latest version of the Coptic data, see our development branch here: https://github.com/UniversalDependencies/UD_Coptic-Scriptorium/tree/dev. For documentation, see the UD Coptic annotation guidelines.
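UD treebanks are distributed as CoNLL-U files: one word per line with ten tab-separated columns, comment lines starting with "#", and a blank line between sentences. The following minimal sketch reads such data without any external library and counts part-of-speech tags. The sample sentence is hand-written in Latin transliteration purely for illustration (real files use Coptic script and fill in the columns left as "_" here), and `read_conllu` is a name chosen for this example.

```python
from collections import Counter
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of word rows (dicts keyed by FIELDS)."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():          # blank line closes a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # sentence-level comment or metadata
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1"),
        # keeping only the syntactic words.
        if "-" in row["id"] or "." in row["id"]:
            continue
        sentence.append(row)
    if sentence:                      # file may not end with a blank line
        yield sentence


# Hand-written illustrative sample, NOT copied from the treebank.
SAMPLE = "\n".join([
    "# text = pejaf nau",
    "1-2\tpejaf\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tpeja\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tf\t_\tPRON\t_\t_\t1\tnsubj\t_\t_",
    "3-4\tnau\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tna\t_\tADP\t_\t_\t4\tcase\t_\t_",
    "4\tu\t_\tPRON\t_\t_\t1\tobl\t_\t_",
    "",
])

sentences = list(read_conllu(SAMPLE))
upos_counts = Counter(row["upos"] for sent in sentences for row in sent)
print(len(sentences), "sentence(s);", dict(upos_counts))
```

To run this on the treebank itself, the contents of a downloaded .conllu file would be passed in place of `SAMPLE`.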

The inclusion of the Coptic Treebank in the UD dataset means that many standard parsers and other NLP tools trained on all well-attested UD languages now support Coptic out of the box, including Stanford NLP’s Stanza and UFAL’s UDPipe. Feel free to try out these libraries for your data! For optimal performance on open-domain Coptic text, we still recommend our custom toolchain, Coptic-NLP, which is highly optimized for Coptic and uses additional resources beyond the treebank. Or try it out online:

Coptic-NLP demo


Winter 2020 Corpora Release 3.1.0

It is our pleasure to announce a new data release, with a variety of new sources from our collaborators (including more digitized data courtesy of the Marcion and PAThs projects and other scholars). New in this release are:

All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold).
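The three review levels described above lend themselves to simple filtering. The sketch below assumes each document's metadata has been loaded into a dictionary with the keys `segmentation`, `tagging` and `parsing`; those key names, the helper `meets_level` and the three sample documents are assumptions made for illustration, not a description of the project's actual files.

```python
from typing import Dict, List

# Ordered from least to most reviewed, following the three levels above.
LEVELS = {"automatic": 0, "checked": 1, "gold": 2}
LAYERS = ("segmentation", "tagging", "parsing")


def meets_level(meta: Dict[str, str], minimum: str) -> bool:
    """True if every annotation layer is at least as reviewed as `minimum`."""
    return all(LEVELS[meta[layer]] >= LEVELS[minimum] for layer in LAYERS)


# Hypothetical documents, invented for this example.
docs: List[Dict[str, str]] = [
    {"title": "doc_a", "segmentation": "gold", "tagging": "gold", "parsing": "gold"},
    {"title": "doc_b", "segmentation": "checked", "tagging": "checked", "parsing": "automatic"},
    {"title": "doc_c", "segmentation": "checked", "tagging": "gold", "parsing": "checked"},
]

# Keep only documents in which a human has reviewed every layer.
reviewed = [d["title"] for d in docs if meets_level(d, "checked")]
print(reviewed)
```

Requiring the minimum across all layers is a deliberately strict choice; a study that only needs reliable part-of-speech tags could check the tagging value alone.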

You can search all corpora using ANNIS and download the data in 4 formats (relANNIS database files, PAULA XML files, TEI XML files, and SGML files in Tree-tagger format): browse on GitHub. If you just want to read works, cite project data or browse metadata, you can use our updated repository browser, the Canonical Text Services browser and URN resolver:

http://data.copticscriptorium.org/

The new material in this release includes some 78,000 tokens in 33 documents and represents a tremendous amount of work by our project members and collaborators. We would like to thank the individual contributors (which you can find in the ‘annotation’ metadata), the Marcion and PAThs projects who shared their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources and new kinds of annotations and tools, which we plan to make available in the summer. Please let us know if you have any feedback!

Fall 2019 Corpora Release 3.0.0

Coptic Scriptorium is happy to announce our latest data release, including a variety of new sources thanks to our collaborators (digitized data courtesy of the Marcion and PAThs projects!). New in this release are:

  • Saints’ lives
    • Life of Cyrus
    • Life of Onnophrius
    • Lives of Longinus and Lucius
    • Martyrdom of Victor the General (part 2)
  • Miscellaneous:
    • Dormition of John
    • Homilies of Proclus
    • Letter of Pseudo-Ephrem

We are also releasing expansions to some of our existing corpora, including:

  • Canons of Johannes (new material annotated by Elizabeth Platte and Caroline T. Schroeder, digital edition provided by Diliana Atanassova)
  • Apophthegmata Patrum
  • A large number of corrections to most of our existing corpora, which are being republished in this release.

All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold).

You can search all corpora using ANNIS and download the data in 4 formats (relANNIS database files, PAULA XML files, TEI XML files, and SGML files in Tree-tagger format): browse on GitHub. If you just want to read works, cite project data or browse metadata, you can use our updated repository browser, the Canonical Text Services browser and URN resolver:

http://data.copticscriptorium.org/

Our total annotated corpora are now at over 850,000 words; corpora whose machine annotations have been reviewed by human editors now total over 150,000 words!

We would like to thank Marcion, PAThs and the National Endowment for the Humanities for supporting us – we hope this release will be useful and are already working on more!

New release of Natural Language Processing Tools

Amir Zeldes and Luke Gessler have spent much of the past summer improving Coptic Scriptorium’s Natural Language Processing tools, and are now happy to announce the release of Coptic-NLP V3.0.0. You can read more about what we’ve been doing and the impact on performance in our three-part blog post (part 1, part 2, part 3). Some of the new improvements include:

  • A new 3-step normalization framework, which allows us to hypothetically normalize bound groups before deciding how to segment them, then normalize each segment again
  • A smart rebinding module which can decide from context whether to merge split bound groups (useful for processing messy texts with line breaks mid-word, or other segmentation anomalies)
  • A re-implemented segmentation algorithm which is notably better at handling ambiguous groups in context (e.g. “nau” in “peja|f na|u” vs. “nau ero|f”) and spelling variation
  • A brand new, more accurate part-of-speech tagger
  • Higher accuracy across tools thanks to hyperparameter optimization
  • More robust test suite to ensure new errors don’t creep in
  • Various data/lexicon/ruleset improvements and bugfixes
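To make the order of operations in the 3-step normalization framework concrete, here is a toy sketch of the data flow only: normalize the bound group, segment it, then normalize each segment. It is not Coptic-NLP's implementation. All function names and lookup tables are invented for this example, the forms are in Latin transliteration, "eroo" is a made-up variant spelling, and the real tools use trained, context-sensitive models rather than table lookups.

```python
from typing import Dict, List

# Toy tables invented for this sketch; NOT Coptic-NLP's lexicon or rules.
SEGMENTATIONS: Dict[str, List[str]] = {
    "pejaf": ["peja", "f"],
    "eroof": ["eroo", "f"],  # "eroo" is a made-up variant spelling
}
SEGMENT_VARIANTS: Dict[str, str] = {"eroo": "ero"}  # made-up variant -> normal form


def normalize_group(group: str) -> str:
    """Step 1: hypothesize a normalized bound group (lowercase, letters only)."""
    return "".join(ch for ch in group.lower() if ch.isalpha())


def segment(group: str) -> List[str]:
    """Step 2: split the normalized group (toy lookup; unknown groups stay whole)."""
    return SEGMENTATIONS.get(group, [group])


def normalize_segment(seg: str) -> str:
    """Step 3: normalize each segment again after segmentation."""
    return SEGMENT_VARIANTS.get(seg, seg)


def process(group: str) -> List[str]:
    """Chain the three steps for one bound group."""
    return [normalize_segment(s) for s in segment(normalize_group(group))]


print(process("Peja-f"))
print(process("eroof."))
```

The point of the first step is that segmentation is looked up on a cleaned form of the group, while the final step can still repair variation that only becomes visible once the segments are separated.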

You can download the latest version of the tools here:

https://github.com/CopticScriptorium/coptic-nlp/

Or use our web interface, which has been updated with the latest version:

https://corpling.uis.georgetown.edu/coptic-nlp/

We appreciate your feedback and comments, and hope to release more data processed with these tools very soon!
