Tag: release

New Corpora Release 4.4.0

Searching for Greek words in Shenoute’s So Concerning the Little Place

We are pleased to announce release 4.4.0 of Coptic Scriptorium! Our data now includes over 1,267,000 tokens of searchable, linguistically analyzed Coptic data from dozens of ancient Coptic works (an increase of almost 100,000 tokens from the previous release). We are very grateful to all of our collaborators and contributors, without whom this project could not function.

This release corrects a large number of consistency errors identified in our existing data, and also adds some new documents:

We would like to thank the Marcion Project for making the underlying digitized text of Pistis Sophia available, and all of the annotators for their hard work. Tamara Siuda, Rebecca Krawiec, Philippe Zaher, and Lance Martin contributed, in addition to Amir and Carrie. As our current DHAG grant ends, we would like to give special thanks to Lance, who has been working as our DH specialist on the project since 2019, for doing an amazing job of keeping track of all the data and the various tasks he’s been in charge of over the past three years!

As with all releases, raw machine-readable data for all corpora, including morphological and syntactic analysis as well as named entity recognition and entity linking, can be found in our GitHub repository in a variety of popular formats:

https://github.com/copticscriptorium/corpora

You can also search for complex linguistic annotations in the data using our ANNIS server – please see our new tutorial here to get started with some query tips and a helpful cheat sheet:

https://copticscriptorium.org/ANNIS_tutorial
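
For readers who script their searches, a query for something like Greek-origin nouns can be assembled programmatically. The sketch below builds an ANNIS Query Language (AQL) string from attribute-value pairs. The annotation names `lang` and `pos` and the values `Greek` and `N` are assumptions for illustration and should be checked against the tutorial; `build_aql` is a hypothetical helper, not part of ANNIS.

```python
def build_aql(constraints):
    """Build an AQL query matching one span that carries every annotation.

    `constraints` maps annotation names to exact values. Each pair becomes a
    search term; the `_=_` operator requires two terms to cover the same text.
    """
    if not constraints:
        raise ValueError("at least one annotation constraint is required")
    terms = [f'{name}="{value}"' for name, value in constraints.items()]
    # AQL numbers search terms from 1 in order of appearance (#1, #2, ...).
    links = [f"#{i} _=_ #{i + 1}" for i in range(1, len(terms))]
    return " & ".join(terms + links)


# Assumed annotation names and values, for illustration only.
query = build_aql({"lang": "Greek", "pos": "N"})
print(query)  # lang="Greek" & pos="N" & #1 _=_ #2
```

The resulting string can be pasted into the ANNIS query box; chaining `_=_` between consecutive terms is one simple way to require that all annotations sit on the same word.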

We hope this release will be useful and look forward to the next one as always!

New Corpora Release 4.3.0

The opening lines of Pistis Sophia

It is our pleasure to announce release 4.3.0 of Coptic Scriptorium corpora, which currently cover over 1,175,000 tokens of searchable, linguistically analyzed Coptic data from dozens of ancient Coptic works. New in this release:

Corrections and additional annotations:

  • Pilot work adding partial Arabic translations (work by Philippe Zaher)
  • Improvements and error corrections to a variety of works (including Because of You Too O Prince of Evil, Dormition of John, Book of Ruth and Homilies of Proclus)

The newly released material encompasses over 57,000 tokens of semi-automatically annotated data. We would like to give special thanks to the Marcion Project for making much of the underlying digitized text available, and to the annotators whose hard work has made this release possible. As with all releases, raw machine-readable data for all corpora, including morphological and syntactic analysis as well as named entity recognition and entity linking, can be found in our GitHub repository in a variety of popular formats:

https://github.com/copticscriptorium/corpora

We hope this release will be useful and look forward to the next one!

New Corpora Release 4.2.0

Automatic linguistic analysis and Entity Linking from I Samuel 25

It is our pleasure to announce the latest data release from Coptic Scriptorium, version 4.2.0. This release contains both new Coptic material and additions to older datasets, and it expands our entity annotations and named-entity linking to all of our data, including the semi-automatically annotated Old Testament. This also means automatic updates to all of our interfaces, such as the recently added example usage functionality in the Coptic Dictionary Online, which is linked to the corpora.

The new material, including more digitized data courtesy of the Marcion project, as well as manually digitized and corrected OCR data from out-of-print editions, includes:

With this new release, the semi-automatically annotated data (excluding automatically processed Bible materials) in the project covers close to 300,000 words of Sahidic Coptic annotated for entities.

This release represents a tremendous amount of work over the past few months by the Coptic Scriptorium team. We would also like to thank individual contributors (whom you can always find in the ‘annotation’ metadata for each document), and specifically So Miyagawa for help with Coptic OCR models, as well as the Marcion and CoptOT projects for sharing their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources and new kinds of annotations and tools. Please let us know if you have any feedback!

Winter 2021 Corpora Release 4.1.0

Amulet to protect place and animals. O. Crum ST 18 (KYP T344), side-by-side with its rendition on Coptic Scriptorium. Image source.

We are pleased to announce the latest release of data from Coptic Scriptorium, version 4.1.0. The new release adds new Coptic texts and annotation additions, underscored by the application of named and non-named entity annotation to our New Testament corpus.

In total, we released approximately 40,000 tokens of manually edited text in 17 documents from new works, as well as adding material to already existing works. The new material, including more digitized data courtesy of the Marcion project, the Kyprianos Magical Text Database, and other scholars, includes:

We are especially excited to announce the first release of several magical papyri and an ostracon on the Coptic Scriptorium platform in collaboration with the Kyprianos team at the University of Würzburg:

  • Magical Texts (Korshi Dosoo, Edward O. D. Love, Markéta Preininger, Lance Martin, Caroline T. Schroeder, and Amir Zeldes)

Expansions and Improvements of existing corpora:

We have extended our semi-automatic entity annotation coverage to encompass our New Testament material (over 248,000 tokens). Entity annotations, like our other annotations, were added to these specific corpora automatically and include:

  • The classification of all non-pronominal references to people, places and other entities into 10 entity categories
  • Entity linking:
    • Linking of all named entities which have corresponding Wikipedia articles to their respective Wikipedia entries, including geo-location information where available

This addition complements the existing named and non-named entity annotations across our entire collection of Coptic corpora.

We would also like to thank individual contributors (which you can always find in the ‘annotation’ metadata for each document), each of whom put in a colossal amount of work, and the Marcion and Kyprianos projects who shared their data with us, as well as the National Endowment for the Humanities for supporting us. We are continuing to create more data and tools. Please let us know if you have any feedback!

Summer 2020 Corpora Release 4.0.0

Place name index on data.copticscriptorium.org

It is our great pleasure to announce the latest release of data from Coptic Scriptorium, version 4.0.0. This release contains both new Coptic material and extensive additions to our suite of tools and annotations, focusing on the addition of support for entity annotation and named-entity linking across our new and old datasets. The new material, including more digitized data courtesy of the Marcion project and other scholars, includes:

With this new release, the semi-automatically annotated data (excluding automatically processed Bible materials) in the project covers close to 260,000 words of Sahidic Coptic annotated for entities, including 50,000 words of gold-standard treebanked data with manual syntactic analyses.

In addition to new texts, new tools and analyses have been added to the project:

  • Complete entity annotation, classifying all non-pronominal references to people, places and other entities into 10 entity categories
  • Entity linking:
    • Linking of all named entities which have corresponding Wikipedia articles to their respective Wikipedia entries, including geo-location information where available
    • A browseable index of people and places mentioned in the texts, also linked to Wikipedia and Google Maps and including both real and fictional entities
  • Search and visualization:
    • Search by entity type and named entity in ANNIS
    • New configurable analytic visualization which displays nested entity types, highlights named entities and links them to Wikipedia
  • Natural Language Processing
    • Automatic entity recognition is now available (work by Amir Zeldes, Lance Martin and Sichang Tu)
    • A new neural parser adapted for Coptic with higher accuracy syntactic analyses, which are deployed in ANNIS (work by Luke Gessler)

The new configurable Analytic Visualization with toggleable entity types and links

This release represents a tremendous amount of work over the past few months by the entire Coptic Scriptorium team. We would also like to thank individual contributors (which you can always find in the ‘annotation’ metadata for each document) and the Marcion and PAThs projects who shared their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources and new kinds of annotations and tools. Please let us know if you have any feedback!

Entities in the Coptic Treebank

With the release of Version 2.6 of Universal Dependencies, our focus has shifted to handling Named and Non-Named Entity Recognition (NER/NNER) in Coptic data. As a result of intensive work by the Coptic Scriptorium team in the past few months, the development branch of the Treebank now contains complete entity spans and types for all of the data in the Treebank, which can be accessed here. Special thanks are due to Lance Martin, Liz Davidson and Mitchell Abrams for all their efforts!

What’s included?

  • All data from the Coptic treebank (78 documents, approx. 46,000 words)
  • All spans of text referring to a named or unnamed entity, such as “Emperor Diocletian”, “the old woman” or “his cell”.
  • Nested entities contained in other entities, such as [the kingdom of [the Emperor Diocletian]]
  • Entity types, divided into the following 10 classes: (English examples are provided in brackets)

 

What do we plan to do with this?

Entity annotations are a gateway to exposing and linking semantic content information from collections of documents. Having such annotations for all of our Coptic data will allow search by entity types (and ultimately names), enable analysis and comparison of texts based on the quantity, proportion and dispersion of entity types, facilitate identification of textual reuse disregarding either the entities involved or the ways in which they are phrased, and much more.
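
As a rough illustration of the kind of analysis this enables, the sketch below reads nested entity spans written in a labeled bracket notation and computes the proportion of each entity type. The notation `[label words ...]`, the labels `person` and `place`, and both function names are assumptions made for this example; they are not the project's actual data format.

```python
from collections import Counter


def parse_entities(text):
    """Return (label, span_text) pairs for every bracketed entity, nested ones included.

    Assumed notation: "[label words ...]", where brackets may nest.
    """
    stack = []      # one [label, words] entry per currently open bracket
    entities = []
    for token in text.replace("[", " [ ").replace("]", " ] ").split():
        if token == "[":
            stack.append([None, []])
        elif token == "]":
            label, words = stack.pop()
            entities.append((label, " ".join(words)))
        elif stack and stack[-1][0] is None:
            stack[-1][0] = token          # first token after "[" is the type label
        else:
            for entry in stack:           # a word belongs to every open entity
                entry[1].append(token)
    return entities


def type_proportions(entities):
    """Share of each entity type among all entity spans."""
    counts = Counter(label for label, _ in entities)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


sample = ("[person the old woman] went to [place his cell] in "
          "[place the kingdom of [person the Emperor Diocletian]]")
entities = parse_entities(sample)
print(entities)
print(type_proportions(entities))  # {'person': 0.5, 'place': 0.5}
```

Inner entities are emitted before the entities that contain them, because a span is recorded when its closing bracket is reached.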

Over the course of the summer, our next goals fall into three packages:

  1. Natural Language Processing (NLP): Develop high-accuracy automatic entity recognition tools for Coptic based on this data, and make them freely available.
  2. Corpora: Enrich all of our available data with automatic entity annotations, which can be corrected and improved iteratively in the future.
  3. Entity linking: Leverage the inventory of named entities identified in the data to carry out named entity linking with resources such as Wikipedia and other DH project identifiers. This will allow users to find all mentions of a specific person or place, regardless of how they are referred to.
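
The benefit of the third package can be pictured with a small sketch: once each named mention carries a shared identifier, all mentions of one person can be retrieved no matter how the text phrases them. The records, document names and identifier below are invented for illustration, and `index_by_identifier` is a hypothetical helper rather than part of the project's tools.

```python
from collections import defaultdict

# Hypothetical records: (document, mention text, linked identifier or None).
mentions = [
    ("doc_a", "Emperor Diocletian", "Diocletian"),
    ("doc_a", "the old woman", None),       # unnamed entity, so no link
    ("doc_b", "the emperor", "Diocletian"),
    ("doc_b", "Diocletian", "Diocletian"),
]


def index_by_identifier(records):
    """Group linked mentions by identifier, skipping unlinked ones."""
    index = defaultdict(list)
    for doc, text, identifier in records:
        if identifier is not None:
            index[identifier].append((doc, text))
    return dict(index)


index = index_by_identifier(mentions)
for doc, text in index["Diocletian"]:
    print(doc, text)
```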

Since the tools and annotations are based only on Coptic textual input and subsequent automatic NLP, we envision including search and visualization of entity data for all of our corpora, including ones for which we do not have a translation. This means that data whose content could not be easily deciphered without extensive reading of the original Coptic text will become much more easily discoverable, by exploring entities in which researchers are interested.

Stay tuned for more updates on Coptic entities!

Winter 2020 Corpora Release 3.1.0

It is our pleasure to announce a new data release, with a variety of new sources from our collaborators (including more digitized data courtesy of the Marcion and PAThs projects and other scholars). New in this release are:

All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold).

You can search all corpora using ANNIS and download the data in 4 formats (relANNIS database files, PAULA XML files, TEI XML files, and SGML files in Tree-tagger format): browse on GitHub. If you just want to read works, cite project data or browse metadata, you can use our updated repository browser, the Canonical Text Services browser and URN resolver:

http://data.copticscriptorium.org/

The new material in this release includes some 78,000 tokens in 33 documents and represents a tremendous amount of work by our project members and collaborators. We would like to thank the individual contributors (which you can find in the ‘annotation’ metadata), the Marcion and PAThs projects who shared their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources and new kinds of annotations and tools, which we plan to make available in the summer. Please let us know if you have any feedback!

Fall 2019 Corpora Release 3.0.0

Coptic Scriptorium is happy to announce our latest data release, including a variety of new sources thanks to our collaborators (digitized data courtesy of the Marcion and PAThs projects!). New in this release are:

  • Saints’ lives
    • Life of Cyrus
    • Life of Onnophrius
    • Lives of Longinus and Lucius
    • Martyrdom of Victor the General (part 2)
  • Miscellaneous:
    • Dormition of John
    • Homilies of Proclus
    • Letter of Pseudo-Ephrem

We are also releasing expansions to some of our existing corpora, including:

  • Canons of Johannes (new material annotated by Elizabeth Platte and Caroline T. Schroeder, digital edition provided by Diliana Atanassova)
  • Apophthegmata Patrum
  • A large number of corrections to most of our existing corpora, which are being republished in this release.

All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold).

You can search all corpora using ANNIS and download the data in 4 formats (relANNIS database files, PAULA XML files, TEI XML files, and SGML files in Tree-tagger format): browse on GitHub. If you just want to read works, cite project data or browse metadata, you can use our updated repository browser, the Canonical Text Services browser and URN resolver:

http://data.copticscriptorium.org/

Our total annotated corpora are now at over 850,000 words; corpora that have human editors who reviewed the machine annotations are now over 150,000!

We would like to thank Marcion, PAThs and the National Endowment for the Humanities for supporting us – we hope this release will be useful and are already working on more!

Spring 2019 Corpora Release 2.7.0

We at Coptic Scriptorium are pleased to announce version 2.7.0 of our corpora. The release includes several new documents:

  • several more sayings in the Coptic Apophthegmata Patrum (edited & annotated by Marina Ghaly)
  • additional fragments of Shenoute’s sermon Some Kinds of People Sift Dirt (edited & annotated by Christine Luckritz Marquis, editions provided by David Brakke)
  • Besa’s letter On Vigilance (edited and annotated by So Miyagawa and others)
  • several more fragments of the monastic canons of Apa Johannes (annotated by Elizabeth Platte and Caroline T. Schroeder, digital edition provided by Diliana Atanassova)

All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold).

You can search all corpora at https://corpling.uis.georgetown.edu/annis/scriptorium and download the data in 4 formats (relANNIS database files, PAULA XML files, TEI XML files, and SGML files in Tree-tagger format).

Our total annotated corpora are now at over 780,000 words; corpora that have human editors who reviewed the machine annotations amount to over 100,000 words.

Enjoy!

Corpora release 2.6

We are pleased to announce release 2.6 of our corpora! Some exciting new things:

  • Expanded Coptic Old Testament
  • More gold-standard treebanked texts
  • Updated files of Shenoute’s Abraham Our Father and Acephalous Work 22
  • New metadata fields to indicate whether documents have been machine annotated or if an editor has reviewed the machine annotations

Expanded Coptic Old Testament

Our Coptic Old Testament corpus is updated and expanded, with digital text from our partners at the Digital Edition of the Coptic Old Testament project in Goettingen. All annotations in this corpus are fully machine-processed (no human editing, because it’s BIG). You can read through all the text in two different visualizations online and search it in the ANNIS database:

  1. analytic: the normalized text segmented into words aligned with part of speech tags; each verse is aligned with Brenton’s English translation of the Septuagint
  2. chapter: the normalized text presented as chapters and verses; each word links to the online Coptic dictionary
  3. ANNIS search: full search of text, lemmas, parts of speech, syntactic annotations, etc. (see our ANNIS tips if you’re new to ANNIS)

Please keep in mind this corpus is fully machine-annotated, and we currently do not have the capacity to make manual changes to a corpus of this size.  If you notice systemic errors (the same thing tagged incorrectly often, for example) please let us know.  Otherwise, please be patient: as the tools improve, we will update this corpus.

We’ve also machine-aligned the text with Brenton’s English translation of the Septuagint. It’s possible there will be some misalignments.  Thanks for your understanding!

Treebanks

We’ve added more documents to our separate gold-standard treebank corpus.  (Want to learn more about treebanks?) In this corpus, the treebank/syntactic annotations have been manually corrected; the documents are part of the Universal Dependencies project for cross-language linguistics research.  New treebanked documents include selections from 1 Corinthians, the Gospel of Mark, Shenoute’s Abraham Our Father, Shenoute’s Acephalous Work 22, and the Martyrdom of Victor.  This means the self-standing treebank corpus is expanded, and any documents we’ve treebanked have updated word segmentation, part of speech tagging, etc., in their regular corpora.

Updated Shenoute Documents

Documents in the corpora for Shenoute’s Abraham Our Father and Acephalous Work 22 have several updates.

First, some documents are in our treebank corpus and are now significantly more accurate in terms of word segmentation, tagging, etc.

Second, we’ve added chapter and verse segmentation to these works. Since there are no comprehensive print editions of these works with versification, we’ve applied our own chapter and verse numbers. We recognize that versification is arbitrary, but nonetheless useful for citation. For texts transcribed from manuscripts, chapter divisions typically occur where an ekthesis occurs in the manuscript. (Ekthesis describes a letter hanging in the margin.) Not every ekthesis begins a new chapter (if ekthesis is very frequent), but we try to place chapter divisions only where an ekthesis occurs. A verse typically equals one sentence; very short sentences are sometimes grouped into a single verse, and very long Shenoutean sentences are sometimes split across more than one verse.

Third, we’ve added “Order” metadata to make it easier to read a work in order if it’s broken into multiple documents.  Check out Abraham Our Father, for example: the first document in the list is the beginning of the work.

Screen shot of list of documents in Abraham Our Father Corpus

When you’re reading through a document, click on “Next” to get the next document in reading order.  (If there are multiple manuscript witnesses to a work, we’ll send you to the next document in order with the fewest lacunae, or missing segments.)

Screen shot of beginning of Abraham Our Father

Of course, you can always click on documents in any order you want to read however you like!
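
The “Next” behaviour described above can be sketched in a few lines. This is only an illustration under stated assumptions: the field names `order` and `lacunae`, the document names, and the function `next_document` are hypothetical and do not reflect how the repository browser is actually implemented.

```python
def next_document(documents, current_order):
    """Pick the document to read after position `current_order`.

    Among the witnesses sharing the next position in reading order,
    choose the one with the fewest lacunae. Returns None at the end.
    """
    later = [d for d in documents if d["order"] > current_order]
    if not later:
        return None
    next_order = min(d["order"] for d in later)
    candidates = [d for d in later if d["order"] == next_order]
    return min(candidates, key=lambda d: d["lacunae"])


# Hypothetical documents; two manuscript witnesses cover position 2.
docs = [
    {"name": "part_1", "order": 1, "lacunae": 0},
    {"name": "part_2_witness_A", "order": 2, "lacunae": 5},
    {"name": "part_2_witness_B", "order": 2, "lacunae": 1},
    {"name": "part_3", "order": 3, "lacunae": 2},
]
print(next_document(docs, 1)["name"])  # part_2_witness_B
```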

And everything is fully searchable across all documents in ANNIS.

New Metadata Fields Documenting Annotation

We sometimes get asked: which corpora do scholars annotate and which corpora are machine-annotated? The answer is complicated: almost everything is machine annotated, with different levels of scholarly review. So we’re adding three new metadata fields to help show users what kinds of annotation each document gets:

  • Segmentation refers to word segmentation (or “tokenization”) performed by the tokenizer tool.
  • Tagging refers to part of speech, language of origin, and lemma tagging performed by our tagger.
  • Parsing refers to dependency syntax annotations (which are part of our treebanking).

Each of these fields contains one of the following values:

  • automatic: fully machine annotated; no manual review or corrections to the tool output
  • checked: the tool has annotated the text, and a scholar has reviewed the annotations before publication
  • gold: the tools have been run and the annotations have received thorough review; this value usually applies only to documents that have been treebanked by a scholar (requiring rigorous review of word segmentation and tagging along the way)
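
These fields make it straightforward to select documents by how closely they have been reviewed. The sketch below is a minimal example that assumes the metadata is available as a dictionary with lowercase keys `segmentation`, `tagging` and `parsing`, and that the three values can be ranked automatic < checked < gold; the document names and the helper `meets_level` are hypothetical.

```python
# Assumed ranking of the three values, from least to most reviewed.
LEVELS = {"automatic": 0, "checked": 1, "gold": 2}
FIELDS = ("segmentation", "tagging", "parsing")


def meets_level(metadata, minimum, fields=FIELDS):
    """True if every listed field is at least as reviewed as `minimum`."""
    return all(LEVELS[metadata[f]] >= LEVELS[minimum] for f in fields)


# Hypothetical documents with the three annotation fields.
documents = {
    "doc_checked": {"segmentation": "checked", "tagging": "checked", "parsing": "automatic"},
    "doc_gold": {"segmentation": "gold", "tagging": "gold", "parsing": "gold"},
}

# Documents whose segmentation and tagging a scholar has at least checked.
reviewed = [name for name, meta in documents.items()
            if meets_level(meta, "checked", fields=("segmentation", "tagging"))]
print(reviewed)  # ['doc_checked', 'doc_gold']

# Documents that are gold in all three fields.
print([name for name, meta in documents.items() if meets_level(meta, "gold")])  # ['doc_gold']
```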

For example, in the first image of document metadata visible in ANNIS, the document has automatic parsing; a scholar has checked the word segmentation and tagging.

 

Screenshot of document metadata showing checked word segmentation and tagging

In the next image of document metadata, a scholar has treebanked the text, making segmentation, tagging, and parsing all gold.

Screenshot of document metadata showing gold level annotations

 

We are rolling out these metadata fields with each new corpus and newly edited corpus; not every corpus has them yet, only the ones in this release. Our New Testament and Old Testament corpora are machine-annotated (automatic) in all annotations.

 

We hope you enjoy!

 
