Tag: treebank

New Corpora Release 4.2.0

Automatic linguistic analysis and Entity Linking from I Samuel 25

It is our pleasure to announce the latest data release from Coptic Scriptorium, version 4.2.0. This release contains both new Coptic material and additions to older datasets, as well as expanding our entity annotations and named-entity linking to all of our data, including the semi-automatically annotated Old Testament. The also means automatic updates to all of our interfaces, such as the recently added example usage functionality in the Coptic Dictionary Online, which is linked to the corpora.

The new material, including more digitized data courtesy of the Marcion project, as well as manually digitized and corrected OCR data from out of print editions includes:

With this new release, the semi-automatically annotated data (excluding automatically processed Bible materials) in the project covers close to 300,000 words of Sahidic Coptic annotated for entities.

This release represents a tremendous amount of work over the past few months by the Coptic Scriptorium team. We would also like to thank individual contributors (which you can always find in the ‘annotation’ metadata for each document), and specifically So Miyagawa for help with Coptic OCR models, as well as the Marcion and CoptOT project for sharing their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources and new kinds of annotations and tools. Please let us know if you have any feedback!

Entities in the Coptic Treebank

entities

With the release of Version 2.6 of Universal Dependencies, our focus has shifted to handling Named and Non-Named Entity Recognition (NER/NNER) in Coptic data. As a result of intensive work by the Coptic Scriptorium team in the past few months, the development branch of the Treebank now contains complete entity spans and types for the entire data in the Treebank, which can be accessed here. Special thanks are due to Lance Martin, Liz Davidson and Mitchell Abrams for all their efforts!

What’s included?

  • All data from the Coptic treebank (78 documents, approx. 46,000 words)
  • All spans of text referring to a named or unnamed entity, such as “Emperor Diocletian”, “the old woman” or “his cell”.
  • Nested entities contained in other entities, such a [the kingdom of [the Emperor Diocletian]]
  • Entity types, divided into the following 10 classes: (English examples are provided in brackets)

 

What do we plan to do with this?

Entity annotations are a gateway to exposing and linking semantic content information from collections of documents. Having such annotations for all of our Coptic data will allow search by entity types (and ultimately names), enable analysis and comparison of texts based on the quantity, proportion and dispersion of entity types, facilitate identification of textual reuse disregarding either the entities involved or the ways in which they are phrased, and much more.

Over the course of the summer, our next goals fall into three packages:

  1. Natural Language Processing (NLP): Develop high-accuracy automatic entity recognition tools for Coptic based on this data, and make them freely available.
  2. Corpora: Enrich all of our available data with automatic entity annotations, which can be corrected and improved iteratively in the future.
  3. Entity linking: Leverage the inventory of named entities identified in the data to carry out named entity linking with resources such as Wikipedia and other DH project identifiers. This will allow users to find all mentions of a specific person or place, regardless of how they are referred to.

Since the tools and annotations are based only on Coptic textual input and subsequent automatic NLP, we envision including search and visualization of entity data for all of our corpora, including ones for which we do not have a translation. This means that data whose content could not be easily deciphered without extensive reading of the original Coptic text will become much more easily discoverable, by exploring entities in which researchers are interested.

Stay tuned for more updates on Coptic entities!

Universal Dependencies 2.6 released!

tree

Check out the new Universal Dependencies (UD) release V2.6! This is the twelfth release of the annotated treebanks at http://universaldependencies.org/.  The project now covers syntactically annotated corpora in 92 languages, including Coptic. The size of the Coptic Treebank is now around 43,000 words, and growing. For the latest version of the Coptic data, see our development branch here: https://github.com/UniversalDependencies/UD_Coptic-Scriptorium/tree/dev. For documentation, see the UD Coptic annotation guidelines.

The inclusion of the Coptic Treebank in the UD dataset means that many standard parsers and other NLP tools trained on all well attested UD languages now support Coptic out-of-the-box, including Stanford NLP’s Stanza and UFAL’s UDPipe. Feel free to try out these libraries for your data! For optimal performance on open domain Coptic text, we still recommend our custom tool-chain, Coptic-NLP, which is highly optimized to Coptic and uses additional resources beyond the treebank. Or try it out online:

Coptic-NLP demo

 

Corpora release 2.6

We are pleased to announce release 2.6 of our corpora! Some exciting new things:

  • Expanded Coptic Old Testament
  • More gold-standard treebanked texts
  • Updated files of Shenoute’s Abraham Our Father and Acephalous Work 22
  • New metadata fields to indicate whether documents have been machine annotated or if an editor has reviewed the machine annotations

Expanded Coptic Old Testament

Our Coptic Old Testament corpus is updated and expanded, with digital text from the our partners at the Digital Edition of the Coptic Old Testament project in Goettingen.  All annotations in this corpora are fully machine-processed (no human editing, because it’s BIG). You can read through all the text in two different visualizations online and search it in the ANNIS database:

  1. analytic: the normalized text segmented into words aligned with part of speech tags; each verse is aligned with Brenton’s English translation of the Septuagint
  2. chapter: the normalized text presented as chapters and verses; each word links to the online Coptic dictionary
  3. ANNIS search: full search of text, lemmas, parts of speech, syntactic annotations, etc. (see our ANNIS tips if you’re new to ANNIS)

Please keep in mind this corpus is fully machine-annotated, and we currently do not have the capacity to make manual changes to a corpus of this size.  If you notice systemic errors (the same thing tagged incorrectly often, for example) please let us know.  Otherwise, please be patient: as the tools improve, we will update this corpus.

We’ve also machine-aligned the text with Brenton’s English translation of the Septuagint. It’s possible there will be some misalignments.  Thanks for your understanding!

Treebanks

We’ve added more documents to our separate gold-standard treebank corpus.  (Want to learn more about treebanks?) In this corpus, the treebank/syntactic annotations have been manually corrected; the documents are part of the Universal Dependencies project for cross-language linguistics research.  New treebanked documents include selections from 1 Corinthians, the Gospel of Mark, Shenoute’s Abraham Our Father, Shenoute’s Acephalous Work 22, and the Martyrdom of Victor.  This means the self-standing treebank corpus is expanded, and any documents we’ve treebanked have updated word segmentation, part of speech tagging, etc., in their regular corpora.

Updated Shenoute Documents

Documents in the corpora for Shenoute’s Abraham Our Father and Acephalous Work 22 have several updates.

First, some documents are in our treebank corpus and are now significantly more accurate in terms of word segmentation, tagging, etc.

Second, we’ve added chapter and verse segmentation to these works.  Since there are no comprehensive print editions of these works with versification, we’ve applied our own chapter and verse numbers.  We recognize that versification is arbitrary, but nonetheless useful for citation.  For texts transcribed from manuscripts, chapter divisions typically occur when an ekthesis occurs in the manuscript. (Ekthesis describes a letter hanging in the margin.)  They do not necessarily occur with each ekthesis (if ekthesis is very frequent), but we try to make the divisions occur only with ekthesis.  Verses typically equal one sentence, sometimes more than one sentence per verse for very short sentences or more than one verse per sentence for very long Shenoutean sentences.

Third, we’ve added “Order” metadata to make it easier to read a work in order if it’s broken into multiple documents.  Check out Abraham Our Father, for example: the first document in the list is the beginning of the work.

Screen shot of list of documents in Abraham Our Father Corpus

Screen shot of list of documents in Abraham Our Father Corpus

When you’re reading through a document, click on “Next” to get the next document in reading order.  (If there are multiple manuscript witnesses to a work, we’ll send you to the next document in order with the fewest lacunae, or missing segments.)

Screen shot of beginning of Abraham Our Father

Screen shot of beginning of Abraham Our Father

Of course, you can always click on documents in any order you want to read however you like!

And everything is fully searchable across all documents in ANNIS.

New Metadata Fields Documenting Annotation

We sometimes get asked: which corpora do scholars annotate and which corpora are machine-annotated?  The answer is complicated — almost everything is machine annotated, with different levels of scholarly review.  So we’re adding three new metadata fields to help show users what kinds of annotation each document get:

  • Segmentation refers to word segmentation (or “tokenization”) performed by the tokenizer tool.
  • Tagging refers to part of speech, language of origin, and lemma tagging performed by our tagger
  • Parsing refers to dependency syntax annotations (which are part of our treebanking)

Each of these fields contains one of the following values:

  • automatic: fully machine annotated; no manual review or corrections to the tool output
  • checked: the tool has annotated the text, and a scholar has reviewed the annotations before publication
  • gold: the tools have been run and the annotations have received thorough review; this value usually applies only to documents that have been treebanked by a scholar (requiring rigorous review of word segmentation and tagging along the way)

For example, in the first image of document metadata visible in ANNIS, the document has automatic parsing; a scholar has checked the word segmentation and tagging.

 

Screenshot of document metadata showing checked word segmentation and tagging

Screenshot of document metadata showing checked word segmentation and tagging

In the next image of document metadata, a scholar has treebanked the text, making segmentation, tagging, and parsing all gold.

Screenshot of document metadata showing gold level annotations

Screenshot of document metadata showing gold level annotations

 

We are rolling out these annotations with each new corpus and newly edited corpus; not every corpus has them, yet — only the ones in this release.  Our New Testament and Old Testament corpora are machine-annotated (automatic) in all annotations.

 

We hope you enjoy!

 

Coptic Treebank 2.2 – moving us to better parsing!

With the data release of Universal Dependencies 2.2, an update to the Coptic Treebank is now online! Thanks to work by Mitchell Abrams and Liz Davidson we’ve been able to add the first three chapters from 1 Corinthians and make numerous corrections. Another three chapters of 1 Corinthians and a portion of the Martyrdom of Victor the General are coming soon. You can see how we’ve been annotating and the documentation of our guidelines here:

http://universaldependencies.org/cop/

Thanks to the new data, automatic parsing has become somewhat more reliable, allowing us to add automatic parses to the most recent release. The results are better than before, but note we still only expect around 90% accuracy. To illustrate where the computer can’t do what humans can, here are two examples of a verb governing a subordinate verb in a clause marked by Ϫⲉ ‘that’. The subordinate verb usually has one of two labels:

  • ccomp if it’s a complement clause (I said that…)
  • advcl if it’s an adverbial clause, such as a causal clause (Ϫⲉ  meaning ‘because’).

One of these examples was done by a human who got things right, the other contains a parser error – see if you can spot which is which!

 

New release of the Coptic Treebank

Coptic Treebank release 2.1, now with three Letters of Besa!

We are pleased to announce the release of the latest version of the Coptic Treebank, now containing three Letters of Besa:

  • On Lack of Food
  • To Aphthonia
  • To Thieving Nuns

This brings the total corpus size up to 10,499 tokens, thanks to annotation work by Elizabeth Davidson and Amir Zeldes, building on earlier transcription and tagging work by Coptic Scriptorium and KELLIA partners. Special thanks are due to So Miyagawa for providing the transcription for On Lack of Food. The corpus will continue to grow as we work to annotate more data and improve the accuracy of our automatic syntax parser for Coptic. You can search the current version of the corpus in ANNIS here:

https://corpling.uis.georgetown.edu/annis/scriptorium

Or download the latest raw annotated data from GitHub here:

https://github.com/universalDependencies/UD_Coptic/tree/dev

Please let us know if you find any errors or have any feedback on the treebank!

New release – Coptic Treebank V2

We are happy to announce the release of version 2 of the Coptic Universal Dependency Treebank. With over 8,500 tokens from 14 documents, the Treebank is the largest syntactically annotated resource in Coptic. The annotation scheme follows the Universal Dependency Guidelines, version 2, and is therefore comparable with UD data from 70 treebanks in 50 languages, including English, Latin, Classical Greek, Arabic, Hebrew and more.

You can search in the Treebank using ANNIS. For example, the following query finds cases of verbs dominating a complement clause (e.g. “say …. that …”):

pos="V" ->dep[func="ccomp"] norm

[Link to this query]