Tag: entities

Winter 2021 Corpora Release 4.1.0

Amulet to protect place and animals. O. Crum ST 18 (KYP T344), side-by-side with its rendition on Coptic Scriptorium. Image source.

We are pleased to announce the latest release of data from Coptic Scriptorium, version 4.1.0. The new release adds new Coptic texts and annotations, highlighted by the application of named and non-named entity annotation to our New Testament corpus.

In total, we released approximately 40,000 tokens of manually edited text in 17 documents from new works, and added material to existing works. The new material, including more digitized data courtesy of the Marcion project, the Kyprianos Magical Text Database, and other scholars, includes:

We are especially excited to announce the first release of several magical papyri and an ostracon on the Coptic Scriptorium platform in collaboration with the Kyprianos team at the University of Würzburg:

  • Magical Texts (Korshi Dosoo, Edward O. D. Love, Markéta Preininger, Lance Martin, Caroline T. Schroeder, and Amir Zeldes)

Expansions and Improvements of existing corpora:

We have extended our semi-automatic entity annotation coverage to encompass our New Testament material (over 248,000 tokens). Entity annotations, like our other annotations, were added to these specific corpora automatically and include:

  • The classification of all non-pronominal references to people, places and other entities into 10 entity categories
  • Entity linking:
    • Linking of all named entities which have corresponding Wikipedia articles to their respective Wikipedia entries, including geo-location information where available

This addition complements the existing named and non-named entity annotations of our entire collection of Coptic corpora.

We would also like to thank individual contributors (whom you can always find in the ‘annotation’ metadata for each document), each of whom put in a colossal amount of work, the Marcion and Kyprianos projects, which shared their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to create more data and tools. Please let us know if you have any feedback!

Summer 2020 Corpora Release 4.0.0

Place name index on data.copticscriptorium.org

It is our great pleasure to announce the latest release of data from Coptic Scriptorium, version 4.0.0. This release contains both new Coptic material and extensive additions to our suite of tools and annotations, focusing on the addition of support for entity annotation and named-entity linking across our new and old datasets. The new material, including more digitized data courtesy of the Marcion project and other scholars, includes:

With this new release, the semi-automatically annotated data (excluding automatically processed Bible materials) in the project covers close to 260,000 words of Sahidic Coptic annotated for entities, including 50,000 words of gold-standard treebanked data with manual syntactic analyses.

In addition to new texts, new tools and analyses have been added to the project:

  • Complete entity annotation, classifying all non-pronominal references to people, places and other entities into 10 entity categories
  • Entity linking:
    • Linking of all named entities which have corresponding Wikipedia articles to their respective Wikipedia entries, including geo-location information where available
    • A browseable index of people and places mentioned in the texts, also linked to Wikipedia and Google Maps and including both real and fictional entities
  • Search and visualization:
    • Search by entity type and named entity in ANNIS
    • New configurable analytic visualization which displays nested entity types, highlights named entities and links them to Wikipedia
  • Natural Language Processing
    • Automatic entity recognition is now available (work by Amir Zeldes, Lance Martin and Sichang Tu)
    • A new neural parser adapted for Coptic with higher-accuracy syntactic analyses, which are deployed in ANNIS (work by Luke Gessler)

The new configurable Analytic Visualization with toggleable entity types and links

This release represents a tremendous amount of work over the past few months by the entire Coptic Scriptorium team. We would also like to thank individual contributors (whom you can always find in the ‘annotation’ metadata for each document), the Marcion and PAThs projects, which shared their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources and new kinds of annotations and tools. Please let us know if you have any feedback!

A bird’s eye view of Coptic entities

Coptic Scriptorium recently annotated its Treebank for entities and will soon use automated tools to annotate all corpora. Entity recognition provides a window into what a text discusses, allowing readers to discover information about people and places of interest found throughout a large number of texts that they could not possibly read exhaustively. The Coptic Scriptorium team has developed a number of tools to visualize and search for entities, which you can browse here:

https://copticscriptorium.org/entities/breakdown.html

Already, we are seeing some interesting trends. Let’s take a closer look!

TreeMapping

Entities are divided into two broad categories—named and non-named. Named entities are headed by a proper noun, e.g., “Apa Pamoun” or “Scetis.” Non-named entities, which constitute the majority of annotations, are headed by a common noun, e.g., “the monk” or “the monastery.” All entities, whether named or non-named, have one of ten entity types, such as ‘person’, ‘place’ and more (see our previous post). In the image below, we see the TreeMap of unnamed places. With the nested view of data such as this, one can easily see patterns that may be missed when viewing the information in another format.

A TreeMap of unnamed places

Let’s look at the TreeMap data for non-named place entities. The desert holds an unparalleled place (no pun intended) in Coptic literature, but what exactly do Coptic texts say about it? One click of the mouse shows all eighteen mentions of entities headed by ϫⲁⲓⲉ ‘desert’ (see image below). We can see every instance of the word on the same screen and compare usages. Another search does the same for all references headed by the Greek word ⲉⲣⲏⲙⲟⲥ ‘desert’ (that is, excluding adjectival usages). If you want to continue this line of inquiry and read every single instance of ‘desert’ in its larger context, a search for these entities in ANNIS (this function is coming soon) would display every mention in the Coptic corpora, allowing one to quickly see the texts in which these words appear and how they are used.

A TreeMap of ϫⲁⲓⲉ ‘desert’

Entity Term Networking

Entity Term Networks provide a graphic visualization of an entity’s relationships with other words in its span. As an example, let’s look at ⲙⲁ ‘place.’ From the outset, we see that ⲙⲁ is used with a wide variety of determiners and is followed by an even wider variety of constructions, but we simultaneously see that attributive adjectives, such as ⲙⲁ ⲛϣ(ⲱ)ⲱⲡⲉ ‘dwelling place, monk’s cell,’ are more commonly used with ⲙⲁ than relative or genitive constructions. The entity network for ⲙⲁ gives us a clearer idea of its potential semantic relationships: almost always followed by ⲛ ‘of’, continuing to nouns indicating purpose (place of dwelling, lavatory with ⲣⲙⲏ ‘urination’), events (ϣⲉⲗⲉⲉⲧ ‘wedding’), directions (ϣⲁ ‘East’) and more. Try pulling up the network for other Coptic nouns yourself! As with the TreeMap, the network presents a large amount of data in a small space, revealing patterns and their relative frequency more readily.

Lexical network of words in entities headed by ⲙⲁ ‘place’
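To make the idea of a term network concrete, here is a minimal Python sketch that counts how often each word co-occurs with a given head word inside entity spans. It is only an illustration, not the code behind the visualization: the function name `term_network`, the use of English glosses instead of Coptic tokens, and the toy spans are all assumptions made for this example. In the real data the head is determined by the syntactic annotation, whereas here the caller simply supplies it.

```python
from collections import Counter

def term_network(entity_spans, head):
    """Count co-occurrences between `head` and the other words in each span.

    `entity_spans` is a list of token lists, one per entity mention.
    Returns a Counter keyed by (head, other_word) edges.
    """
    edges = Counter()
    for words in entity_spans:
        if head not in words:
            continue  # this span is not headed by the word of interest
        for word in words:
            if word != head:
                edges[(head, word)] += 1
    return edges

# Toy spans using English glosses (hypothetical, for illustration only)
toy_spans = [
    ["the", "place", "of", "dwelling"],
    ["a", "place", "of", "wedding"],
    ["the", "monk"],
]

net = term_network(toy_spans, "place")
for (head, other), weight in net.most_common():
    print(f"{head} -- {other}: {weight}")
```

The edge weights correspond to what the visualization expresses graphically: frequent companions of the head word receive heavier edges.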

Entity Type Proportions

Entity proportions compare entity types among the corpora, visualizing them with a ratio. An average ratio is provided for all Coptic corpora and for a sample of English fiction, so viewers can see how far any given corpus departs from either baseline. The chart below sets the ratio of animals and people side by side. If you are interested in late-antique animals, you may be a little disappointed: they appear only sparsely in the corpora. Any other combination of entity types can be juxtaposed in the same way. After looking through the data, it is clear that the Coptic average has a consistently higher ratio of abstract entities than its English fiction counterpart, perhaps reflecting the monastic origin of many of the corpora.

Entity Type Proportion of Animals and People
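The comparison against a baseline can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the project's implementation: the function names, the way the baseline is supplied, and all counts below are invented for the example, and the exact way the site computes its averages is not described here.

```python
from collections import Counter

def type_proportions(entity_types):
    """Return the share of each entity type in a list of type labels."""
    counts = Counter(entity_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {etype: n / total for etype, n in counts.items()}

def departure_from_baseline(corpus_props, baseline_props):
    """Difference (corpus minus baseline) for every type seen in either."""
    types = set(corpus_props) | set(baseline_props)
    return {
        t: corpus_props.get(t, 0.0) - baseline_props.get(t, 0.0)
        for t in types
    }

# Made-up labels for one corpus and a made-up baseline (illustration only)
corpus_labels = ["person"] * 6 + ["animal"] * 1 + ["abstract"] * 3
baseline = {"person": 0.5, "animal": 0.2, "abstract": 0.3}

props = type_proportions(corpus_labels)
diff = departure_from_baseline(props, baseline)
for etype in sorted(diff):
    print(f"{etype}: {diff[etype]:+.2f}")
```

A positive value means the type is over-represented in the corpus relative to the baseline, a negative value that it is under-represented.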

Named/Non-Named by Corpus

The last visualization compares the ratio of named to non-named entities in each corpus. Once again, there is much variation between individual works, including those of the same genre (cf. The Life of Cyrus and The Life of Onnophrius), but differences in this ratio may indicate where differences in content lie, pointing the way toward further research: this surprising difference between saints’ lives may merit more attention.

What’s Next?

Entity annotations make detailed philological, literary, and historical inquiries across a large number of documents possible by enabling analysis of texts based on the quantity, proportion and dispersion of entity types. They allow us to describe texts on a level of ‘who did what to whom’ and abstract away from individual ways of phrasing references to people and places. We’re looking forward to releasing more tools and data for working with Coptic entities!

Entities in the Coptic Treebank


With the release of Version 2.6 of Universal Dependencies, our focus has shifted to handling Named and Non-Named Entity Recognition (NER/NNER) in Coptic data. As a result of intensive work by the Coptic Scriptorium team in the past few months, the development branch of the Treebank now contains complete entity spans and types for all of the data in the Treebank, which can be accessed here. Special thanks are due to Lance Martin, Liz Davidson and Mitchell Abrams for all their efforts!

What’s included?

  • All data from the Coptic treebank (78 documents, approx. 46,000 words)
  • All spans of text referring to a named or unnamed entity, such as “Emperor Diocletian”, “the old woman” or “his cell”.
  • Nested entities contained in other entities, such as [the kingdom of [the Emperor Diocletian]]
  • Entity types, divided into the following 10 classes: (English examples are provided in brackets)

 

What do we plan to do with this?

Entity annotations are a gateway to exposing and linking semantic content information from collections of documents. Having such annotations for all of our Coptic data will allow search by entity types (and ultimately names), enable analysis and comparison of texts based on the quantity, proportion and dispersion of entity types, facilitate identification of textual reuse disregarding either the entities involved or the ways in which they are phrased, and much more.
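As a small illustration of what search by entity type over nested spans could look like, the Python sketch below stores the example [the kingdom of [the Emperor Diocletian]] as two typed spans over one token list. The class `EntitySpan`, the helper functions and the type labels assigned to each span are assumptions made for this example. They do not reflect the Treebank's actual file format or annotation decisions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EntitySpan:
    start: int   # index of the first token in the span
    end: int     # index one past the last token
    etype: str   # entity type label (illustrative values only)

tokens = "the kingdom of the Emperor Diocletian".split()

# Hypothetical annotations; the type labels are chosen for illustration
spans = [
    EntitySpan(0, 6, "place"),    # [the kingdom of [the Emperor Diocletian]]
    EntitySpan(3, 6, "person"),   # [the Emperor Diocletian]
]

def search_by_type(spans, etype):
    """Return all spans carrying the given entity type."""
    return [s for s in spans if s.etype == etype]

def nested_inside(outer, spans):
    """Return the spans fully contained in `outer`, excluding `outer` itself."""
    return [
        s for s in spans
        if s != outer and outer.start <= s.start and s.end <= outer.end
    ]

def span_text(span):
    """Recover the surface text covered by a span."""
    return " ".join(tokens[span.start:span.end])

for hit in search_by_type(spans, "person"):
    print(span_text(hit))
```

Because each span records only token offsets and a type, the same structure supports counting types per document as well as looking up which entities are embedded in others.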

Over the course of the summer, our next goals fall into three packages:

  1. Natural Language Processing (NLP): Develop high-accuracy automatic entity recognition tools for Coptic based on this data, and make them freely available.
  2. Corpora: Enrich all of our available data with automatic entity annotations, which can be corrected and improved iteratively in the future.
  3. Entity linking: Leverage the inventory of named entities identified in the data to carry out named entity linking with resources such as Wikipedia and other DH project identifiers. This will allow users to find all mentions of a specific person or place, regardless of how they are referred to.
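The third goal can be pictured as building an index from a shared identifier to every mention linked to it. The sketch below is only a toy illustration: the document names, the mention tuples and the use of a bare string as the identifier are all hypothetical, and no particular linking resource or format is implied.

```python
from collections import defaultdict

# (document, mention text, linked identifier or None if unlinked)
# All values are invented for illustration.
mentions = [
    ("doc_a", "the Emperor Diocletian", "Diocletian"),
    ("doc_a", "Emperor Diocletian", "Diocletian"),
    ("doc_b", "Diocletian", "Diocletian"),
    ("doc_b", "the old woman", None),
]

def index_by_identifier(mentions):
    """Group linked mentions by identifier, ignoring unlinked ones."""
    index = defaultdict(list)
    for doc, text, ident in mentions:
        if ident is not None:
            index[ident].append((doc, text))
    return dict(index)

index = index_by_identifier(mentions)
for doc, text in index["Diocletian"]:
    print(doc, text)
```

Looking up one identifier then returns every linked mention across documents, however each one is phrased.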

Since the tools and annotations are based only on Coptic textual input and subsequent automatic NLP, we envision including search and visualization of entity data for all of our corpora, including ones for which we do not have a translation. This means that data whose content could not be easily deciphered without extensive reading of the original Coptic text will become much more easily discoverable, by exploring entities in which researchers are interested.

Stay tuned for more updates on Coptic entities!