
Digital Coptic 3 – program online!

The program for the third edition of Digital Coptic is now online. Check out the workshop website for the list of projects, talks and presenters.

Please join us for the workshop on July 12 and 13 – participants will receive a Zoom link and password for interactive presentations and discussion, and the workshop will also be cast to YouTube for larger audiences and offline viewing after the workshop. We look forward to hearing all the talks!

A bird’s eye view of Coptic entities

Coptic Scriptorium recently annotated its Treebank for entities and will soon use automated tools to annotate all corpora. Entity recognition provides a window into what a text discusses, allowing readers to discover information about people and places of interest found throughout a large number of texts that they could not possibly read exhaustively. The Coptic Scriptorium team has developed a number of tools to visualize and search for entities, which you can browse here:

https://copticscriptorium.org/entities/breakdown.html

Already, we are seeing some interesting trends. Let’s take a closer look!

TreeMapping

Entities are divided into two broad categories—named and non-named. Named entities are headed by a proper noun, e.g., “Apa Pamoun” or “Scetis.” Non-named entities, which constitute the majority of annotations, are headed by a common noun, e.g., “the monk” or “the monastery.” All entities, whether named or non-named, have one of ten entity types, such as ‘person’, ‘place’ and more (see our previous post). In the image below, we see the TreeMap of unnamed places. With the nested view of data such as this, one can easily see patterns that may be missed when viewing the information in another format.

A TreeMap of unnamed places

Let’s look at the TreeMap data for non-named place entities. The desert holds an unparalleled place (no pun intended) in Coptic literature, but what exactly do Coptic texts say about it? One click of the mouse shows all eighteen mentions of entities headed by ϫⲁⲓⲉ ‘desert’ (see image below). We can see every instance of the word on the same screen and compare usages. Another search does the same for all entities headed by the Greek word ⲉⲣⲏⲙⲟⲥ ‘desert’ (adjectival usages are not included). If you want to continue this line of inquiry and read every single instance of ‘desert’ in its larger context, a search for these entities in ANNIS (this function is coming soon) will display every mention in the Coptic corpora, allowing you to quickly see the texts in which these words appear and how they are used.

A TreeMap of ϫⲁⲓⲉ ‘desert’

Entity Term Networking

Entity Term Networks provide a graphic visualization of an entity’s relationships with other words in its span. For an example, let’s look at ⲙⲁ ‘place.’ From the outset, we see that ⲙⲁ is used with a wide variety of determiners and is followed by an even wider variety of constructions, but we simultaneously see that attributive adjectives, such as ⲙⲁ ⲛϣ(ⲱ)ⲱⲡⲉ ‘dwelling place, monk’s cell,’ are more commonly used with ⲙⲁ than relative or genitive constructions. The entity network for ⲙⲁ gives us a clearer idea of its potential semantic relationships: almost always followed by ⲛ ‘of’, continuing to nouns indicating purpose (place of dwelling, lavatory with ⲣⲙⲏ ‘urination’), events (ϣⲉⲗⲉⲉⲧ ‘wedding’), directions (ϣⲁ ‘East’) and more. Try pulling up the network for other Coptic nouns by yourself! As with the TreeMap, the network presents a large amount of data in a small space, revealing patterns and their relative frequency more readily.

Lexical network of words in entities headed by ⲙⲁ ‘place’

Entity Type Proportions

Entity proportions compare entity types among the corpora, visualizing them with a ratio. An average ratio is provided for all Coptic corpora and for a sample of English fiction, so viewers can see how far any given corpus departs from either baseline. The chart below sets the ratio of animals and people side by side. If you are interested in late-antique animals, you may be a little disappointed: they appear only sparsely in the corpora. Any other pair of entity types can be juxtaposed in the same way. Looking through the data, it is clear that the Coptic average has a consistently higher ratio of abstract entities than its English fiction counterpart, perhaps reflecting the monastic origin of many of the corpora.

Entity Type Proportion of Animals and People

Named/Non-Named by Corpus

The last visualization compares the ratio of named to non-named entities in each corpus. Once again, there is much variation between individual works, including works of the same genre (cf. The Life of Cyrus and The Life of Onnophrius). Such differences in ratio may indicate where differences in content lie and point the way toward further research: this surprising difference between two saints’ lives may merit more attention.

What’s Next?

Entity annotations make detailed philological, literary, and historical inquiries across a large number of documents possible by enabling analysis of texts based on the quantity, proportion and dispersion of entity types. They allow us to describe texts on a level of ‘who did what to whom’ and abstract away from individual ways of phrasing references to people and places. We’re looking forward to releasing more tools and data for working with Coptic entities!

Entities in the Coptic Treebank


With the release of Version 2.6 of Universal Dependencies, our focus has shifted to handling Named and Non-Named Entity Recognition (NER/NNER) in Coptic data. As a result of intensive work by the Coptic Scriptorium team in the past few months, the development branch of the Treebank now contains complete entity spans and types for all of the data in the Treebank, which can be accessed here. Special thanks are due to Lance Martin, Liz Davidson and Mitchell Abrams for all their efforts!

What’s included?

  • All data from the Coptic treebank (78 documents, approx. 46,000 words)
  • All spans of text referring to a named or unnamed entity, such as “Emperor Diocletian”, “the old woman” or “his cell”.
  • Nested entities contained in other entities, such as [the kingdom of [the Emperor Diocletian]]
  • Entity types, divided into the following 10 classes: (English examples are provided in brackets)

 

What do we plan to do with this?

Entity annotations are a gateway to exposing and linking semantic content information from collections of documents. Having such annotations for all of our Coptic data will allow search by entity types (and ultimately names), enable analysis and comparison of texts based on the quantity, proportion and dispersion of entity types, facilitate identification of textual reuse disregarding either the entities involved or the ways in which they are phrased, and much more.

Over the course of the summer, our next goals fall into three packages:

  1. Natural Language Processing (NLP): Develop high-accuracy automatic entity recognition tools for Coptic based on this data, and make them freely available.
  2. Corpora: Enrich all of our available data with automatic entity annotations, which can be corrected and improved iteratively in the future.
  3. Entity linking: Leverage the inventory of named entities identified in the data to carry out named entity linking with resources such as Wikipedia and other DH project identifiers. This will allow users to find all mentions of a specific person or place, regardless of how they are referred to.

Since the tools and annotations are based only on Coptic textual input and subsequent automatic NLP, we envision including search and visualization of entity data for all of our corpora, including ones for which we do not have a translation. This means that data whose content could not be easily deciphered without extensive reading of the original Coptic text will become much more easily discoverable, by exploring entities in which researchers are interested.

Stay tuned for more updates on Coptic entities!

Universal Dependencies 2.6 released!


Check out the new Universal Dependencies (UD) release V2.6! This is the twelfth release of the annotated treebanks at http://universaldependencies.org/. The project now covers syntactically annotated corpora in 92 languages, including Coptic. The size of the Coptic Treebank is now around 43,000 words, and growing. For the latest version of the Coptic data, see our development branch here: https://github.com/UniversalDependencies/UD_Coptic-Scriptorium/tree/dev. For documentation, see the UD Coptic annotation guidelines.

The inclusion of the Coptic Treebank in the UD dataset means that many standard parsers and other NLP tools trained on all well-attested UD languages now support Coptic out of the box, including Stanford NLP’s Stanza and UFAL’s UDPipe. Feel free to try out these libraries on your data! For optimal performance on open-domain Coptic text, we still recommend our custom tool-chain, Coptic-NLP, which is highly optimized for Coptic and uses additional resources beyond the treebank. Or try it out online:

Coptic-NLP demo

 

Winter 2020 Corpora Release 3.1.0

It is our pleasure to announce a new data release, with a variety of new sources from our collaborators (including more digitized data courtesy of the Marcion and PAThs projects and other scholars). New in this release are:

All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold).

You can search all corpora using ANNIS and download the data in 4 formats (relANNIS database files, PAULA XML files, TEI XML files, and SGML files in Tree-tagger format): browse on GitHub. If you just want to read works, cite project data or browse metadata, you can use our updated repository browser, the Canonical Text Services browser and URN resolver:

http://data.copticscriptorium.org/

The new material in this release includes some 78,000 tokens in 33 documents and represents a tremendous amount of work by our project members and collaborators. We would like to thank the individual contributors (whom you can find in the ‘annotation’ metadata), the Marcion and PAThs projects, who shared their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources and new kinds of annotations and tools, which we plan to make available in the summer. Please let us know if you have any feedback!

The Coptic Dictionary Online wins the 2019 DH Award for Best Tool

We are very happy to announce that the Coptic Dictionary Online (CDO) has won the 2019 Digital Humanities Award in the category Best Tool or Suite of Tools! The dictionary interface, shown below, gives users access to searches by Coptic word forms, definitions in three languages (English, French and German), pattern and part of speech searches, and more. We have also added links to quantitative corpus data, including attestation and collocation analyses from data published by Coptic Scriptorium.

The dictionary is the result of a collaboration between Coptic Scriptorium and lexicographers in Germany at the Berlin-Brandenburg and Göttingen Academies of Science, the Free University in Berlin, and the Universities of Göttingen and Leipzig. This collaboration has been funded by the National Endowment for the Humanities (NEH) and the German Research Foundation (DFG).

To read more about the dictionary’s structure and creation, see Feder et al. (2018).

A view of the Coptic Dictionary Online

Fall 2019 Corpora Release 3.0.0

Coptic Scriptorium is happy to announce our latest data release, including a variety of new sources thanks to our collaborators (digitized data courtesy of the Marcion and PAThs projects!). New in this release are:

  • Saints’ lives
    • Life of Cyrus
    • Life of Onnophrius
    • Lives of Longinus and Lucius
    • Martyrdom of Victor the General (part 2)
  • Miscellaneous
    • Dormition of John
    • Homilies of Proclus
    • Letter of Pseudo-Ephrem

We are also releasing expansions to some of our existing corpora, including:

  • Canons of Johannes (new material annotated by Elizabeth Platte and Caroline T. Schroeder, digital edition provided by Diliana Atanassova)
  • Apophthegmata Patrum
  • A large number of corrections to most of our existing corpora, which are being republished in this release.

All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold).

You can search all corpora using ANNIS and download the data in 4 formats (relANNIS database files, PAULA XML files, TEI XML files, and SGML files in Tree-tagger format): browse on GitHub. If you just want to read works, cite project data or browse metadata, you can use our updated repository browser, the Canonical Text Services browser and URN resolver:

http://data.copticscriptorium.org/

Our annotated corpora now total over 850,000 words; corpora whose machine annotations have been reviewed by human editors now total over 150,000 words!

We would like to thank Marcion, PAThs and the National Endowment for the Humanities for supporting us – we hope this release will be useful and are already working on more!

New release of Natural Language Processing Tools

Amir Zeldes and Luke Gessler have spent much of the past summer improving Coptic Scriptorium’s Natural Language Processing tools, and are now happy to announce the release of Coptic-NLP V3.0.0. You can read more about what we’ve been doing and the impact on performance in our three-part blog post (part 1, part 2, part 3). Some of the new improvements include:

  • A new three-step normalization framework, which allows us to hypothetically normalize bound groups before deciding how to segment them, then normalize each segment again
  • A smart rebinding module which decides from context whether to merge split bound groups (useful for processing messy texts with line breaks mid-word, or other segmentation anomalies)
  • A re-implemented segmentation algorithm which is considerably better at handling spelling variation and ambiguous groups in context (e.g. “nau” in “peja|f na|u” vs. “nau ero|f”)
  • A brand new, more accurate part of speech tagger
  • Higher accuracy across tools thanks to hyperparameter optimization
  • More robust test suite to ensure new errors don’t creep in
  • Various data/lexicon/ruleset improvements and bugfixes

You can download the latest version of the tools here:

https://github.com/CopticScriptorium/coptic-nlp/

Or use our web interface, which has been updated with the latest version:

https://corpling.uis.georgetown.edu/coptic-nlp/

We appreciate your feedback and comments, and hope to release more data processed with these tools very soon!

Dealing with Heterogeneous Low Resource Data – Part III

(This post is part of a series on our 2019 summer’s work improving processing for non-standardized Coptic resources)

In this post, we present some of our work on integrating more ambitious automatic normalization tools that allow us to deal with heterogeneous spelling in Coptic, and give some first numbers on improvements in accuracy through this summer’s work.

Three-step normalization

In 2018, our normalization strategy was a basic statistical one: to look up previously normalized forms in our data and choose the most frequent normalization. Because there are some frequent spelling variations, we also had a rule based system postprocess the statistical normalizer’s output, to expand, for example, common spellings of nomina sacra (e.g. ⲭⲥ for ⲭⲣⲓⲥⲧⲟⲥ ‘Christ’), even when they appeared as part of larger bound groups (ⲙⲡⲉⲭⲣⲓⲥⲧⲟⲥ ‘of the Christ’, sometimes spelled ⲙⲡⲉⲭⲥ).
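As a rough illustration of this kind of lookup-plus-rules strategy, here is a minimal Python sketch. It is not the Coptic-NLP implementation: the training pairs, the `NOMINA_SACRA` table and the function names are all hypothetical, and the rule step is reduced to expanding an abbreviation at the end of a bound group.

```python
from collections import Counter, defaultdict

# Hypothetical training pairs: (spelling found in an edition, normalized form).
TRAINING_PAIRS = [
    ("ⲏⲉⲓ", "ⲏⲓ"),
    ("ⲏⲉⲓ", "ⲏⲓ"),
    ("ⲏⲉⲓ", "ⲏⲉⲓ"),
]

# Hypothetical rule table: abbreviated nomina sacra and their expansions.
NOMINA_SACRA = {"ⲭⲥ": "ⲭⲣⲓⲥⲧⲟⲥ"}


def build_lookup(pairs):
    """Map each observed spelling to its most frequent normalization."""
    counts = defaultdict(Counter)
    for observed, normalized in pairs:
        counts[observed][normalized] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in counts.items()}


def expand_nomina_sacra(form):
    """Rule-based step: expand an abbreviation at the end of a bound group."""
    for abbreviation, expansion in NOMINA_SACRA.items():
        if form.endswith(abbreviation):
            return form[: -len(abbreviation)] + expansion
    return form


def normalize(form, lookup):
    """Statistical lookup first, then rule-based post-processing."""
    candidate = lookup.get(form, form)  # unseen forms pass through unchanged
    return expand_nomina_sacra(candidate)


lookup = build_lookup(TRAINING_PAIRS)
house = normalize("ⲏⲉⲓ", lookup)
of_the_christ = normalize("ⲙⲡⲉⲭⲥ", lookup)
```

In this toy setup the unseen bound group ⲙⲡⲉⲭⲥ falls through the lookup and is handled by the rule alone, which mirrors the division of labor described above.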

One of the problems with this strategy is that for many individual words, we might know common normalizations, such as spelling ⲏⲉⲓ for ⲏⲓ ‘house’, but recognizing that normalization should be carried out depends on correct segmentation – if the system sees ⲙⲡⲏⲉⲓ ‘of the house’ it may not be certain that normalization should occur. Paradoxically, correct normalization vastly improves segmentation accuracy, which is needed for normalization… resulting in a vicious circle.

To address the challenges of normalizing Coptic orthography, this summer we developed a three-level process:

  • We consider hypothetical normalizations which could be applied to bound groups if we spelled certain words together, then choose what to spell together (see Part II of this post series)
  • We consider normalizations for the bound groups we ended up choosing, based on past experience (lookup), rules (finite-state morphology) and machine learning (feature based prediction)
  • After segmenting bound groups into morphological categories, we consider whether the segmented sequence contains smaller units that should be normalized

To illustrate how this works, we can consider the following example:

Coptic edition:            ⲙ̅ⲡ     ⲉⲓⲙ̅ⲡϣⲁ

Romanized:                mp    ei|mpša

Gloss:                          didn’t  I-worthy

Translation:                “I was not worthy”

These words should be spelled together by our conventions, based on Layton’s (2011) grammar, but Budge’s (1914) edition has placed a space here and the first person marker is irregularly spelled epsilon-iota, ⲉⲓ ‘I’, instead of just iota, ⲓ. When resolving whitespace ambiguity, we ask how likely it is that mp stands alone (unlikely), but also whether mp+ei… is grammatical, which in the current spelling might not be recognized. Our normalizer needs to resolve the hypothetical fused group to be spelled ⲙⲡⲓⲙⲡϣⲁ, mpimpša. Since this particular form has not appeared before in our corpora, we rely on data augmentation: our system internally generates variant spellings, for example substituting the common spelling variation of ⲓ with ⲉⲓ in words we have seen before, and generating a solution ⲙⲡⲉⲓⲙⲡϣⲁ -> ⲙⲡⲓⲙⲡϣⲁ. The augmentation system relies both on previously seen forms (a normally spelled ⲙⲡⲓⲙⲡϣⲁ, which we have however also not seen before) and combinations produced by a grammar (it considers the negative past auxiliary ⲙⲡ followed by all subject pronouns and verbs in our lexicon, which does yield the necessary ⲙⲡⲓⲙⲡϣⲁ).
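The augmentation idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny lexicon, the single ⲓ/ⲉⲓ substitution and the function names are hypothetical, and the real system combines many more variation patterns with a proper grammar.

```python
def generate_forms(auxiliary, pronouns, verbs):
    """Toy grammar: auxiliary + subject pronoun + verb, spelled as one bound group."""
    return [auxiliary + pronoun + verb for pronoun in pronouns for verb in verbs]


def spelling_variants(form, standard="ⲓ", variant="ⲉⲓ"):
    """Replace each occurrence of the standard spelling with the variant, one at a time."""
    variants = []
    start = 0
    while True:
        index = form.find(standard, start)
        if index == -1:
            break
        variants.append(form[:index] + variant + form[index + len(standard):])
        start = index + len(standard)
    return variants


def build_augmented_lookup(forms):
    """Map every generated variant spelling back to its standard form."""
    lookup = {}
    for form in forms:
        for variant in spelling_variants(form):
            lookup.setdefault(variant, form)
    return lookup


# Hypothetical mini-lexicon: negative past auxiliary, three subject pronouns, one verb.
forms = generate_forms("ⲙⲡ", ["ⲓ", "ⲕ", "ϥ"], ["ⲙⲡϣⲁ"])
augmented = build_augmented_lookup(forms)
normalized = augmented.get("ⲙⲡⲉⲓⲙⲡϣⲁ")
```

Here the grammar produces ⲙⲡⲓⲙⲡϣⲁ even though it was never observed, and the substitution step then links the irregular spelling ⲙⲡⲉⲓⲙⲡϣⲁ to it.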

The segmenter is then able to successfully segment this into mp|i|mpša, and we therefore decide:

  1. This should be fused
  2. This can be segmented into three segments
  3. The middle segment is the first person pronoun (with non-standard spelling)
  4. It should be normalized (and subsequently tagged and lemmatized)

Even if normalization had failed for the whole word group, there would still have been a chance that the machine learning segmenter recognized mpša ‘worthy’ and split it apart, which means that segmentation is slightly less impacted by normalization errors than it would have been in our tools a year ago.

How big of a deal is this?

It’s hard to give an idea of what each improvement like this does for the quality of our data, but we’ll try to give some numbers and contextualize them here. The table below shows an evaluation on the same training and test data: in-domain data comes from UD Coptic Test 2.4, and out-of-domain data represents two texts from editions by W. Budge: the Life of Cyrus and the Repose of John the Evangelist, previously digitized by the Marcion project. The distinction between in-domain and out-of-domain is important here, as in-domain evaluation gives the tools test data that comes from the same distribution of text types the tools are trained on, and is consequently much less surprising. Out-of-domain data comes from text types the system has not seen before, edited with very different editorial practices.

 

                2018                               2019
task            in domain        out of domain     in domain        out of domain
spaces          NA*              96.57             NA*              98.08
orthography     98.81            95.79             99.76            97.83
segmentation    97.78 (96.86**)  93.67 (92.28**)   99.54 (99.36**)  96.71 (96.25**)
tagging         96.23            96.34             98.35            98.11

 

* In-domain data has canonical whitespace and cannot be used to test whitespace normalization

** Numbers in brackets represent automatically normalized input (more realistic, but harder to judge performance of segmentation as an isolated task)

 

The numbers show that the tools have improved dramatically across the board, even for in-domain data – most noticeably the part of speech tagger and normalizer modules. The improvement in segmentation accuracy is much more marked on out-of-domain data, and especially if we do not give the segmenter normalized text as input (numbers in brackets). In this case, automatic normalization is used, and the improvements in normalization cascade into better segmentation as well.

Qualitatively these differences are transformative for Coptic Scriptorium: the ability to handle out-of-domain data with good accuracy is essential for making available the large amounts of digitized text that come from earlier digitization efforts and partner projects. Although the 2018 accuracies on the left may look alright, error rates have fallen by more than half in some cases (7.72% down to 3.75% in realistic out-of-domain segmentation). Additionally, the errors that have been eliminated are qualitatively different: the bulk of accuracy up to 90% represents easy cases, such as tagging and segmenting function words (e.g. prepositions) correctly. The last few percent represent the hard cases, such as unknown proper names, unusual verb forms and grammatical constructions, loan words, and other high-interest items.
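The error-rate arithmetic behind that claim is easy to check. The short Python snippet below (the function name is ours, chosen for illustration) converts the two out-of-domain segmentation accuracies on automatically normalized input, 92.28 and 96.25, into error rates and computes the relative reduction.

```python
def relative_error_reduction(accuracy_before, accuracy_after):
    """Share of the original errors that disappeared, given accuracies in percent."""
    error_before = 100.0 - accuracy_before
    error_after = 100.0 - accuracy_after
    return (error_before - error_after) / error_before


# Out-of-domain segmentation on automatically normalized input, 2018 vs. 2019.
reduction = relative_error_reduction(92.28, 96.25)
print(f"relative error reduction: {reduction:.1%}")
```

An accuracy gain of about four points therefore corresponds to removing slightly more than half of the remaining errors.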

You can get the latest release of the Coptic-NLP tools (V3.0.0) here. We plan to release new data processed with these tools soon – stay tuned!

Dealing with Heterogeneous Low Resource Data – Part II

(This post is part of a series on our 2019 summer’s work improving processing for non-standardized Coptic resources)

The first step in processing heterogeneous data in Coptic is deciding when words should be spelled together. As we described in part I, this is a problem because there are no spaces in original Coptic manuscripts, and editorial standards for how to segment words have varied through the centuries.

Moreover, even within a single edition, there may be inconsistencies in word segmentation. Take, for instance, the word ⲉⲃⲟⲗ (ebol), ‘out, outward’. This word is historically a combination of the preposition ⲉ (e), ‘to, towards’, and the noun ⲃⲟⲗ (bol), ‘outside, exterior’. In some editions, such as those by W. Budge, it is variously spelled as either one (ebol) or two (e bol) words, as in this example from the Asketikon of Apa Ephraim:

ⲡⲃⲱⲗ ⲉⲃⲟⲗ ⲉ ⲡⲉϩⲟⲩⲟ ϩⲛ̅ ⲟⲩϩⲓⲛⲏⲃ ⲉϥϩⲟⲣϣ̅
pbōl ebol e pehouo hn ouhineb efhorš
`<translation>`

ⲙⲏ ⲙ̅ⲡ ϥ̅ⲃⲱⲗ ⲉ ⲃⲟⲗ ⲛ̅ ⲛⲉⲥⲛⲁⲩϩ
mē mp fbōl e bol n nesnauh
`<translation>`

(Lines 12.2, 15.25 in Asketikon of Apa Ephraim. Transcribed by the Marcion project, based on W. Budge’s (1914) edition.)

Up until 2017, we had no automatic tools to ensure consistent word separation, and up until recently we used only a simple approach based on the relative probability of being spelled apart: words that were attached to the following word more than 90% of the time in our existing data were always attached to the following word. For instance, across 4470 occurrences, the word ⲉ (e) was attached to the following word ~92% of the time. That is above 90%, so our simple system would always attach an ⲉ (e) to the following word, regardless of context. This approach deals effectively with common cases such as prepositions, but it cannot handle more complex cases, e.g. where identically spelled words behave differently, or where a word has never been seen before.
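A threshold rule of this kind could look roughly like the Python sketch below. The observation counts are invented to reproduce the ~92% figure, and the function names and data layout are assumptions made for illustration rather than the code we actually ran.

```python
from collections import Counter


def attach_rates(observations):
    """observations: (word, was_attached_to_next_word) pairs from normalized data."""
    attached = Counter()
    total = Counter()
    for word, was_attached in observations:
        total[word] += 1
        if was_attached:
            attached[word] += 1
    return {word: attached[word] / total[word] for word in total}


def join_tokens(tokens, rates, threshold=0.90):
    """Attach a token to the following one whenever its attach rate exceeds the threshold."""
    output = []
    pending = ""
    for token in tokens:
        pending += token
        if rates.get(token, 0.0) > threshold:
            continue  # keep accumulating: this token is glued to whatever follows
        output.append(pending)
        pending = ""
    if pending:  # a trailing token with nothing left to attach to
        output.append(pending)
    return output


# Invented counts: ⲉ attached 92 times out of 100, ⲃⲟⲗ never attached.
observations = [("ⲉ", True)] * 92 + [("ⲉ", False)] * 8 + [("ⲃⲟⲗ", False)] * 10
rates = attach_rates(observations)
joined = join_tokens(["ⲉ", "ⲃⲟⲗ"], rates)
```

Because the decision depends only on the token itself, the rule cannot tell apart identically spelled words that behave differently, and unseen words default to an attach rate of zero.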

In summer of 2019, we set out to develop new machine learning tools for solving this whitespace normalization problem. We first considered the most obvious way to frame this problem, as a sequence to sequence (seq2seq) prediction problem: given a sequence of Coptic characters, predict another sequence of Coptic characters, hopefully with spaces inserted in the right places.

The problem is that seq2seq models require a lot of annotated data, much more than we had on hand. At the time, we only had on the order of tens of thousands of words’ worth of hand-normalized text from the type of edition shown in the example. We found that this was much too little data for any usual seq2seq model, like an LSTM (a Long Short-Term Memory neural network).

The key to progress was to observe that in most editions, the difference in whitespace was that there were too many spaces: it almost never happened that two words were spelled together that should have been apart. That left just the case where two words were spelled apart that should have been put together.

The question was now, simply, “for each whitespace that did occur in the edition, should we delete it, or should we keep it?” This is a simple binary classification task, which makes the task we’re asking of the computer much less demanding: instead of asking it to produce a stream of characters, we are asking it for a simple yes/no judgment.
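To make the framing concrete, the following Python sketch treats a line as tokens separated by spaces and applies one keep/delete decision per space. The decisions are written by hand here and simply stand in for classifier output; the function name is hypothetical.

```python
def apply_space_decisions(text, keep_decisions):
    """Rebuild a line from one boolean per space: True keeps it, False deletes it."""
    tokens = text.split(" ")
    if len(keep_decisions) != len(tokens) - 1:
        raise ValueError("need exactly one decision per space")
    rebuilt = tokens[0]
    for keep, token in zip(keep_decisions, tokens[1:]):
        rebuilt += (" " if keep else "") + token
    return rebuilt


# Hand-written decisions for illustration: keep the first space, delete the second.
fused = apply_space_decisions("fbōl e bol", [True, False])
```

The model never has to generate characters, only a list of booleans as long as the number of spaces in the line.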

But what kind of information goes into a yes/no decision like this? After a lot of experimentation, we found that the answers to these questions (in addition to others) were most helpful in deciding whether to keep or delete a space in between two words:

  • How common are the words on either side of the space? (Our proxy for commonness: how often each word appears in our annotated corpora)
  • How common is the word I’d get if I deleted the space between the two words?
  • How long are the two words and the words around them? (This might be a hint—it’s very unlikely, for instance, that a preposition would be more than several characters long.)
  • What are the parts of speech of the words around the space?
  • Does the word to the right consist solely of punctuation?
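A feature extractor answering these questions for a single space might look like the sketch below. It is a simplified illustration: the frequency counts and part of speech tags are invented, the feature names are hypothetical, and only the two words adjacent to the space are measured, not the wider context.

```python
import unicodedata


def is_punctuation(token):
    """True if the token is non-empty and every character is Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(char).startswith("P") for char in token
    )


def space_features(left, right, frequencies, pos_tags):
    """Features describing the space between the tokens `left` and `right`."""
    return {
        "left_freq": frequencies.get(left, 0),
        "right_freq": frequencies.get(right, 0),
        "merged_freq": frequencies.get(left + right, 0),  # word formed by deleting the space
        "left_len": len(left),
        "right_len": len(right),
        "left_pos": pos_tags.get(left, "UNK"),
        "right_pos": pos_tags.get(right, "UNK"),
        "right_is_punct": is_punctuation(right),
    }


# Invented corpus statistics and tags, for illustration only.
frequencies = {"ⲉ": 50, "ⲃⲟⲗ": 5, "ⲉⲃⲟⲗ": 30}
pos_tags = {"ⲉ": "PREP", "ⲃⲟⲗ": "N"}
features = space_features("ⲉ", "ⲃⲟⲗ", frequencies, pos_tags)
```

Each space in an edition yields one such feature dictionary, which can then be passed to any standard binary classifier.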

We tried several machine learning algorithms using this approach. To begin with, we only had ~10,000 words of training data, which is too little for many algorithms to learn effectively. In the end, our XGBoost model performed best, with a “correctness rate” (F1) of ~99%, against a naïve baseline (keep all spaces all the time) that hovered around 78%.
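For readers unfamiliar with F1, the Python sketch below shows how the score is computed for a "keep every space" baseline. The gold labels are a toy example, not project data, and treating "keep" as the positive class is an assumption of this sketch.

```python
def f1_score(gold, predicted, positive=True):
    """Harmonic mean of precision and recall for the chosen positive class."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy gold labels: True = the space should be kept, False = it should be deleted.
gold = [True, True, True, False]
keep_all = [True] * len(gold)  # naive baseline: never delete a space
baseline_f1 = f1_score(gold, keep_all)
```

A keep-all baseline always has perfect recall for the "keep" class, so its F1 is limited only by how many spaces should in fact have been deleted.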
