
Fall 2019 Corpora Release 3.0.0

Coptic Scriptorium is happy to announce our latest data release, including a variety of new sources thanks to our collaborators (digitized data courtesy of the Marcion and PAThs projects!). New in this release are:

  • Saints’ lives
    • Life of Cyrus
    • Life of Onnophrius
    • Lives of Longinus and Lucius
    • Martyrdom of Victor the General (part 2)
  • Miscellaneous:
    • Dormition of John
    • Homilies of Proclus
    • Letter of Pseudo-Ephrem

We are also releasing expansions to some of our existing corpora, including:

  • Canons of Johannes (new material annotated by Elizabeth Platte and Caroline T. Schroeder, digital edition provided by Diliana Atanassova)
  • Apophthegmata Patrum
  • A large number of corrections to most of our existing corpora, which are being republished in this release.

All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold).

You can search all corpora using ANNIS and download the data in four formats (relANNIS database files, PAULA XML files, TEI XML files, and SGML files in Tree-tagger format) by browsing our repositories on GitHub. If you just want to read works, cite project data or browse metadata, you can use our updated repository browser, the Canonical Text Services browser and URN resolver:

http://data.copticscriptorium.org/

Our total annotated corpora now contain over 850,000 words; corpora whose machine annotations have been reviewed by human editors now total over 150,000 words!

We would like to thank Marcion, PAThs and the National Endowment for the Humanities for supporting us – we hope this release will be useful and are already working on more!

Dealing with Heterogeneous Low Resource Data – Part III

(This post is part of a series on our 2019 summer’s work improving processing for non-standardized Coptic resources)

In this post, we present some of our work on integrating more ambitious automatic normalization tools that allow us to deal with heterogeneous spelling in Coptic, and give some first numbers on improvements in accuracy through this summer’s work.

Three-step normalization

In 2018, our normalization strategy was a basic statistical one: to look up previously normalized forms in our data and choose the most frequent normalization. Because there are some frequent spelling variations, we also had a rule-based system postprocess the statistical normalizer’s output, to expand, for example, common spellings of nomina sacra (e.g. ⲭⲥ for ⲭⲣⲓⲥⲧⲟⲥ ‘Christ’), even when they appeared as part of larger bound groups (ⲙⲡⲉⲭⲣⲓⲥⲧⲟⲥ ‘of the Christ’, sometimes spelled ⲙⲡⲉⲭⲥ).
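For concreteness, here is a minimal sketch of that 2018-style setup, assuming a hypothetical frequency table of previously observed normalizations and a tiny rule list for nomina sacra (the entries and counts are illustrative, not the actual tool’s internals):

```python
from collections import Counter

# Hypothetical lookup of previously normalized forms: orthographic spelling ->
# counts of normalizations seen in existing annotated data (counts are
# illustrative, not the actual tool's statistics).
norm_counts = {
    "ⲏⲉⲓ": Counter({"ⲏⲓ": 42, "ⲏⲉⲓ": 3}),
}

# Rule-based postprocessing: expand common nomina sacra spellings, even when
# they appear inside larger bound groups (e.g. ⲙⲡⲉⲭⲥ -> ⲙⲡⲉⲭⲣⲓⲥⲧⲟⲥ).
nomina_sacra = {"ⲭⲥ": "ⲭⲣⲓⲥⲧⲟⲥ"}

def normalize_2018(form):
    # Step 1: statistical lookup - pick the most frequent normalization
    # observed for this exact form, if we have seen it before.
    if form in norm_counts:
        form = norm_counts[form].most_common(1)[0][0]
    # Step 2: rule-based expansion of abbreviations anywhere in the group
    # (a real system would be more careful about false positives).
    for abbrev, full in nomina_sacra.items():
        form = form.replace(abbrev, full)
    return form

print(normalize_2018("ⲏⲉⲓ"))     # -> ⲏⲓ
print(normalize_2018("ⲙⲡⲉⲭⲥ"))   # -> ⲙⲡⲉⲭⲣⲓⲥⲧⲟⲥ
```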

One of the problems with this strategy is that for many individual words, we might know common normalizations, such as spelling ⲏⲉⲓ for ⲏⲓ ‘house’, but recognizing that normalization should be carried out depends on correct segmentation – if the system sees ⲙⲡⲏⲉⲓ ‘of the house’ it may not be certain that normalization should occur. Paradoxically, correct normalization vastly improves segmentation accuracy, which is in turn needed for normalization… resulting in a vicious circle.

To address the challenges of normalizing Coptic orthography, this summer we developed a three-step process (sketched schematically after the list below):

  • We consider hypothetical normalizations which could be applied to bound groups if we spelled certain words together, then choose what to spell together (see Part II of this post series)
  • We consider normalizations for the bound groups we ended up choosing, based on past experience (lookup), rules (finite-state morphology) and machine learning (feature based prediction)
  • After segmenting bound groups into morphological categories, we consider whether the segmented sequence contains smaller units that should be normalized
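Schematically, the three steps can be pictured as a small pipeline along the following lines; the component functions are placeholders standing in for the real lookup, finite-state and machine learning modules, not the Coptic-NLP API:

```python
# Placeholder components standing in for the real modules; each is trivial
# here so the sketch runs end to end, but in the actual tools step 1 is the
# whitespace resolver from Part II, and step 2 combines lookup, finite-state
# morphology and feature-based machine learning.

def resolve_whitespace(groups):       # step 1: decide what to spell together
    return groups

def normalize_group(group):           # step 2: normalize whole bound groups
    return group

def segment(group):                   # split a bound group into morphs
    return [group]

def normalize_segment(seg):           # step 3: normalize units inside groups
    return seg

def normalize_document(bound_groups):
    groups = resolve_whitespace(bound_groups)              # step 1
    groups = [normalize_group(g) for g in groups]          # step 2
    segmented = [segment(g) for g in groups]
    return [[normalize_segment(s) for s in segs]           # step 3
            for segs in segmented]

print(normalize_document(["ⲙ̅ⲡ", "ⲉⲓⲙ̅ⲡϣⲁ"]))
```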

To illustrate how this works, we can consider the following example:

Coptic edition:            ⲙ̅ⲡ     ⲉⲓⲙ̅ⲡϣⲁ

Romanized:                mp    ei|mpša

Gloss:                          didn’t  I-worthy

Translation:                “I was not worthy”

These words should be spelled together by our conventions, based on Layton’s (2011) grammar, but Budge’s (1914) edition has placed a space here and the first person marker is irregularly spelled epsilon-iota, ⲉⲓ ‘I’, instead of just iota, ⲓ. When resolving whitespace ambiguity, we ask how likely it is that mp stands alone (unlikely), but also whether mp+ei… is grammatical, which in the current spelling might not be recognized. Our normalizer needs to resolve the hypothetical fused group to be spelled ⲙⲡⲓⲙⲡϣⲁ, mpimpša. Since this particular form has not appeared before in our corpora, we rely on data augmentation: our system internally generates variant spellings, for example substituting the common spelling variation of ⲓ with ⲉⲓ in words we have seen before, and generating a solution ⲙⲡⲉⲓⲙⲡϣⲁ -> ⲙⲡⲓⲙⲡϣⲁ. The augmentation system relies both on previously seen forms (a normally spelled ⲙⲡⲓⲙⲡϣⲁ, which we have however also not seen before) and combinations produced by a grammar (it considers the negative past auxiliary ⲙⲡ followed by all subject pronouns and verbs in our lexicon, which does yield the necessary ⲙⲡⲓⲙⲡϣⲁ).
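Here is a minimal sketch of the augmentation idea, assuming a hypothetical inventory of known forms and only the single ⲓ ~ ⲉⲓ alternation (the real system uses many more substitutions and a lexicon-driven grammar):

```python
import re

# Hypothetical inventory of known normalized forms: spellings we have seen
# before plus forms produced by a small grammar (e.g. negative past ⲙⲡ +
# subject pronoun + verb, which yields ⲙⲡⲓⲙⲡϣⲁ even though it never
# occurred in our corpora).
known_forms = {"ⲙⲡⲓⲙⲡϣⲁ", "ⲏⲓ"}

def i_variants(form):
    """Yield (variant, normalization) pairs for one common alternation, ⲓ ~ ⲉⲓ.
    The real augmentation applies many more substitutions."""
    for m in re.finditer("ⲓ", form):
        if form[max(m.start() - 1, 0):m.start()] == "ⲉ":
            continue  # this ⲓ is already part of an ⲉⲓ digraph
        variant = form[:m.start()] + "ⲉⲓ" + form[m.end():]
        yield variant, form

# Augmented lookup from plausible non-standard spellings back to normal forms.
variant_lookup = {v: norm for f in known_forms for v, norm in i_variants(f)}

print(variant_lookup.get("ⲙⲡⲉⲓⲙⲡϣⲁ"))  # -> ⲙⲡⲓⲙⲡϣⲁ
print(variant_lookup.get("ⲏⲉⲓ"))        # -> ⲏⲓ
```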

The segmenter is then able to successfully segment this into mp|i|mpša, and we therefore decide:

  1. This should be fused
  2. This can be segmented into three segments
  3. The middle segment is the first person pronoun (with non-standard spelling)
  4. It should be normalized (and subsequently tagged and lemmatized)

If normalization had failed for the whole word group, there is still a chance that the machine learning segmenter would have recognized mpša ‘worthy’ and split it apart, which means that segmentation is slightly less impacted by normalization errors than it would have been in our tools a year ago.

How big of a deal is this?

It’s hard to convey what each improvement like this does for the quality of our data, but we’ll try to give some numbers and contextualize them here. The table below shows an evaluation of the 2018 and 2019 tools on the same training and test data: in-domain data comes from the UD Coptic Treebank test set (release 2.4), and out-of-domain data comes from two texts from editions by W. Budge, the Life of Cyrus and the Repose of John the Evangelist, previously digitized by the Marcion project. The distinction between in-domain and out-of-domain is important here: in-domain evaluation gives the tools test data drawn from the same distribution of text types they were trained on, which is consequently much less surprising, while out-of-domain data comes from text types the system has not seen before, edited with very different editorial practices.

 

task           2018, in domain    2018, out of domain    2019, in domain    2019, out of domain
spaces         NA*                96.57                  NA*                98.08
orthography    98.81              95.79                  99.76              97.83
segmentation   97.78 (96.86**)    93.67 (92.28**)        99.54 (99.36**)    96.71 (96.25**)
tagging        96.23              96.34                  98.35              98.11

 

* In-domain data has canonical whitespace and cannot be used to test whitespace normalization

** Numbers in brackets represent automatically normalized input (more realistic, but harder to judge performance of segmentation as an isolated task)

 

The numbers show that several tools have improved dramatically across the board, even for in-domain data – most noticeably the part of speech tagger and normalizer modules. The improvement in segmentation accuracy is much more marked on out-of-domain data, and especially if we do not give the segmenter normalized text as input (numbers in brackets). In this case, automatic normalization is used, and the improvements in normalization cascade into better segmentation as well.

Qualitatively, these differences are transformative for Coptic Scriptorium: the ability to handle out-of-domain data with good accuracy is essential for making available large amounts of digitized text that come from earlier digitization efforts and partner projects. Although the 2018 accuracies on the left may look alright, error rates have been cut by more than half in some cases (from 7.72% down to 3.75% in realistic out-of-domain segmentation). Additionally, the errors that were eliminated are qualitatively different: the bulk of accuracy up to about 90% represents easy cases, such as tagging and segmenting function words (e.g. prepositions) correctly. The last few percent represent the hard cases, such as unknown proper names, unusual verb forms and grammatical constructions, loan words, and other high-interest items.
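For reference, the ‘more than half’ figure can be read straight off the bracketed out-of-domain segmentation numbers in the table:

```python
# Out-of-domain segmentation on automatically normalized input (the bracketed
# numbers in the table): 92.28% accuracy in 2018 vs. 96.25% in 2019.
err_2018 = 100 - 92.28          # 7.72% error rate
err_2019 = 100 - 96.25          # 3.75% error rate
reduction = (err_2018 - err_2019) / err_2018
print(f"relative error reduction: {reduction:.0%}")   # roughly 51%
```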

You can get the latest release of the Coptic-NLP tools (V3.0.0) here. We plan to release new data processed with these tools soon – stay tuned!

Dealing with Heterogeneous Low Resource Data – Part I

Image from Budge’s (1914) Coptic Martyrdoms in the Dialect of Upper Egypt (scan made available by archive.org)

(This post is part of a series on our 2019 summer’s work improving processing for non-standardized Coptic resources)

A major challenge for Coptic Scriptorium as we expand to cover texts from other genres, with different authors, styles and transcription practices, is how to make everything uniform. For example, our previously released data has very specific transcription conventions with respect to what to spell together, based on Layton’s (2011:22-27) concept of bound groups, how to normalize spellings, what base forms to lemmatize words to, and how to segment and analyze groups of words internally.

An example of our standard is shown below, with segments inside groups separated by ‘|’:

Coptic original:         ⲉⲃⲟⲗ ϩⲙ̅|ⲡ|ⲣⲟ    (Genesis 18:2)

Romanized:                 ebol hm|p|ro

Translation:                 out of the door

The words hm ‘in’, p ‘the’ and ro ‘door’ are spelled together, since they are phonologically bound: similarly to words spelled together in Arabic or Hebrew, the entire phrase carries one stress (on the word ‘door’) and no words may be inserted between them. Assimilation processes unique to the environment inside bound groups also occur: hm ‘in’ is normally hn, but its final ‘n’ becomes ‘m’ before the labial ‘p’, a process which does not apply across adjacent bound groups.

But many texts which we would like to make available online are transcribed using very different conventions, such as the following example from the Life of Cyrus, previously transcribed by the Marcion project following the conventions of W. Budge’s (1914) edition:

 

Coptic original:    ⲁ    ⲡⲥ̅ⲏ̅ⲣ̅               ⲉⲓ  ⲉ ⲃⲟⲗ    ϩⲙ̅ ⲡⲣⲟ  (Life of Cyrus, BritMusOriental6783)

Romanized:           a     p|sēr              ei   e bol   hm p|ro

Gloss:                        did the|savior go to-out in the|door

Translation:          The savior went out of the door

 

Budge’s edition usually (but not always) spells prepositions apart, articles together and the word ebol in two parts, e + bol. These specific cases are not hard to list, but others are more difficult: the past auxiliary is just a, and is usually spelled together with the subject, here ‘savior’. However, ‘savior’ has been spelled as an abbreviation: sēr for sōtēr, making it harder to recognize that a is followed by a noun and is therefore likely to be the past tense marker; moreover, not all cases of a should be bound. This is further complicated by the fact that words in the edition also break across lines, meaning we sometimes need to decide whether to fuse parts of words that are arbitrarily broken across typesetting boundaries as well.

The amount of material available in varying standards is too large to manually normalize each instance to a single form, raising the question of how we can deal with these differences automatically. In the next posts we will look at how white space can be normalized using training data, rule-based morphology and machine learning tools, and how we can recover standard spellings to ensure uniform searchability and online dictionary linking.

 

References

Layton, B. (2011). A Coptic Grammar. (Porta linguarum orientalium 20.) Wiesbaden: Harrassowitz.

Budge, E.A.W. (1914) Coptic Martyrdoms in the Dialect of Upper Egypt. London: Oxford University Press.

New features in our NLP pipeline

Coptic Scriptorium’s Natural Language Processing (NLP) tools now support two new features:

  • Multiword expression recognition
  • Detokenization (bound group re-merging)

Kicking off work on the new phase of our project, these new tools will improve inter-operability of Coptic data across corpora, lexical resources and projects:

Multiword expressions

The multiword expression ⲉⲃⲟⲗ ϩⲛ “out of” (from Apophthegmata Patrum 27, MOBG EG 67. Image: Österreichische Nationalbibliothek)

Although lemmatization and normalization already offer a good way of finding base forms of Coptic words, many complex expressions cross word borders in Coptic. For example, although it is possible to understand combinations such as ⲉⲃⲟⲗ ‘out’ + ϩⲛ ‘in’, or ⲥⲱⲧⲙ ‘hear’ + ⲛⲥⲁ ‘behind’ approximately from the meaning of each word, together they have special senses, such as ‘out of’ and ‘obey’ respectively.  This and similar combinations are distinct enough from their constituents that they receive different lexicon entries in dictionaries, for example in the Coptic Dictionary Online (CDO), compare: ⲥⲱⲧⲙ, ⲛⲥⲁ and ⲥⲱⲧⲙ ⲛⲥⲁ.

Thanks to the availability of the CDO’s data,  the NLP tools can now attempt to detect known multiword expressions, which can then be linked back to the dictionary and used to collect frequencies for complex items. 
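One way to picture this step is a greedy longest match of normalized words against a list of known multiword entries; the sketch below uses a tiny hand-made entry list and stands in for the idea only, not the actual matching logic or the CDO data:

```python
# Illustrative multiword entries keyed by their normalized words; the real
# tool draws its entries from the CDO and uses more elaborate matching.
mwe_lexicon = {
    ("ⲉⲃⲟⲗ", "ϩⲛ"): "ⲉⲃⲟⲗ ϩⲛ",      # 'out of'
    ("ⲥⲱⲧⲙ", "ⲛⲥⲁ"): "ⲥⲱⲧⲙ ⲛⲥⲁ",    # 'obey'
}
max_len = max(len(key) for key in mwe_lexicon)

def find_mwes(words):
    """Greedy longest match: return (start, end, entry) spans over the words."""
    spans, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in mwe_lexicon:
                spans.append((i, i + n, mwe_lexicon[candidate]))
                i += n
                break
        else:
            i += 1  # no multiword entry starts here
    return spans

print(find_mwes(["ⲁϥⲉⲓ", "ⲉⲃⲟⲗ", "ϩⲛ", "ⲡⲏⲓ"]))   # [(1, 3, 'ⲉⲃⲟⲗ ϩⲛ')]
```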

Many thanks to Maxim Kupreyev for his help in setting up multiword expressions in the dictionary, as well as to Frank Feder, So Miyagawa, Sebastian Richter and other KELLIA collaborators for making these lexical resources available!

Detokenization

Coptic bound groups have been written with intervening spaces according to a number of similar but subtly different traditions, such as Walter Till’s system and the system used in Bentley Layton’s Coptic Grammar, which Coptic Scriptorium follows. The differences between these and other segmentation traditions can create some problems:

  1. Users searching in multiple corpora may be surprised when queries behave differently due to segmentation differences.
  2. Machine learning tools trained on one standard degrade in performance when the data they analyze uses a different standard.

In order to address these issues and have more consistent and more accurately analyzed data, we have added a component to our tools which can attempt to merge bound groups into ‘Laytonian’ bound groups. In Computational Linguistics, re-segmenting a segmented text is referred to as ‘detokenization’, but for our tools this has also been affectionately termed ‘Laytonization’. The new detokenizer has several options to choose from:

  1. No merging – this is the behavior of our tools to date, no modifications are undertaken.
  2. Conservative merging mode – in conservative merging, only items known to be spelled apart in different segmentations are merged. For example, in the sequence ϩⲙ ⲡⲏⲓ “in the-house”, the word ϩⲙ “in” is typically spelled apart in Till’s system, but together in Layton’s. This type of sequence would be merged in conservative mode.
  3. Aggressive merging mode – in this mode, anything that is most often spelled bound in our training data is merged. This is done even if the segment being bound by the system is not one that would normally be spelled apart in some other conventional system. For example, in the sequence ⲁ ϥⲥⲱⲧⲙ “(PAST) he heard”, the past tense marker ⲁ is a unit that no Coptic orthographic convention spells apart. It is relatively unlikely that it should stand apart in normal Coptic text in any convention, so in aggressive mode it would be merged as well (a rough sketch of both merging modes follows this list).
  4. Segment at merge point – regardless of the merging mode chosen, if any merging occurs, this option enforces the presence of a morphological boundary at any merge point. This ensures that merged items are always assigned to separate underlying words, and receive part of speech annotations accordingly, even if our machine learning segmenter does not predict that the merged bound group should be segmented in this way.
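As a rough illustration of the difference between the conservative and aggressive modes, the sketch below assumes a hypothetical list of forms known to be spelled apart in other conventions and hypothetical binding frequencies from training data; it is not the actual detokenizer:

```python
# Forms known to be spelled apart in some conventions (e.g. Till's
# prepositions) but bound in Layton's system; conservative mode only
# merges after these.
conservative_left = {"ϩⲙ", "ϩⲛ"}

# Hypothetical statistics from training data: how often a form is spelled
# bound to the following word (numbers are illustrative only).
bound_ratio = {"ϩⲙ": 0.98, "ϩⲛ": 0.97, "ⲁ": 0.95}

def detokenize(words, mode="conservative"):
    """Merge bound groups left to right according to the chosen mode."""
    out = []
    for w in words:
        prev = out[-1] if out else None
        merge = prev is not None and (
            (mode == "conservative" and prev in conservative_left) or
            (mode == "aggressive" and bound_ratio.get(prev, 0.0) > 0.5)
        )
        if merge:
            out[-1] = prev + w    # fuse with the preceding word
        else:
            out.append(w)
    return out

print(detokenize(["ϩⲙ", "ⲡⲏⲓ"]))                       # ['ϩⲙⲡⲏⲓ']
print(detokenize(["ⲁ", "ϥⲥⲱⲧⲙ"], mode="aggressive"))   # ['ⲁϥⲥⲱⲧⲙ']
```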

The use of these options is expected to correspond more or less to the type of input text: for carefully edited text from a different convention (e.g. Till), conservative merging with segmentation at merge points is recommended. For ‘messier’ text (e.g. older digitized editions with varying conventions, such as editions by Wallis Budge, or material from automatic Optical Character Recognition), aggressive merging is advised, and we may not necessarily want to assume that segments should be introduced at merge points.

We hope these tools will be useful and expect to see them create more consistency, higher accuracy and inter-operability between resources in the near future!

Coptic Scriptorium wins DHAG grant from the NEH


We are very pleased to report that the National Endowment for the Humanities just announced their approval of a new Stage III Digital Humanities Advancement Grant continuing their long-standing support of our work in Coptic Scriptorium. The new grant is titled:

A Linked Digital Environment for Coptic Studies: Integrating Heterogeneous Data with Machine Learning and Natural Language Processing

The focus of the grant, which is set to run from fall 2018 for three years, is developing more robust tools which can deliver high-accuracy analyses with less manual intervention in the face of more heterogeneous data, including Coptic materials from OCR, spelling variation across editions and varying scholarly conventions. This will allow us to grow the collection of corpora made available through the project’s tools in the coming months and years. Additional areas of work include extending our use of Linked Open Data standards to connect the project to other initiatives in the field, and pioneering methods for automatic Named Entity Recognition in Coptic, among other things!

Please stay tuned for more updates on the new project – in the meantime we thank the NEH for their trust in us, our project members, contributors and advisory board for all their work, and the Digital Coptic community for your support!

New paper about the Coptic Dictionary Online

Coptic Dictionary Online

A new paper about the Coptic Dictionary Online will be presented at this year’s ACL SIGHUM workshop on Language Technology for Cultural Heritage. This work is a collaboration between the Akademie der Wissenschaften zu Göttingen, Berlin-Brandenburgische Akademie der Wissenschaften, and Coptic Scriptorium.

The paper presents the structure and underlying principles of the dictionary and its Web interface, and also gives a quantitative analysis of the dictionary’s coverage of Coptic lexical material in corpus data. You can check out the pre-print here:

Feder, Frank, Kupreyev, Maxim, Manning, Emma, Schroeder, Caroline T. and Zeldes, Amir (2018) “A Linked Coptic Dictionary Online”. Proceedings of LaTeCH 2018 – The 11th SIGHUM Workshop at COLING2018. Santa Fe, NM.

[paper]

Automatically parsed OT and NT corpora

We are pleased to announce that a new version of the automatically annotated New Testament and Old Testament corpora is now available online in Coptic Scriptorium!

The new version has substantially better automatic segmentation accuracy, and, for the first time, automatic syntactic parses for each verse. For more information on the syntax annotations, please see our previous post here:

https://blog.copticscriptorium.org/2018/05/07/coptic-treebank-2-2-moving-us-to-better-parsing/

Here are some example queries to get you started:

Thanks as always to the NEH and DFG for their support and to everyone who made the texts available, which come from the Sahidica version of the NT ((c) J. Warren Wells) and the OT text contributed by  the CrossWire Bible Society SWORD Project, thanks to work by Christian Askeland, Matthias Schulz and Troy Griffitts.

Coptic Treebank 2.2 – moving us to better parsing!

With the data release of Universal Dependencies 2.2, an update to the Coptic Treebank is now online! Thanks to work by Mitchell Abrams and Liz Davidson we’ve been able to add the first three chapters from 1 Corinthians and make numerous corrections. Another three chapters of 1 Corinthians and a portion of the Martyrdom of Victor the General are coming soon. You can see how we’ve been annotating and the documentation of our guidelines here:

http://universaldependencies.org/cop/

Thanks to the new data, automatic parsing has become somewhat more reliable, allowing us to add automatic parses to the most recent release. The results are better than before, but note we still only expect around 90% accuracy. To illustrate where the computer can’t do what humans can, here are two examples of a verb governing a subordinate verb in a clause marked by Ϫⲉ ‘that’. The subordinate verb usually has one of two labels:

  • ccomp if it’s a complement clause (I said that…)
  • advcl if it’s an adverbial clause, such as a causal clause (Ϫⲉ  meaning ‘because’).

One of these examples was done by a human who got things right, the other contains a parser error – see if you can spot which is which!

 

New corpora – release 2.4.0 is out!

We are pleased to announce release version 2.4.0, with new tagged and lemmatized corpora available for reading and download at [1] and fully searchable at [2]:

[1] http://data.copticscriptorium.org/

[2] https://corpling.uis.georgetown.edu/annis/scriptorium

This release contains new data contributed by Alin Suciu, David Brakke and Diliana Atanassova, as well as out of copyright edition material contributed by the Marcion project. New data in this release includes excerpts from:

  • The Martyrdom of Saint Victor the General (2033 tokens)
  • The Canons of Apa Johannes (438 tokens)
  • Pseudo-Theophilus On the Cross and The Thief (2814 tokens)
  • Shenoute, Some Kinds of People Sift Dirt (888 tokens)
  • 11 additional Apophthegmata Patrum, bringing the total released to 63 apophthegms (7077 tokens)

All texts are also linked to the Coptic Dictionary Online (https://corpling.uis.georgetown.edu/coptic-dictionary/), which has been updated with frequency information including these texts. We would like to thank the annotators and translators of these data sets, several of whom are new to the project, without whose work the corpora would not be online:

Alexander Turtureanu, Alin Suciu, Amir Zeldes, Caroline T. Schroeder, Christine Luckritz Marquis, Dana Robinson, David Brakke, David Sriboonreuang, Diliana Atanassova, Elizabeth Davidson, Elizabeth Platte, Gianna Zipp, J. Gregory Given, Janet Timbie, Jennifer Quigley, Laura Slaughter, Lauren McDermott, Marina Ghaly, Mitchell Abrams, Paul Lufter, Rebecca Krawiec, Saskia Franck and Tobias Paul

We hope everyone will find this release useful and look forward to releasing more data in the coming year!

 

New release of the Coptic Treebank

Coptic Treebank release 2.1, now with three Letters of Besa!

We are pleased to announce the release of the latest version of the Coptic Treebank, now containing three Letters of Besa:

  • On Lack of Food
  • To Aphthonia
  • To Thieving Nuns

This brings the total corpus size up to 10,499 tokens, thanks to annotation work by Elizabeth Davidson and Amir Zeldes, building on earlier transcription and tagging work by Coptic Scriptorium and KELLIA partners. Special thanks are due to So Miyagawa for providing the transcription for On Lack of Food. The corpus will continue to grow as we work to annotate more data and improve the accuracy of our automatic syntax parser for Coptic. You can search the current version of the corpus in ANNIS here:

https://corpling.uis.georgetown.edu/annis/scriptorium

Or download the latest raw annotated data from GitHub here:

https://github.com/universalDependencies/UD_Coptic/tree/dev

Please let us know if you find any errors or have any feedback on the treebank!
