Coptic SCRIPTORIUM Blog

Page 5 of 10

Rebecca Krawiec, Elizabeth Platte, Amir Zeldes, Caroline T. Schroeder at Georgetown University, 2018

Recent presentations by Coptic Scriptorium team members (post 1 of 2)!

This fall, Coptic Scriptorium team members have presented their work in a number of environments.

Research Talk, Georgetown University Linguistics Speaker Series

In September, as part of the Georgetown University Department of Linguistics Friday Speaker Series, the project presented a summary of our latest work and our goals for the new NEH Digital Humanities Advancement Grant we received. “A Linked Digital Environment for Coptic Studies”. Caroline T. Schroeder provided an overview of the project. Amir Zeldes presented the technology required to machine-process Coptic text in order to produce an annotated, digital corpus and linked online lexicon. Rebecca Krawiec discussed the research potential of an annotated digital corpus for research in early monasticism. Elizabeth Platte introduced the concept of linked data and demonstrated our linked geographic data features. (Christine Luckritz Marquis was scheduled present research on space and place in monastic literature but was unfortunately sidelined by a hurricane.)

Rebecca Krawiec, Elizabeth Platte, Amir Zeldes, Caroline T. Schroeder at Georgetown University, 2018

Material of Christian Apocrypha Conference

In December, Caroline T. Schroeder gave a paper at the Material of Christian Apocrypha Conference hosted at the University of Virginia, under the auspices of the North American Society for the Study of Christian Apocryphal Literature. Dr. Schroeder’s paper, “The Materiality of Digital Apocryphal Studies,” addressed the role of digital humanities in studying the colonial history of manuscripts, people and places in early Christian literature, and public humanities. It was part of a panel on Christian Apocrypha and the Digital Humanities, which also included papers by James Walters (Rochester College) on “The Digital Syriac Corpus: A New Resource for the Study of Syriac Texts” and Brandon Hawk (Rhode Island College) on “The Medieval Social Network of the Gospel of Pseudo-Matthew”. Datasets used in the presentation are available at Dr. Schroeder’s GitHub site.

Caroline T. Schroeder presenting about the manuscripts digitized by Coptic Scriptorium

Caroline T. Schroeder presenting visualizations of occurrences of proper names in some of Coptic Scriptorium’s corpora

New features in our NLP pipeline

December 5, 2018 / Amir Zeldes / 2 Comments

Coptic Scriptorium’s Natural Language Processing (NLP) tools now support two new features:

Multiword expression recognition
Detokenization (bound group re-merging)

Kicking off work on the new phase of our project, these new tools will improve inter-operability of Coptic data across corpora, lexical resources and projects:

Multiword expressions

The multiword expression ⲉⲃⲟⲗ ϩⲛ “out of” (from Apophthegmata Patrum 27, MOBG EG 67. Image: Österreichische Nationalbibliothek)

Although lemmatization and normalization already offer a good way of finding base forms of Coptic words, many complex expressions cross word borders in Coptic. For example, although it is possible to understand combinations such as ⲉⲃⲟⲗ ‘out’ + ϩⲛ ‘in’, or ⲥⲱⲧⲙ ‘hear’ + ⲛⲥⲁ ‘behind’ approximately from the meaning of each word, together they have special senses, such as ‘out of’ and ‘obey’ respectively. This and similar combinations are distinct enough from their constituents that they receive different lexicon entries in dictionaries, for example in the Coptic Dictionary Online (CDO), compare: ⲥⲱⲧⲙ, ⲛⲥⲁ and ⲥⲱⲧⲙ ⲛⲥⲁ.

Thanks to the availability of the CDO’s data, the NLP tools can now attempt to detect known multiword expressions, which can then be linked back to the dictionary and used to collect frequencies for complex items.

Many thanks to Maxim Kupreyev for his help in setting up multiword expressions in the dictionary, as well as to Frank Feder, So Miyagawa, Sebastian Richter and other KELLIA collaborators for making these lexical resources available!

Detokenization

Coptic bound groups have been written with intervening spaces according to a number of similar but subtly different traditions, such as Walter Till’s system and the system employed by Bentley Layton’s A Coptic Grammar, which Coptic Scriptorium employs. The differences between these and other segmentation traditions can create some problems:

Users searching in multiple corpora may be surprised when queries behave differently due to segmentation differences.
Machine learning tools trained on one standard degrade in performance when the data they analyze uses a different standard.

In order to address these issues and have more consistent and more accurately analyzed data, we have added a component to our tools which can attempt to merge bound groups into ‘Laytonian’ bound groups. In Computational Linguistics, re-segmenting a segmented text is referred to as ‘detokenization’, but for our tools this has also been affectionately termed ‘Laytonization’. The new detokenizer has several options to choose from:

No merging – this is the behavior of our tools to date, no modifications are undertaken.
Conservative merging mode – in conservative merging, only items known to be spelled apart in different segmentations are merged. For example, in the sequence ϩⲙ ⲡⲏⲓ “in the-house”, the word ϩⲙ “in” is typically spelled apart in Till’s system, but together in Layton’s. This type of sequence would be merged in conservative mode.
Aggressive merging mode – in this mode, anything that is most often spelled bound in our training data is merged. This is done even if the segment being bound by the system is not not one that would normally be spelled apart in some other conventional system. For example, the sequence ⲁ ϥⲥⲱⲧⲙ “(PAST) he heard”, the past tense marker ⲁ is a unit that no Coptic orthographic convention spells apart. It is relatively unlikely that it should stand apart in normal Coptic text in any convention, so in aggressive mode it would be merged as well.
Segment at merge point – regardless of the merging mode chosen, if any merging occurs, this option enforces the presence of a morphological boundary at any merge point. This ensures that merged items are always assigned to separate underlying words, and receive part of speech annotations accordingly, even if our machine learning segmenter does not predict that the merged bound group should be segmented in this way.

The use of these options is expected to correspond more or less to the type of input text: for carefully edited text from a different convention (e.g. Till), conservative merging is with segmentation at merge points is recommended. For ‘messier’ text (e.g. older digitized editions with varying conventions, such as editions by Wallis Bugde, or material from automatic Optical Character Recognition), aggressive merging is advised, and we may not necessarily want to assume that segments should be introduced at merge points.

We hope these tools will be useful and expect to see them create more consistency, higher accuracy and inter-operability between resources in the near future!

Coptic Scriptorium wins DHAG grant from the NEH

August 8, 2018 / Amir Zeldes / 0 Comments

We are very pleased to report that the National Endowment for the Humanities just announced their approval of a new Stage III Digital Humanities Advancement Grant continuing their long-standing support of our work in Coptic Scriptorium. The new grant is titled:

A Linked Digital Environment for Coptic Studies: Integrating Heterogeneous Data with Machine Learning and Natural Language Processing

The focus of the grant, which is set to run from fall 2018 for three years, is developing more robust tools which can deliver high accuracy analyses with less manual intervention in the face of more heterogeneous data, including Coptic materials from OCR, spelling variation across editions and varying scholarly conventions. This will allow us to grow the collection of corpora made available over the project’s tools in the coming months and years. Additional areas of work include improving our work on Linked Open Data standards connecting the project to other initiatives in the field, and pioneering methods for automatic Named Entity Recognition in Coptic, among other things!

Please stay tuned for more updates on the new project – in the meantime we thank the NEH for their trust in us, our project members, contributors and advisory board for all their work, and the Digital Coptic community for your support!

Presenters and attendees of the 19th annual UCLA-St. Shenouda Society Coptic Studies Conference

Coptic SCRIPTORIUM at the UCLA-St Shenouda Coptic Studies Conference

August 2, 2018 / ctschroeder / 0 Comments

Coptic SCRIPTORIUM’s Carrie Schroeder presented a paper on our project’s latest work at the nineteenth annual UCLA-St. Shenouda Society Coptic Studies Conference. As usual, the conference showcased papers from a diverse set of presenters, with scholarship on early monasticism all the way to modern Coptic architecture. We thank Hany Takla, President of the Society and longtime friend of the project, for the opportunity to attend.

Presenters and attendees of the 19th annual UCLA-St. Shenouda Society Coptic Studies Conference

German PI Dr. Prof. Heike Behlmer and US PI Caroline T. Schroeder at Schroeder's recent visit to the Coptic Old Testament Project at the University and the Goettingen Academy.

Coptic Scriptorium’s summer adventures

July 20, 2018 / ctschroeder / 0 Comments

This has been a summer of writing, annotating, and conferencing!

German PI Dr. Prof. Heike Behlmer and US PI Caroline T. Schroeder at Schroeder’s recent visit to the Coptic Old Testament Project at the University and the Goettingen Academy.

We are winding up our collaborative grant with our German partners (Coptic Old Testament Project, the Thesaurus Linguae Aegyptiae, the DDGLC, and the INTF). Our German and US PI’s met in Göttingen, Germany, earlier this summer. We’re working on writing our final reports and exchanging data and technologies. We’re hoping to publish more annotated texts later this year.

We also have had a series of conference papers, including a paper on one of our collaboration’s proudest achievements, the online Coptic Dictionary. Here are some of the lectures and conference presentations this summer:

Miyagawa, So and Zeldes, Amir (2018) “A Semantic Map of the Coptic Complementizer če Based on Corpus Analysis: Grammaticalization and Areal Typology in Africa,” International Workshop on Semantic maps: Where do we stand and where are we going? Liège, Belgium. June.

Schroeder, Caroline T. (2018) “A Homily is a Homily is a Homily is a Corpus: Digital Approaches to Shenoute,” The Transmission of Early Christian Homilies from Late Antiquity to the Middle Ages Conference, Goethe-Universität Frankfurt am Main. June.

Schroeder, Caroline T. (2018) “Coptic Studies in the Digital Age,” Department of Ancient History, Macquarie University. July.

Schroeder, Caroline T. (2018) “Coptic Studies in a Digital Age,” UCLA-St. Shenouda Foundation Coptic Studies Conference, Los Angeles. July.

Feder, Frank,Maxim Kupreyev, Emma Manning, Caroline T. Schroeder, Amir Zeldes. “A Linked Coptic Dictionary Online”. Proceedings of LaTeCH 2018 – The 11th SIGHUM Workshop at COLING2018. Santa Fe, NM. August. [paper online]

As always, thanks to all our contributors, collaborators, and board members for their insight and labor.

New paper about the Coptic Dictionary Online

July 12, 2018 / Amir Zeldes / 0 Comments

A new paper about the Coptic Dictionary Online will be presented at this year’s ACL SIGHUM workshop on Language Technology for Cultural Heritage. This work is a collaboration between the Akademie der Wissenschaften zu Göttingen, Berlin-Brandenburgische Akademie der Wissenschaften, and Coptic Scriptorium.

The paper presents the structure and underlying principles of the dictionary and its Web interface, and also gives a quantitative analysis of the dictionary’s coverage of Coptic lexical material in corpus data. You can check out the pre-print here:

Feder, Frank, Kupreyev, Maxim, Manning, Emma, Schroeder, Caroline T. and Zeldes, Amir (2018) “A Linked Coptic Dictionary Online”. Proceedings of LaTeCH 2018 – The 11th SIGHUM Workshop at COLING2018. Santa Fe, NM.

[paper]

Automatically parsed OT and NT corpora

May 31, 2018 / Amir Zeldes / 0 Comments

We are pleased to announce that a new version of the automatically annotated New Testament and Old Testament corpora is now available online in Coptic Scriptorium!

The new version has substantially better automatic segmentation accuracy, and, for the first time, automatic syntactic parses for each verse. For more information on the syntax annotations, please see our previous post here:

https://blog.copticscriptorium.org/2018/05/07/coptic-treebank-2-2-moving-us-to-better-parsing/

Here are some example queries to get you started:

Search the Old Testament for Greek words: lang=”Greek”
Search both corpora for the morpheme “ⲙⲛⲧ”: morph=”ⲙⲛⲧ”
Find complement clauses of regular verbs in the New Testament: pos=”V” ->dep func=”ccomp”
Search 1 and 2 Corinthians for verse translations containing “brother”: translation=/.*brother.*/ & meta::title=/.*Corinthians.*/
Find cases of “ⲙⲉⲛ” indirectly followed by “ⲇⲉ” in both corpora: norm=”ⲙⲉⲛ” .* norm=”ⲇⲉ”

Thanks as always to the NEH and DFG for their support and to everyone who made the texts available, which come from the Sahidica version of the NT ((c) J. Warren Wells) and the OT text contributed by the CrossWire Bible Society SWORD Project, thanks to work by Christian Askeland, Matthias Schulz and Troy Griffitts.

Coptic Treebank 2.2 – moving us to better parsing!

May 7, 2018 / Amir Zeldes / 0 Comments

With the data release of Universal Dependencies 2.2, an update to the Coptic Treebank is now online! Thanks to work by Mitchell Abrams and Liz Davidson we’ve been able to add the first three chapters from 1 Corinthians and make numerous corrections. Another three chapters of 1 Corinthians and a portion of the Martyrdom of Victor the General are coming soon. You can see how we’ve been annotating and the documentation of our guidelines here:

http://universaldependencies.org/cop/

Thanks to the new data, automatic parsing has become somewhat more reliable, allowing us to add automatic parses to the most recent release. The results are better than before, but note we still only expect around 90% accuracy. To illustrate where the computer can’t do what humans can, here are two examples of a verb governing a subordinate verb in a clause marked by Ϫⲉ ‘that’. The subordinate verb usually has one of two labels:

ccomp if it’s a complement clause (I said that…)
advcl if it’s an adverbial clause, such as a causal clause (Ϫⲉ meaning ‘because’).

One of these examples was done by a human who got things right, the other contains a parser error – see if you can spot which is which!

More annotated texts: April 2018 release (v. 2.5.0)

April 27, 2018 / ctschroeder / 0 Comments

The Coptic Scriptorium team is pleased to announce the latest release of annotated Coptic corpora.

This release contains new text data contributed by Alin Suciu and Diliana Atanassova as part of the KELLIA project, as well as transcriptions and annotations from various Coptic SCRIPTORIUM project participants. New data in this release includes excerpts from:

The Canons of Apa Johannes (2,024 words)
Pseudo-Theophilus On the Cross and The Thief (4,543 words)
additional Apophthegmata Patrum, bringing the total released to 75 apophthegms (9,413 words)

All texts are also linked word-by-word to the Coptic Dictionary Online (https://corpling.uis.georgetown.edu/coptic-dictionary/).

All corpora now also contain syntactic annotations derived from our tree-banking project. These annotations can be searched using the “func” annotation and visualized as treebanks.

Use our data by:

Browsing the Apa Johannes texts on our site in a format linked to the online dictionary
Search for mentions of “the cross” in Pseudo-Theophilus
Reading the Apophthegmata Patrum in aligned translation
Search for first-person subjects that are also proper names in the Apophthegmata Patrum
Download the data files from our GitHub site in multiple formats

We would like to thank the annotators and translators, without whose work the corpora would not be online. We thank the NEH and DFG for the necessary funding.

Coptic SCRIPTORIUM at ISAW Conference

April 20, 2018 / ctschroeder / 0 Comments

Coptic SCRIPTORIUM’s Caroline T. Schroeder will be giving the keynote at the “Future Philologies” Conference at Institute for the Study of the Ancient World today in New York City. We’re looking forward to conversations with colleagues old and new.