Dealing with Heterogeneous Low Resource Data – Part I

Image from Budge’s (1914) Coptic Martyrdoms in the Dialect of Upper Egypt (scan made available by archive.org)

(This post is part of a series on our 2019 summer’s work improving processing for non-standardized Coptic resources)

A major challenge for Coptic Scriptorium as we expand to cover texts from other genres, with different authors, styles and transcription practices, is how to make everything uniform. For example, our previously released data has very specific transcription conventions with respect to what to spell together, based on Layton’s (2011:22-27) concept of bound groups, how to normalize spellings, what base forms to lemmatize words to, and how to segment and analyze groups of words internally.

An example of our standard is shown in (1) below, with segments inside groups separated by ‘|’:

Coptic original:         ⲉⲃⲟⲗ ϩⲙ̅|ⲡ|ⲣⲟ    (Genesis 18:2)

Romanized:                 ebol hm|p|ro

Translation:                 out of the door

The words hm ‘in’, p ‘the’ and ro ‘door’ are spelled together, since they are phonologically bound: similarly to words spelled together in Arabic or Hebrew, the entire phrase carries one stress (on the word ‘door’) and no words may be inserted between them. Assimilation processes unique to the environment inside bound groups also occur: hm ‘in’ is normally hn, but its ‘n’ becomes ‘m’ before the labial ‘p’, a process which does not occur across adjacent bound groups.
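To make the assimilation rule concrete, here is a minimal Python sketch using the romanized forms from the example. It is a toy rule covering only the case named above (‘n’ becoming ‘m’ before ‘p’); the function name and the list-of-lists representation of bound groups are assumptions made for illustration, not part of our tools.

```python
def assimilate(bound_groups):
    """Toy rule: inside a bound group, a segment-final 'n' becomes 'm'
    when the next segment starts with 'p'. Nothing happens across groups."""
    result = []
    for group in bound_groups:
        new_group = list(group)
        for i in range(len(new_group) - 1):
            segment, following = new_group[i], new_group[i + 1]
            if segment.endswith("n") and following.startswith("p"):
                new_group[i] = segment[:-1] + "m"
        result.append(new_group)
    return result


# 'ebol' is its own bound group; 'hn', 'p', 'ro' are bound together.
print(assimilate([["ebol"], ["hn", "p", "ro"]]))
# [['ebol'], ['hm', 'p', 'ro']]
```

Because the rule only looks at neighbors inside one group, the same hn followed by p in a separate bound group is left unchanged.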

But many texts which we would like to make available online are transcribed using very different conventions, such as (2), from the Life of Cyrus, previously transcribed by the Marcion project following the convention of W. Budge’s (1914) edition:

 

Coptic original:    ⲁ    ⲡⲥ̅ⲏ̅ⲣ̅               ⲉⲓ  ⲉ ⲃⲟⲗ    ϩⲙ̅ ⲡⲣⲟ  (Life of Cyrus, BritMusOriental6783)

Romanized:           a     p|sēr              ei   e bol   hm p|ro

Gloss:                        did the|savior go to-out in the|door

Translation:          The savior went out of the door

 

Budge’s edition usually (but not always) spells prepositions apart, articles together and the word ebol in two parts, e + bol. These specific cases are not hard to list, but others are more difficult: the past auxiliary is just a, and is usually spelled together with the subject, here ‘savior’. However, ‘savior’ has been spelled as an abbreviation, sēr for sōtēr, making it harder to recognize that a is followed by a noun and is therefore likely to be the past tense marker. This matters because not all cases of a should be bound. The problem is further complicated by the fact that words in the edition also break across lines, meaning we sometimes need to decide whether to fuse parts of words that are arbitrarily broken across typesetting boundaries as well.

The amount of material available in varying standards is too large to manually normalize each instance to a single form, raising the question of how we can deal with these differences automatically. In the next posts we will look at how white space can be normalized using training data, rule-based morphology and machine learning tools, and how we can recover standard spellings to ensure uniform searchability and online dictionary linking.

 

References

Layton, B. (2011). A Coptic Grammar. (Porta linguarum orientalium 20.) Wiesbaden: Harrassowitz.

Budge, E.A.W. (1914). Coptic Martyrdoms in the Dialect of Upper Egypt. London: Oxford University Press.

On the Road Summer 2019

Coptic Scriptorium is busy this summer conference season.

I had the privilege of teaching one of the Sunoikisis Digital Classicist summer sessions earlier in July.

The UCLA-St Shenouda Society conference participants, 2019

I also presented some research on girls and girlhood using the Coptic Scriptorium Corpora and the Online Coptic Dictionary at the annual UCLA-St. Shenouda Society Coptic Studies Conference.  This year was the 20th anniversary conference, and the theme was Shenoute and the White Monastery.

C. Schroeder presenting at ACH 2019; photo courtesy Melissa Dollman via Twitter

This week, the Association for Computers and the Humanities (ACH), the American digital humanities organization, held a conference in Pittsburgh.  There I talked about colonialism, Coptic manuscripts, and resisting continuing colonialist tendencies in digitizing these manuscripts.

Meanwhile we’ve also been working on digitizing and annotating more texts, which we hope to release in the fall.

Happy summer everyone!

Congratulating our colleagues!

Two pillars in the fields of Digital Humanities, cultural heritage, and the manuscripts of the Eastern Mediterranean world received honors this month, and we at Coptic Scriptorium wish to congratulate them both.

Tito Orlandi at DH2019, photo by Fabio Ciotti via Twitter

Dr. Tito Orlandi was awarded the Busa Prize for lifetime achievement by the Alliance of Digital Humanities Organizations at the annual Digital Humanities Conference in Utrecht.  This honor is bestowed only every three years and thus is quite a distinguished award. Tito’s work in text encoding, developing stable identifiers for manuscripts, digital lexica, and digitization has been foundational for Coptic Studies.  He founded the Corpus dei Manoscritti Copti Letterari (CMCL) project.

Columba Stewart, OSB, D.Phil, photograph from the HMML site

Dr. Columba Stewart, Director of the Hill Museum and Manuscript Library at St. John’s University, has been named the Jefferson Lecturer by the National Endowment for the Humanities for 2019.  Previous recipients of this honor include Toni Morrison and John Updike.  Columba’s scholarship on early monasticism, especially Evagrius and Cassian, is well-known, widely respected, and oft-cited.  He is being honored by the NEH in particular for his work at HMML to collaborate with communities in the Middle East to photograph and preserve manuscripts from both Christian and Muslim communities and traditions that are endangered for various political, cultural, and geographic reasons.

On a personal note, Tito has been a supportive colleague since long before Coptic Scriptorium existed.  At my first Congress of the International Association of Coptic Studies in Leiden in 2000, Tito chaired the session in which I gave my paper.  I will never forget that when my slides first went up on the screen, with one of my photographs of the White Monastery Church, he warmly remarked how happy it made him to see the White Monastery.  This sounds like a small thing, but for a grad student at this international conference for the first time, it was a reassuring way to start my paper.  When we began Coptic Scriptorium, Tito shared with us his digital lexica, which allowed us to shave at least a year off of our labors. Conversations with Tito over the years have enriched our work.

Likewise, Columba has been a kind and generous colleague and mentor since we first met in 1999 at the Oxford Patristics Conference.  Columba’s research on early monasticism has inspired me for a long time, and his work at HMML and the vHMML online reading room is a model for public-facing cultural heritage preservation and collaborations between American scholars and heritage communities in the Middle East.  Columba’s work is sometimes framed as saving manuscripts from ISIS, but Columba himself talks about the American role in the loss of cultural heritage in the Middle East and is, in my opinion, open about the geopolitical and colonialist power dynamics at work. As I said more informally to some friends on social media, Columba is 100% the real deal.

Additionally, for those of us who work on Christianity in the ancient Eastern Mediterranean and the languages and manuscripts of these communities, these two awards cast a warm glow over the whole field.  Thank you Columba and Tito for your work, and thank you to the ADHO and the NEH for honoring them and by extension their areas of work.

A warm, sincere congratulations to you both!

Comprehensive Coptic Lexicon v1 on Coptic Dictionary Online

The “Database and Dictionary of Greek Loanwords in Coptic” (DDGLC, Freie Universität Berlin), the research project “Strukturen und Transformationen des Wortschatzes der ägyptischen Sprache” / “Thesaurus Linguae Aegyptiae” (TLA, Berlin-Brandenburgische Akademie der Wissenschaften) and “Coptic Scriptorium: Digital Research in Coptic Language and Literature” are happy to announce the release of version 1 of the “Comprehensive Coptic Lexicon”. The processed data has been published by the Coptic Dictionary Online:

  • Coptic Dictionary Online, ed. by the Koptische/Coptic Electronic Language and Literature International Alliance (KELLIA), https://coptic-dictionary.org/

The raw data can be downloaded at:

  • D. Burns, F. Feder, K. John, M. Kupreyev, et al. 12.5.2019. Comprehensive Coptic Lexicon: Including Loanwords from Ancient Greek, Berlin: Freie Universität Berlin, https://doi.org/10.17169/refubium-2333

The Comprehensive Coptic Lexicon includes ca. 8,000 Egyptian-Coptic lemmata with ca. 10,000 word forms, as well as ca. 3,250 Greek-Coptic lemmata with ca. 10,000 forms.

DDGLC, TLA and Coptic Scriptorium invite you to take a look at the new data and would welcome your feedback.

We’re hiring!

We are hiring a Digital Humanities specialist!  The full job ad is on the Georgetown U HR site. In short:

  • this is a part-time (12-15 hrs/week) paid DH Specialist position (so paid but no benefits);
  • knowledge of Coptic “strongly preferred”;
  • needs to be able to work in the United States legally;
  • living in Washington, DC, preferred but not required; remote OK if willing to attend virtual meetings and travel to DC on occasion for team meetings;
  • digital skills a plus but not required.

Perfect for a grad student or other researcher in Classics, Linguistics, Near Eastern Studies, Ancient History, Papyrology, Religious Studies, etc., looking to add part-time work to their schedule AND expand their digital skill set.  Apply via the site! We are looking to hire soon.

 

Spring 2019 Corpora Release 2.7.0

We at Coptic Scriptorium are pleased to announce version 2.7.0 of our corpora.  The release includes several new documents:

  • several more sayings in the Coptic Apophthegmata Patrum (edited & annotated by Marina Ghaly)
  • additional fragments of Shenoute’s sermon Some Kinds of People Sift Dirt (edited & annotated by Christine Luckritz Marquis, editions provided by David Brakke)
  • Besa’s letter On Vigilance (edited and annotated by So Miyagawa and others)
  • several more fragments of the monastic canons of Apa Johannes (annotated by Elizabeth Platte and Caroline T. Schroeder, digital edition provided by Diliana Atanassova)

All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold).

You can search all corpora at https://corpling.uis.georgetown.edu/annis/scriptorium and download the data in 4 formats (relANNIS database files, PAULA XML files, TEI XML files, and SGML files in Tree-tagger format).

Our total annotated corpora are now at over 780,000 words; corpora that have human editors who reviewed the machine annotations amount to over 100,000 words.

Enjoy!

Corpora release 2.6

We are pleased to announce release 2.6 of our corpora! Some exciting new things:

  • Expanded Coptic Old Testament
  • More gold-standard treebanked texts
  • Updated files of Shenoute’s Abraham Our Father and Acephalous Work 22
  • New metadata fields to indicate whether documents have been machine annotated or if an editor has reviewed the machine annotations

Expanded Coptic Old Testament

Our Coptic Old Testament corpus is updated and expanded, with digital text from our partners at the Digital Edition of the Coptic Old Testament project in Goettingen.  All annotations in this corpus are fully machine-processed (no human editing, because it’s BIG). You can read through all the text in two different visualizations online and search it in the ANNIS database:

  1. analytic: the normalized text segmented into words aligned with part of speech tags; each verse is aligned with Brenton’s English translation of the Septuagint
  2. chapter: the normalized text presented as chapters and verses; each word links to the online Coptic dictionary
  3. ANNIS search: full search of text, lemmas, parts of speech, syntactic annotations, etc. (see our ANNIS tips if you’re new to ANNIS)

Please keep in mind this corpus is fully machine-annotated, and we currently do not have the capacity to make manual changes to a corpus of this size.  If you notice systemic errors (the same thing tagged incorrectly often, for example) please let us know.  Otherwise, please be patient: as the tools improve, we will update this corpus.

We’ve also machine-aligned the text with Brenton’s English translation of the Septuagint. It’s possible there will be some misalignments.  Thanks for your understanding!

Treebanks

We’ve added more documents to our separate gold-standard treebank corpus.  (Want to learn more about treebanks?) In this corpus, the treebank/syntactic annotations have been manually corrected; the documents are part of the Universal Dependencies project for cross-language linguistics research.  New treebanked documents include selections from 1 Corinthians, the Gospel of Mark, Shenoute’s Abraham Our Father, Shenoute’s Acephalous Work 22, and the Martyrdom of Victor.  This means the self-standing treebank corpus is expanded, and any documents we’ve treebanked have updated word segmentation, part of speech tagging, etc., in their regular corpora.

Updated Shenoute Documents

Documents in the corpora for Shenoute’s Abraham Our Father and Acephalous Work 22 have several updates.

First, some documents are in our treebank corpus and are now significantly more accurate in terms of word segmentation, tagging, etc.

Second, we’ve added chapter and verse segmentation to these works.  Since there are no comprehensive print editions of these works with versification, we’ve applied our own chapter and verse numbers.  We recognize that versification is arbitrary, but nonetheless useful for citation.  For texts transcribed from manuscripts, chapter divisions typically occur when an ekthesis occurs in the manuscript. (Ekthesis describes a letter hanging in the margin.)  They do not necessarily occur with each ekthesis (if ekthesis is very frequent), but we try to make the divisions occur only with ekthesis.  Verses typically equal one sentence, sometimes more than one sentence per verse for very short sentences or more than one verse per sentence for very long Shenoutean sentences.

Third, we’ve added “Order” metadata to make it easier to read a work in order if it’s broken into multiple documents.  Check out Abraham Our Father, for example: the first document in the list is the beginning of the work.

Screen shot of list of documents in Abraham Our Father Corpus

When you’re reading through a document, click on “Next” to get the next document in reading order.  (If there are multiple manuscript witnesses to a work, we’ll send you to the next document in order with the fewest lacunae, or missing segments.)

Screen shot of beginning of Abraham Our Father

Of course, you can always click on documents in any order you want to read however you like!

And everything is fully searchable across all documents in ANNIS.

New Metadata Fields Documenting Annotation

We sometimes get asked: which corpora do scholars annotate and which corpora are machine-annotated?  The answer is complicated: almost everything is machine annotated, with different levels of scholarly review.  So we’re adding three new metadata fields to help show users what kinds of annotation each document gets:

  • Segmentation refers to word segmentation (or “tokenization”) performed by the tokenizer tool.
  • Tagging refers to part of speech, language of origin, and lemma tagging performed by our tagger
  • Parsing refers to dependency syntax annotations (which are part of our treebanking)

Each of these fields contains one of the following values:

  • automatic: fully machine annotated; no manual review or corrections to the tool output
  • checked: the tool has annotated the text, and a scholar has reviewed the annotations before publication
  • gold: the tools have been run and the annotations have received thorough review; this value usually applies only to documents that have been treebanked by a scholar (requiring rigorous review of word segmentation and tagging along the way)
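For readers who download the data and want to select documents by review level, the three fields and three values above can be modeled roughly as follows. This is a sketch only: the class, the lowercase field names, the sample documents, and the ordering automatic < checked < gold are assumptions made for illustration, not a description of our file formats.

```python
from dataclasses import dataclass

# Assumed ordering of review levels, from least to most reviewed.
LEVELS = {"automatic": 0, "checked": 1, "gold": 2}


@dataclass
class DocMeta:
    title: str
    segmentation: str
    tagging: str
    parsing: str


def at_least(doc, field, level):
    """True if the given annotation field meets or exceeds `level`."""
    return LEVELS[getattr(doc, field)] >= LEVELS[level]


# Hypothetical documents, not real corpus entries.
docs = [
    DocMeta("doc_a", "checked", "checked", "automatic"),
    DocMeta("doc_b", "gold", "gold", "gold"),
    DocMeta("doc_c", "automatic", "automatic", "automatic"),
]

reviewed = [d.title for d in docs if at_least(d, "segmentation", "checked")]
print(reviewed)  # ['doc_a', 'doc_b']
```

The same helper can be used with "tagging" or "parsing" to pick out, say, only documents with gold syntax.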

For example, in the first image of document metadata visible in ANNIS, the document has automatic parsing; a scholar has checked the word segmentation and tagging.

 

Screenshot of document metadata showing checked word segmentation and tagging

In the next image of document metadata, a scholar has treebanked the text, making segmentation, tagging, and parsing all gold.

Screenshot of document metadata showing gold level annotations

 

We are rolling out these annotations with each new corpus and newly edited corpus; not every corpus has them, yet — only the ones in this release.  Our New Testament and Old Testament corpora are machine-annotated (automatic) in all annotations.

 

We hope you enjoy!

 

Rebecca Krawiec Featured Digital Humanities Researcher at Canisius

Rebecca Krawiec presenting in the Canisius College Digital Humanities speaker series

Project participant Rebecca Krawiec, Professor and Chair of Religious Studies and Theology at Canisius College, presented her work with Coptic Scriptorium as part of the Digital Humanities speaker series at Canisius. Her talk, “Studying Ancient Egyptian Christianity in a Modern Digital World,” discussed how the many layers of annotation in Coptic Scriptorium’s corpora enhance research into late antique Christianity.  Read a description of this event on the Canisius College website or watch the lecture online.

Recent presentations by Coptic Scriptorium team members (post 1 of 2)!

This fall, Coptic Scriptorium team members have presented their work in a number of environments.

Research Talk, Georgetown University Linguistics Speaker Series

In September, as part of the Georgetown University Department of Linguistics Friday Speaker Series, the project presented a summary of our latest work and our goals for the new NEH Digital Humanities Advancement Grant we received, “A Linked Digital Environment for Coptic Studies”.  Caroline T. Schroeder provided an overview of the project. Amir Zeldes presented the technology required to machine-process Coptic text in order to produce an annotated, digital corpus and linked online lexicon. Rebecca Krawiec discussed the research potential of an annotated digital corpus for research in early monasticism. Elizabeth Platte introduced the concept of linked data and demonstrated our linked geographic data features. (Christine Luckritz Marquis was scheduled to present research on space and place in monastic literature but was unfortunately sidelined by a hurricane.)

Rebecca Krawiec, Elizabeth Platte, Amir Zeldes, Caroline T. Schroeder at Georgetown University, 2018

Material of Christian Apocrypha Conference

In December, Caroline T. Schroeder gave a paper at the Material of Christian Apocrypha Conference hosted at the University of Virginia, under the auspices of the North American Society for the Study of Christian Apocryphal Literature.  Dr. Schroeder’s paper, “The Materiality of Digital Apocryphal Studies,” addressed the role of digital humanities in studying the colonial history of manuscripts, people and places in early Christian literature, and public humanities.  It was part of a panel on Christian Apocrypha and the Digital Humanities, which also included papers by James Walters (Rochester College) on “The Digital Syriac Corpus: A New Resource for the Study of Syriac Texts” and  Brandon Hawk (Rhode Island College) on “The Medieval Social Network of the Gospel of Pseudo-Matthew”.  Datasets used in the presentation are available at Dr. Schroeder’s GitHub site.

Caroline T. Schroeder presenting about the manuscripts digitized by Coptic Scriptorium

Caroline T. Schroeder presenting visualizations of occurrences of proper names in some of Coptic Scriptorium's corpora

New features in our NLP pipeline

Coptic Scriptorium’s Natural Language Processing (NLP) tools now support two new features:

  • Multiword expression recognition
  • Detokenization (bound group re-merging)

Kicking off work on the new phase of our project, these new tools will improve inter-operability of Coptic data across corpora, lexical resources and projects:

Multiword expressions

The multiword expression ⲉⲃⲟⲗ ϩⲛ "out of" (from Apophthegmata Patrum 27, MOBG EG 67. Image: Österreichische Nationalbibliothek)

Although lemmatization and normalization already offer a good way of finding base forms of Coptic words, many complex expressions cross word borders in Coptic. For example, although it is possible to understand combinations such as ⲉⲃⲟⲗ ‘out’ + ϩⲛ ‘in’, or ⲥⲱⲧⲙ ‘hear’ + ⲛⲥⲁ ‘behind’ approximately from the meaning of each word, together they have special senses, such as ‘out of’ and ‘obey’ respectively.  These and similar combinations are distinct enough from their constituents that they receive their own lexicon entries in dictionaries, for example in the Coptic Dictionary Online (CDO), compare: ⲥⲱⲧⲙ, ⲛⲥⲁ and ⲥⲱⲧⲙ ⲛⲥⲁ.

Thanks to the availability of the CDO’s data,  the NLP tools can now attempt to detect known multiword expressions, which can then be linked back to the dictionary and used to collect frequencies for complex items. 
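As a rough illustration of what detecting known multiword expressions can look like, the Python sketch below does a greedy longest-match lookup over a sequence of lemmas. It is not the code in our pipeline: the tiny lexicon (the two expressions mentioned above), the function name, and the restriction to contiguous sequences are all assumptions made for the example.

```python
# Hypothetical mini-lexicon keyed by lemma sequences; glosses from the text above.
MWE_LEXICON = {
    ("ⲉⲃⲟⲗ", "ϩⲛ"): "out of",
    ("ⲥⲱⲧⲙ", "ⲛⲥⲁ"): "obey",
}
MAX_LEN = max(len(key) for key in MWE_LEXICON)


def find_mwes(lemmas):
    """Return (start, end, expression) for each contiguous known expression,
    preferring the longest match at each position."""
    matches = []
    i = 0
    while i < len(lemmas):
        for n in range(min(MAX_LEN, len(lemmas) - i), 1, -1):
            candidate = tuple(lemmas[i:i + n])
            if candidate in MWE_LEXICON:
                matches.append((i, i + n, " ".join(candidate)))
                i += n
                break
        else:
            # No expression starts here; move on by one lemma.
            i += 1
    return matches


print(find_mwes(["ⲁ", "ϥ", "ⲉⲓ", "ⲉⲃⲟⲗ", "ϩⲛ", "ⲡ", "ⲣⲟ"]))
# [(3, 5, 'ⲉⲃⲟⲗ ϩⲛ')]
```

Matching on lemmas rather than surface forms means that an assimilated spelling such as ϩⲙ can still be found, provided it has been lemmatized to ϩⲛ first.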

Many thanks to Maxim Kupreyev for his help in setting up multiword expressions in the dictionary, as well as to Frank Feder, So Miyagawa, Sebastian Richter and other KELLIA collaborators for making these lexical resources available!

Detokenization

Coptic bound groups have been written with intervening spaces according to a number of similar but subtly different traditions, such as Walter Till’s system and the system employed by Bentley Layton’s  Coptic Grammar, which Coptic Scriptorium employs. The differences between these and other segmentation traditions can create some problems:

  1. Users searching in multiple corpora may be surprised when queries behave differently due to segmentation differences.
  2. Machine learning tools trained on one standard degrade in performance when the data they analyze uses a different standard.

In order to address these issues and have more consistent and more accurately analyzed data, we have added a component to our tools which can attempt to merge bound groups into ‘Laytonian’ bound groups. In Computational Linguistics, re-segmenting a segmented text is referred to as ‘detokenization’, but for our tools this has also been affectionately termed ‘Laytonization’. The new detokenizer has several options to choose from:

  1. No merging – this is the behavior of our tools to date, no modifications are undertaken.
  2. Conservative merging mode – in conservative merging, only items known to be spelled apart in different segmentations are merged. For example, in the sequence ϩⲙ ⲡⲏⲓ “in the-house”, the word ϩⲙ “in” is typically spelled apart in Till’s system, but together in Layton’s. This type of sequence would be merged in conservative mode.
  3. Aggressive merging mode – in this mode, anything that is most often spelled bound in our training data is merged. This is done even if the segment being bound by the system is not one that would normally be spelled apart in some other conventional system. For example, in the sequence ⲁ ϥⲥⲱⲧⲙ “(PAST) he heard”, the past tense marker is a unit that no Coptic orthographic convention spells apart. It is relatively unlikely that it should stand apart in normal Coptic text in any convention, so in aggressive mode it would be merged as well.
  4. Segment at merge point – regardless of the merging mode chosen, if any merging occurs, this option enforces the presence of a morphological boundary at any merge point. This ensures that merged items are always assigned to separate underlying words, and receive part of speech annotations accordingly, even if our machine learning segmenter does not predict that the merged bound group should be segmented in this way.
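The four options can be pictured with a small Python sketch. It is a simplification under stated assumptions, not our detokenizer: the two word lists stand in for knowledge that the real tool draws from training data, the function and parameter names are hypothetical, and context is ignored entirely (in practice not every ⲁ should be bound). The ‘|’ character marks a segment boundary inside a bound group.

```python
# Stand-in lists (assumptions): items spelled apart in other conventions,
# and items that would mostly be bound in training data.
CONSERVATIVE = {"ϩⲙ", "ϩⲛ"}
AGGRESSIVE = CONSERVATIVE | {"ⲁ"}


def detokenize(tokens, mode="none", segment_at_merge=False):
    """Merge mergeable tokens onto the following token.

    mode: "none", "conservative" or "aggressive".
    segment_at_merge: if True, keep a '|' boundary at each merge point.
    """
    if mode == "none":
        return list(tokens)
    if mode not in ("conservative", "aggressive"):
        raise ValueError(f"unknown mode: {mode}")
    mergeable = CONSERVATIVE if mode == "conservative" else AGGRESSIVE
    merged, prefix = [], ""
    for i, token in enumerate(tokens):
        if token in mergeable and i + 1 < len(tokens):
            # Hold this token and attach it to whatever comes next.
            prefix += token + ("|" if segment_at_merge else "")
        else:
            merged.append(prefix + token)
            prefix = ""
    return merged


print(detokenize(["ϩⲙ", "ⲡⲏⲓ"], mode="conservative"))
# ['ϩⲙⲡⲏⲓ']
print(detokenize(["ⲁ", "ϥⲥⲱⲧⲙ"], mode="conservative"))
# ['ⲁ', 'ϥⲥⲱⲧⲙ']
print(detokenize(["ⲁ", "ϥⲥⲱⲧⲙ"], mode="aggressive", segment_at_merge=True))
# ['ⲁ|ϥⲥⲱⲧⲙ']
```

Keeping the boundary marker at the merge point corresponds to the fourth option: the merged pieces stay separate underlying words for later tagging, whatever a downstream segmenter would have predicted.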

The use of these options is expected to correspond more or less to the type of input text: for carefully edited text from a different convention (e.g. Till), conservative merging with segmentation at merge points is recommended. For ‘messier’ text (e.g. older digitized editions with varying conventions, such as editions by Wallis Budge, or material from automatic Optical Character Recognition), aggressive merging is advised, and we may not necessarily want to assume that segments should be introduced at merge points.

We hope these tools will be useful and expect to see them create more consistency, higher accuracy and inter-operability between resources in the near future!
