Dealing with Heterogeneous Low Resource Data – Part I

Image from Budge’s (1914), Coptic Martyrdoms
in the Dialect of Upper Egypt
(scan made available by archive.org)

(This post is part of a series on our summer 2019 work on improving processing for non-standardized Coptic resources.)

A major challenge for Coptic Scriptorium as we expand to cover texts from other genres, with different authors, styles and transcription practices, is how to make everything uniform. For example, our previously released data has very specific transcription conventions with respect to what to spell together, based on Layton’s (2011:22-27) concept of bound groups, how to normalize spellings, what base forms to lemmatize words to, and how to segment and analyze groups of words internally.

An example of our standard is shown below, with segments inside bound groups separated by ‘|’:

Coptic original:         ⲉⲃⲟⲗ ϩⲙ̅|ⲡ|ⲣⲟ    (Genesis 18:2)

Romanized:                 ebol hm|p|ro

Translation:                 out of the door

The words hm ‘in’, p ‘the’ and ro ‘door’ are spelled together because they are phonologically bound: as with words spelled together in Arabic or Hebrew, the entire phrase carries one stress (on the word ‘door’) and no words may be inserted between them. Assimilation processes specific to the environment inside bound groups also occur: hm ‘in’ is normally hn, but its ‘n’ becomes ‘m’ before the labial ‘p’, a process that does not occur across adjacent bound groups.
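The notation above can be read mechanically. The following is a minimal Python sketch, not the project's actual tooling: it works on the romanized forms, and the function names parse_bound_groups and assimilate are made up for illustration. The assimilation rule covers only the single case described above (hn before p).

```python
def parse_bound_groups(line):
    """Split a line into bound groups (on spaces) and each group into segments (on '|')."""
    return [group.split("|") for group in line.split()]


def assimilate(segments):
    """Inside one bound group, realize 'hn' as 'hm' when the next segment starts with 'p'.

    Illustrative assumption: only this one assimilation is modeled.
    """
    result = list(segments)
    for i in range(len(result) - 1):
        if result[i] == "hn" and result[i + 1].startswith("p"):
            result[i] = "hm"
    return result


groups = parse_bound_groups("ebol hm|p|ro")
print(groups)                          # [['ebol'], ['hm', 'p', 'ro']]
print(assimilate(["hn", "p", "ro"]))   # ['hm', 'p', 'ro']
```

Because assimilate is applied to one group at a time, an hn at the end of one bound group is never affected by a p at the start of the next, which mirrors the restriction described above.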

But many texts which we would like to make available online are transcribed using very different conventions, as in the following example from the Life of Cyrus, previously transcribed by the Marcion project following the convention of W. Budge’s (1914) edition:

 

Coptic original:    ⲁ    ⲡⲥ̅ⲏ̅ⲣ̅               ⲉⲓ  ⲉ ⲃⲟⲗ    ϩⲙ̅ ⲡⲣⲟ  (Life of Cyrus, BritMusOriental6783)

Romanized:           a     p|sēr              ei   e bol   hm p|ro

Gloss:                        did the|savior go to-out in the|door

Translation:          The savior went out of the door

 

Budge’s edition usually (but not always) spells prepositions apart, articles together and the word ebol in two parts, e + bol. These specific cases are not hard to list, but others are more difficult: the past auxiliary is just a, and is usually spelled together with the subject, here ‘savior’. However, ‘savior’ has been spelled as an abbreviation, sēr for sōtēr, which makes it harder to recognize that a is followed by a noun and is therefore likely to be the past tense marker. Not all cases of a should be bound, so a simple list does not settle the question. This is further complicated by the fact that words in the edition also break across lines, meaning we sometimes need to decide whether to fuse parts of words that are arbitrarily broken across typesetting boundaries as well.
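To make the "easy to list" cases concrete, here is a small Python sketch of a list-based rebinding pass over the romanized example. It is only an illustration under stated assumptions: the function name rebind and the two tiny rule lists are invented for this post, not taken from our pipeline, and the auxiliary a is deliberately left untouched because a list alone cannot decide when it should be bound.

```python
# Toy rule lists, assumed for illustration only.
SPLIT_WORDS = {("e", "bol"): "ebol"}   # words the edition spells in two parts
PREPOSITIONS = {"hm", "hn"}            # prepositions the edition spells apart


def rebind(tokens):
    """Fuse listed split words and attach listed prepositions to the following group."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and (tok, nxt) in SPLIT_WORDS:
            out.append(SPLIT_WORDS[(tok, nxt)])   # e + bol -> ebol
            i += 2
        elif nxt is not None and tok in PREPOSITIONS:
            out.append(tok + "|" + nxt)           # hm + p|ro -> hm|p|ro
            i += 2
        else:
            out.append(tok)                       # everything else is left alone
            i += 1
    return out


budge_style = "a p|sēr ei e bol hm p|ro".split()
print(" ".join(rebind(budge_style)))   # a p|sēr ei ebol hm|p|ro
```

The output still leaves a separate from p|sēr, which is exactly the kind of case that needs more context than a lookup table can provide.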

The amount of material available in varying standards is too large to manually normalize each instance to a single form, raising the question of how we can deal with these differences automatically. In the next posts we will look at how whitespace can be normalized using training data, rule-based morphology and machine learning tools, and how we can recover standard spellings to ensure uniform searchability and online dictionary linking.

 

References

Layton, B. (2011). A Coptic Grammar. (Porta linguarum orientalium 20.) Wiesbaden: Harrassowitz.

Budge, E.A.W. (1914). Coptic Martyrdoms in the Dialect of Upper Egypt. London: Oxford University Press.

December 2016 corpus release (v 2.2.0)

We are happy to release the following new and revised documents to our corpora. A copy of the official release notes is below. The data is available for download from GitHub in TEI XML, PAULA XML, and relANNIS formats. The corpora can be viewed and accessed at data.copticscriptorium.org, and they can all be queried in ANNIS. We plan another release with more documents in March 2017.

As always:  if you have comments or corrections, please submit a pull request on GitHub or send us an email at contact [at] copticscriptorium [dot] org.

____

This corpus release includes new or revised documents for:

  • 1 Corinthians: machine and manual annotations; new documents are chapters 13-16; edits to already published chapters include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines
  • Mark: machine and manual annotations; edits to already published chapters include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines
  • Not Because a Fox Barks (Shenoute): machine and manual annotations; edits to already published document include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines
  • Besa letters: machine and manual annotations; edits to already published documents include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines

All other documents in our corpora are unchanged from the last release.

New metadata and corpus feature: We are beginning to add to our documents a metadata field called “order” which will allow us to present documents in a logical order for browsing or reading. We’ve implemented it in the Besa letters corpus and will roll it out for other corpora in the future. Our Document Retrieval web application (data.copticscriptorium.org) now lists the documents in the order in which they appear in the manuscript tradition when you filter for that corpus. Thus, users who wish to read or browse the documents in that order can do so easily.
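As a rough illustration of what an “order” field makes possible, the Python sketch below sorts a handful of document records by that field. Only the field name “order” comes from the release notes; the record layout, the document titles, the assumption that the value is a number stored as text, and the choice to list documents without the field last are all hypothetical.

```python
# Hypothetical records; only the "order" field name comes from the release notes.
documents = [
    {"title": "Letter C", "order": "3"},
    {"title": "Letter A", "order": "1"},
    {"title": "Fragment without order"},
    {"title": "Letter B", "order": "2"},
]


def reading_order(docs):
    """Sort by the numeric "order" value; documents lacking the field go last."""
    def key(doc):
        order = doc.get("order")
        return (order is None, int(order) if order is not None else 0)
    return sorted(docs, key=key)


for doc in reading_order(documents):
    print(doc["title"])
```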

Version control: We have set the version number on our document metadata, corpus metadata (in ANNIS), and release information (in GitHub) all to match. Version numbers and dates are only revised when a document is revised. So if no documents in our AP corpus have been revised and republished, and no new documents for that corpus have been published, then the version number on the documents and corpus does not change. Only new and newly edited documents (and their corpora) will have version 2.2.0 and date 08 December 2016 in their metadata.
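The versioning rule can be sketched in a few lines of Python. This is a simplified illustration, not our release script: the function name stamp_release, the field names id, version and version_date, the document ids, and the earlier version and date values shown are placeholders; only the 2.2.0 version and the 08 December 2016 date come from the notes above.

```python
RELEASE = {"version": "2.2.0", "date": "08 December 2016"}


def stamp_release(documents, revised_ids, release=RELEASE):
    """Return copies of the documents; only new or revised ones get the release stamp."""
    stamped = []
    for doc in documents:
        doc = dict(doc)  # copy, so the input records are not modified
        if doc["id"] in revised_ids:
            doc["version"] = release["version"]
            doc["version_date"] = release["date"]
        stamped.append(doc)
    return stamped


# Placeholder records: ids, old version and old date are invented for the example.
docs = [
    {"id": "doc-revised", "version": "x.y.z", "version_date": "earlier date"},
    {"id": "doc-unchanged", "version": "x.y.z", "version_date": "earlier date"},
]
for doc in stamp_release(docs, revised_ids={"doc-revised"}):
    print(doc["id"], doc["version"], doc["version_date"])
```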

NEH White Paper (Preservation and Access Grant) published

We at Coptic SCRIPTORIUM have been fortunate to have received three grants from the National Endowment for the Humanities for our work.   We cannot thank the NEH enough for its support.  So much of what we have done over the past 2+ years could not have happened without this funding.

We just completed a White Paper for a Foundations grant from the Humanities Collections and Reference Resources program in the Division of Preservation and Access. The grant, “Coptic SCRIPTORIUM: Digitizing a Corpus for Interdisciplinary Research in Ancient Egyptian,” ran from May 2014 until now.

Our White Paper documents our work and especially the standards and practices we developed for digitizing a pilot Coptic corpus.

If you want to know more about what truly interdisciplinary DH work looks like, check it out. We try to break down the complexities of creating a digital corpus for research in linguistics, history, religious studies, biblical studies, and manuscript studies. We’ve got data models, workflows, digitization standards, transcription guidelines, and more all laid out for you here.

There is so much more to do; this is only a start. Thanks to everyone who has had faith in our work.

White Paper, NEH Grant PW-51672-14 (Preservation and Access): “Coptic SCRIPTORIUM: Digitizing a Corpus for Interdisciplinary Research in Ancient Egyptian” 29 August 2016

New web application to read documents, cite data, and access data (BETA release)

We’re very excited to announce a new feature at Coptic SCRIPTORIUM.  We’ve created a new online web application that we think will allow users to read and reference our material much more easily.

Users can read Coptic documents on HTML pages taken from the data visualizations.  There are also easy links to our search tool ANNIS and to our GitHub repository for downloading files.

And we have a system of canonical URNs that provide persistent identifiers for documents, texts, authors, and text groups. This means you can cite our data in your scholarship, and readers will then be able to go back to our site and find our most recent versions of the documents you have cited.

We’ve got a little video to introduce it, or dive right in at http://data.copticscriptorium.org.

This is a BETA release, which means you might see a few things that need to be ironed out. (For one thing, our small corpus of documentary papyri is not yet in the system; stay tuned, and in the meantime you can still read and query them in ANNIS.) We are pretty pleased with how it’s turning out and look forward to future developments.

Many thanks to Bridget Almas of the Perseus Digital Library for helping us develop a canonical referencing system, and to Archimedes Digital for implementing the application.

 

 

Download release of all corpora in TEI XML, PAULA XML, relANNIS

We’ve released some new corpora (the papyri.info texts, for example) and some new documents to our existing corpora. You can download everything from our GitHub repository in three different formats: TEI XML, PAULA XML, and relANNIS.

Introducing the project texts and data model, and how to use ANNIS

To learn more about Coptic SCRIPTORIUM’s corpora, data model, and features, here is a video on how to use the ANNIS tool to explore the world of Coptic. Thanks go to Caroline T. Schroeder for the video from her YouTube channel.

(Originally posted on copticscriptorium.org)