corpora – Page 2 – Coptic SCRIPTORIUM Blog

Tag: corpora (page 2 of 5)

Winter 2020 Corpora Release 3.1.0

It is our pleasure to announce a new data release, with a variety of new sources from our collaborators (including more digitized data courtesy of the Marcion and PAThs projects and other scholars). New in this release are:

Saints’ lives and martyrologies
- Martyrdom of Victor the General (parts 3-8; this work is now complete)
- Life of Aphou
- Life of Paul of Tamma
- Life of Phib
More works by Archimandrite Shenoute of Atripe:
Miscellaneous
- Three Discourses of Pseudo-Athanasius:
- The Instructions of Apa Pachomius
- Canons of Apa Johannes (5 new and revised documents, digital edition provided by Diliana Atanassova)

All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold).

You can search all corpora using ANNIS and download the data in 4 formats (relANNIS database files, PAULA XML files, TEI XML files, and SGML files in Tree-tagger format): browse on GitHub. If you just want to read works, cite project data or browse metadata, you can use our updated repository browser, the Canonical Text Services browser and URN resolver:

http://data.copticscriptorium.org/

The new material in this release includes some 78,000 tokens in 33 documents and represents a tremendous amount of work by our project members and collaborators. We would like to thank the individual contributors (whose names you can find in the ‘annotation’ metadata), the Marcion and PAThs projects, which shared their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources and new kinds of annotations and tools, which we plan to make available in the summer. Please let us know if you have any feedback!

Fall 2019 Corpora Release 3.0.0

September 30, 2019 / Amir Zeldes

Coptic Scriptorium is happy to announce our latest data release, including a variety of new sources thanks to our collaborators (digitized data courtesy of the Marcion and PAThs projects!). New in this release are:

Saints’ lives
- Life of Cyrus
- Life of Onnophrius
- Lives of Longinus and Lucius
- Martyrdom of Victor the General (part 2)
Miscellaneous:
- Dormition of John
- Homilies of Proclus
- Letter of Pseudo-Ephrem

We are also releasing expansions to some of our existing corpora, including:

Canons of Johannes (new material annotated by Elizabeth Platte and Caroline T. Schroeder, digital edition provided by Diliana Atanassova)
Apophthegmata Patrum
A large number of corrections to most of our existing corpora, which are being republished in this release.

http://data.copticscriptorium.org/

Our annotated corpora now total over 850,000 words; corpora whose machine annotations have been reviewed by human editors now amount to over 150,000 words!

We would like to thank Marcion, PAThs and the National Endowment for the Humanities for supporting us – we hope this release will be useful and are already working on more!

Dealing with Heterogeneous Low Resource Data – Part I

August 26, 2019 / Amir Zeldes

Image from Budge’s (1914) Coptic Martyrdoms in the Dialect of Upper Egypt (scan made available by archive.org)

(This post is part of a series on our work in summer 2019 to improve processing for non-standardized Coptic resources.)

A major challenge for Coptic Scriptorium as we expand to cover texts from other genres, with different authors, styles and transcription practices, is how to make everything uniform. For example, our previously released data has very specific transcription conventions with respect to what to spell together, based on Layton’s (2011:22-27) concept of bound groups, how to normalize spellings, what base forms to lemmatize words to, and how to segment and analyze groups of words internally.

An example of our standard is shown below, with segments inside groups separated by ‘|’:

Coptic original: ⲉⲃⲟⲗ ϩⲙ̅|ⲡ|ⲣⲟ (Genesis 18:2)

Romanized: ebol hm|p|ro

Translation: out of the door

The words hm ‘in’, p ‘the’ and ro ‘door’ are spelled together, since they are phonologically bound: similarly to words spelled together in Arabic or Hebrew, the entire phrase carries one stress (on the word ‘door’) and no words may be inserted between them. Assimilation processes specific to bound groups also occur: hm ‘in’ is normally hn, but its ‘n’ becomes ‘m’ before the labial ‘p’. This assimilation does not occur across adjacent bound groups.
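To make the notation concrete, here is a minimal Python sketch that reads a line written in this convention, treating whitespace as the boundary between bound groups and ‘|’ as the boundary between segments inside a group. The function name parse_bound_groups is purely illustrative and is not part of any Coptic Scriptorium tool.

```python
def parse_bound_groups(line: str) -> list[list[str]]:
    """Split a transcription into bound groups (on whitespace),
    then split each bound group into its segments (on '|')."""
    return [group.split("|") for group in line.split()]


# Romanized form of the Genesis 18:2 example above
groups = parse_bound_groups("ebol hm|p|ro")
print(groups)  # [['ebol'], ['hm', 'p', 'ro']]

# Flatten to a plain list of segments, e.g. for tagging or lemmatization
segments = [segment for group in groups for segment in group]
print(segments)  # ['ebol', 'hm', 'p', 'ro']
```

The same function works unchanged on the Coptic script, since it only relies on whitespace and the ‘|’ separator.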

But many texts which we would like to make available online are transcribed using very different conventions, such as the following example from the Life of Cyrus, previously transcribed by the Marcion project following the convention of W. Budge’s (1914) edition:

Coptic original: ⲁ ⲡⲥ̅ⲏ̅ⲣ̅ ⲉⲓ ⲉ ⲃⲟⲗ ϩⲙ̅ ⲡⲣⲟ (Life of Cyrus, BritMusOriental6783)

Romanized: a p|sēr ei e bol hm p|ro

Gloss: did the|savior go to-out in the|door

Translation: The savior went out of the door

Budge’s edition usually (but not always) spells prepositions apart, articles together and the word ebol in two parts, e + bol. These specific cases are not hard to list, but others are more difficult: the past auxiliary is just a, and is usually spelled together with the subject, here ‘savior’. However, ‘savior’ has been spelled as an abbreviation: sēr for sōtēr, making it harder to recognize that a is followed by a noun and is likely to be the past tense marker, and not all cases of a should be bound. This is further complicated by the fact that words in the edition also break across lines, meaning we sometimes need to decide whether to fuse parts of words that are arbitrarily broken across typesetting boundaries as well.
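The "easy to list" cases can be sketched as a handful of rules over adjacent whitespace-separated units. The Python below is only an illustration under stated assumptions: the PREPOSITIONS and SPLIT_WORDS tables are tiny hypothetical lists we made up for this example, and the function is not the project's actual normalizer. It deliberately leaves the auxiliary a alone, since, as noted above, not every a should be bound.

```python
# Hypothetical, deliberately incomplete rule tables (romanized forms)
PREPOSITIONS = {"hm", "hn"}            # prepositions to bind to the following group
SPLIT_WORDS = {("e", "bol"): "ebol"}   # words the edition prints in two parts


def normalize_spacing(line: str) -> str:
    """Apply the two listable rules: rejoin split words and
    bind listed prepositions to the group that follows them."""
    units = line.split()
    out = []
    i = 0
    while i < len(units):
        cur = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if nxt is not None and (cur, nxt) in SPLIT_WORDS:
            out.append(SPLIT_WORDS[(cur, nxt)])   # e + bol -> ebol
            i += 2
        elif nxt is not None and cur in PREPOSITIONS:
            out.append(cur + "|" + nxt)           # hm + p|ro -> hm|p|ro
            i += 2
        else:
            out.append(cur)                       # leave everything else untouched
            i += 1
    return " ".join(out)


print(normalize_spacing("a p|sēr ei e bol hm p|ro"))
# a p|sēr ei ebol hm|p|ro
```

Rules like these cover only the predictable cases; deciding whether a belongs with the following noun, or whether a line-broken word should be fused, needs the statistical methods discussed later in this series.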

The amount of material available in varying standards is too large to manually normalize each instance to a single form, raising the question of how we can deal with these automatically. In the next posts we will look at how white space can be normalized using training data, rule based morphology and machine learning tools, and how we can recover standard spellings to ensure uniform searchability and online dictionary linking.

References

Layton, B. (2011). A Coptic Grammar. (Porta linguarum orientalium 20.) Wiesbaden: Harrassowitz.

Budge, E.A.W. (1914) Coptic Martyrdoms in the Dialect of Upper Egypt. London: Oxford University Press.

Spring 2019 Corpora Release 2.7.0

June 11, 2019 / ctschroeder

We at Coptic Scriptorium are pleased to announce version 2.7.0 of our corpora. The release includes several new documents:

several more sayings in the Coptic Apophthegmata Patrum (edited & annotated by Marina Ghaly)
additional fragments of Shenoute’s sermon Some Kinds of People Sift Dirt (edited & annotated by Christine Luckritz Marquis, editions provided by David Brakke)
Besa’s letter On Vigilance (edited and annotated by So Miyagawa and others)
several more fragments of the monastic canons of Apa Johannes (annotated by Elizabeth Platte and Caroline T. Schroeder, digital edition provided by Diliana Atanassova)

You can search all corpora at https://corpling.uis.georgetown.edu/annis/scriptorium and download the data in 4 formats (relANNIS database files, PAULA XML files, TEI XML files, and SGML files in Tree-tagger format).

Our total annotated corpora are now at over 780,000 words; corpora that have human editors who reviewed the machine annotations amount to over 100,000 words.

Enjoy!

Screen Shot of Coptic Old Testament Documents

Corpora release 2.6

January 10, 2019 / ctschroeder

We are pleased to announce release 2.6 of our corpora! Some exciting new things:

Expanded Coptic Old Testament
More gold-standard treebanked texts
Updated files of Shenoute’s Abraham Our Father and Acephalous Work 22
New metadata fields to indicate whether documents have been machine annotated or if an editor has reviewed the machine annotations

Expanded Coptic Old Testament

Our Coptic Old Testament corpus is updated and expanded, with digital text from our partners at the Digital Edition of the Coptic Old Testament project in Goettingen. All annotations in this corpus are fully machine-processed (no human editing, because it’s BIG). You can read through all the text in two different visualizations online and search it in the ANNIS database:

analytic: the normalized text segmented into words aligned with part of speech tags; each verse is aligned with Brenton’s English translation of the Septuagint
chapter: the normalized text presented as chapters and verses; each word links to the online Coptic dictionary
ANNIS search: full search of text, lemmas, parts of speech, syntactic annotations, etc. (see our ANNIS tips if you’re new to ANNIS)

Please keep in mind this corpus is fully machine-annotated, and we currently do not have the capacity to make manual changes to a corpus of this size. If you notice systemic errors (the same thing tagged incorrectly often, for example) please let us know. Otherwise, please be patient: as the tools improve, we will update this corpus.

We’ve also machine-aligned the text with Brenton’s English translation of the Septuagint. It’s possible there will be some misalignments. Thanks for your understanding!

Treebanks

We’ve added more documents to our separate gold-standard treebank corpus. (Want to learn more about treebanks?) In this corpus, the treebank/syntactic annotations have been manually corrected; the documents are part of the Universal Dependencies project for cross-language linguistics research. New treebanked documents include selections from 1 Corinthians, the Gospel of Mark, Shenoute’s Abraham Our Father, Shenoute’s Acephalous Work 22, and the Martyrdom of Victor. This means the self-standing treebank corpus is expanded, and any documents we’ve treebanked have updated word segmentation, part of speech tagging, etc., in their regular corpora.

Updated Shenoute Documents

Documents in the corpora for Shenoute’s Abraham Our Father and Acephalous Work 22 have several updates.

First, some documents are in our treebank corpus and are now significantly more accurate in terms of word segmentation, tagging, etc.

Second, we’ve added chapter and verse segmentation to these works. Since there are no comprehensive print editions of these works with versification, we’ve applied our own chapter and verse numbers. We recognize that versification is arbitrary, but nonetheless useful for citation. For texts transcribed from manuscripts, chapter divisions typically occur when an ekthesis occurs in the manuscript. (Ekthesis describes a letter hanging in the margin.) They do not necessarily occur with each ekthesis (if ekthesis is very frequent), but we try to make the divisions occur only with ekthesis. Verses typically equal one sentence, sometimes more than one sentence per verse for very short sentences or more than one verse per sentence for very long Shenoutean sentences.

Third, we’ve added “Order” metadata to make it easier to read a work in order if it’s broken into multiple documents. Check out Abraham Our Father, for example: the first document in the list is the beginning of the work.

Screen shot of list of documents in Abraham Our Father Corpus

When you’re reading through a document, click on “Next” to get the next document in reading order. (If there are multiple manuscript witnesses to a work, we’ll send you to the next document in order with the fewest lacunae, or missing segments.)

Screen shot of beginning of Abraham Our Father

Of course, you can always click on documents in any order you want to read however you like!

And everything is fully searchable across all documents in ANNIS.

New Metadata Fields Documenting Annotation

We sometimes get asked: which corpora do scholars annotate and which corpora are machine-annotated? The answer is complicated: almost everything is machine annotated, with different levels of scholarly review. So we’re adding three new metadata fields to help show users what kinds of annotation each document gets:

Segmentation refers to word segmentation (or “tokenization”) performed by the tokenizer tool.
Tagging refers to part of speech, language of origin, and lemma tagging performed by our tagger
Parsing refers to dependency syntax annotations (which are part of our treebanking)

Each of these fields contains one of the following values:

automatic: fully machine annotated; no manual review or corrections to the tool output
checked: the tool has annotated the text, and a scholar has reviewed the annotations before publication
gold: the tools have been run and the annotations have received thorough review; this value usually applies only to documents that have been treebanked by a scholar (requiring rigorous review of word segmentation and tagging along the way)

For example, in the first image of document metadata visible in ANNIS, the document has automatic parsing; a scholar has checked the word segmentation and tagging.

Screenshot of document metadata showing checked word segmentation and tagging

In the next image of document metadata, a scholar has treebanked the text, making segmentation, tagging, and parsing all gold.

Screenshot of document metadata showing gold level annotations

We are rolling out these annotations with each new corpus and newly edited corpus; not every corpus has them, yet — only the ones in this release. Our New Testament and Old Testament corpora are machine-annotated (automatic) in all annotations.
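If you download document metadata and want to select documents by review level, the three fields and three values described above lend themselves to a simple filter. The Python sketch below is an illustration only: the lowercase dictionary keys, the document names, and the assumption that the values form an ordered scale (automatic < checked < gold) are ours, not a documented export format.

```python
# Assumed ordering of the three values, from least to most reviewed
LEVELS = {"automatic": 0, "checked": 1, "gold": 2}
FIELDS = ("segmentation", "tagging", "parsing")


def meets_level(doc_meta: dict, minimum: str, fields=FIELDS) -> bool:
    """True if every listed field is annotated at `minimum` level or better."""
    return all(LEVELS[doc_meta[field]] >= LEVELS[minimum] for field in fields)


# Hypothetical documents mirroring the two screenshots described above
docs = {
    "doc_checked": {"segmentation": "checked", "tagging": "checked", "parsing": "automatic"},
    "doc_gold": {"segmentation": "gold", "tagging": "gold", "parsing": "gold"},
}

# Documents whose segmentation and tagging were at least checked by a scholar
reviewed = [name for name, meta in docs.items()
            if meets_level(meta, "checked", fields=("segmentation", "tagging"))]
print(reviewed)  # ['doc_checked', 'doc_gold']

# Documents that are gold in all three fields
fully_gold = [name for name, meta in docs.items() if meets_level(meta, "gold")]
print(fully_gold)  # ['doc_gold']
```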

We hope you enjoy!

Coptic Scriptorium’s summer adventures

July 20, 2018 / ctschroeder

This has been a summer of writing, annotating, and conferencing!

German PI Dr. Prof. Heike Behlmer and US PI Caroline T. Schroeder at Schroeder’s recent visit to the Coptic Old Testament Project at the University and the Goettingen Academy.

We are winding up our collaborative grant with our German partners (Coptic Old Testament Project, the Thesaurus Linguae Aegyptiae, the DDGLC, and the INTF). Our German and US PIs met in Göttingen, Germany, earlier this summer. We’re working on writing our final reports and exchanging data and technologies. We’re hoping to publish more annotated texts later this year.

We also have had a series of conference papers, including a paper on one of our collaboration’s proudest achievements, the online Coptic Dictionary. Here are some of the lectures and conference presentations this summer:

Miyagawa, So and Zeldes, Amir (2018) “A Semantic Map of the Coptic Complementizer če Based on Corpus Analysis: Grammaticalization and Areal Typology in Africa,” International Workshop on Semantic maps: Where do we stand and where are we going? Liège, Belgium. June.

Schroeder, Caroline T. (2018) “A Homily is a Homily is a Homily is a Corpus: Digital Approaches to Shenoute,” The Transmission of Early Christian Homilies from Late Antiquity to the Middle Ages Conference, Goethe-Universität Frankfurt am Main. June.

Schroeder, Caroline T. (2018) “Coptic Studies in the Digital Age,” Department of Ancient History, Macquarie University. July.

Schroeder, Caroline T. (2018) “Coptic Studies in a Digital Age,” UCLA-St. Shenouda Foundation Coptic Studies Conference, Los Angeles. July.

Feder, Frank, Maxim Kupreyev, Emma Manning, Caroline T. Schroeder, Amir Zeldes. “A Linked Coptic Dictionary Online”. Proceedings of LaTeCH 2018 – The 11th SIGHUM Workshop at COLING2018. Santa Fe, NM. August. [paper online]

As always, thanks to all our contributors, collaborators, and board members for their insight and labor.

Automatically parsed OT and NT corpora

May 31, 2018 / Amir Zeldes

We are pleased to announce that a new version of the automatically annotated New Testament and Old Testament corpora is now available online in Coptic Scriptorium!

The new version has substantially better automatic segmentation accuracy, and, for the first time, automatic syntactic parses for each verse. For more information on the syntax annotations, please see our previous post here:

https://blog.copticscriptorium.org/2018/05/07/coptic-treebank-2-2-moving-us-to-better-parsing/

Here are some example queries to get you started:

Search the Old Testament for Greek words: lang="Greek"
Search both corpora for the morpheme “ⲙⲛⲧ”: morph="ⲙⲛⲧ"
Find complement clauses of regular verbs in the New Testament: pos="V" ->dep func="ccomp"
Search 1 and 2 Corinthians for verse translations containing “brother”: translation=/.*brother.*/ & meta::title=/.*Corinthians.*/
Find cases of “ⲙⲉⲛ” indirectly followed by “ⲇⲉ” in both corpora: norm="ⲙⲉⲛ" .* norm="ⲇⲉ"
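A note on the regular expressions in the queries above: the patterns are wrapped in .* on both sides because, as far as we understand ANNIS, a regular expression has to match the whole annotation value rather than just a part of it. The Python sketch below imitates that behaviour with re.fullmatch; it is an analogy only (Python's regex dialect is not the one ANNIS uses), and the title values are hypothetical stand-ins for document metadata.

```python
import re


def whole_value_match(pattern: str, value: str) -> bool:
    """Imitate whole-value regex matching: the pattern must cover the entire string."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical title metadata values
titles = ["1 Corinthians", "2 Corinthians", "Mark"]

# With .* on both sides, the pattern finds the word anywhere in the value
print([t for t in titles if whole_value_match(r".*Corinthians.*", t)])
# ['1 Corinthians', '2 Corinthians']

# Without the wildcards, nothing matches, because "1 " is left uncovered
print([t for t in titles if whole_value_match(r"Corinthians", t)])
# []
```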

Thanks as always to the NEH and DFG for their support and to everyone who made the texts available, which come from the Sahidica version of the NT ((c) J. Warren Wells) and the OT text contributed by the CrossWire Bible Society SWORD Project, thanks to work by Christian Askeland, Matthias Schulz and Troy Griffitts.

Coptic Treebank 2.2 – moving us to better parsing!

May 7, 2018 / Amir Zeldes

With the data release of Universal Dependencies 2.2, an update to the Coptic Treebank is now online! Thanks to work by Mitchell Abrams and Liz Davidson we’ve been able to add the first three chapters from 1 Corinthians and make numerous corrections. Another three chapters of 1 Corinthians and a portion of the Martyrdom of Victor the General are coming soon. You can see how we’ve been annotating and the documentation of our guidelines here:

http://universaldependencies.org/cop/

Thanks to the new data, automatic parsing has become somewhat more reliable, allowing us to add automatic parses to the most recent release. The results are better than before, but note we still only expect around 90% accuracy. To illustrate where the computer can’t do what humans can, here are two examples of a verb governing a subordinate verb in a clause marked by Ϫⲉ ‘that’. The subordinate verb usually has one of two labels:

ccomp if it’s a complement clause (I said that…)
advcl if it’s an adverbial clause, such as a causal clause (Ϫⲉ meaning ‘because’).

One of these examples was done by a human who got things right, the other contains a parser error – see if you can spot which is which!
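If you download the treebank and want to see how often each of the two labels occurs, a few lines of Python are enough, since Universal Dependencies data is distributed in the tab-separated CoNLL-U format with the dependency relation in the eighth column. The sketch below is a minimal illustration: the SAMPLE rows are hypothetical toy tokens (w1, w2, ...) that we made up, not real Coptic treebank data, and the function name is our own.

```python
from collections import Counter

# Hypothetical toy rows in CoNLL-U layout (10 tab-separated columns); not real data
SAMPLE = "\n".join([
    "# text = toy example",
    "\t".join(["1", "w1", "_", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "w2", "_", "SCONJ", "_", "_", "3", "mark", "_", "_"]),
    "\t".join(["3", "w3", "_", "VERB", "_", "_", "1", "ccomp", "_", "_"]),
    "\t".join(["4", "w4", "_", "SCONJ", "_", "_", "5", "mark", "_", "_"]),
    "\t".join(["5", "w5", "_", "VERB", "_", "_", "1", "advcl", "_", "_"]),
])


def count_clause_labels(conllu: str, labels=("ccomp", "advcl")) -> Counter:
    """Count how often each of the given dependency labels occurs."""
    counts = Counter()
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip malformed rows, multiword-token ranges and empty nodes
        deprel = cols[7]  # eighth column: dependency relation to the head
        if deprel in labels:
            counts[deprel] += 1
    return counts


counts = count_clause_labels(SAMPLE)
print(counts["ccomp"], counts["advcl"])  # 1 1
```

Comparing such counts between the manually corrected treebank and automatically parsed text is one rough way to see whether the parser over- or under-uses one of the two labels.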

New corpora – release 2.4.0 is out!

November 23, 2017 / Amir Zeldes

We are pleased to announce release 2.4.0, which adds new tagged and lemmatized corpora, available for reading and download at [1] and fully searchable at [2]:

[1] http://data.copticscriptorium.org/

[2] https://corpling.uis.georgetown.edu/annis/scriptorium

This release contains new data contributed by Alin Suciu, David Brakke and Diliana Atanassova, as well as out of copyright edition material contributed by the Marcion project. New data in this release includes excerpts from:

The Martyrdom of Saint Victor the General (2033 tokens)
The Canons of Apa Johannes (438 tokens)
Pseudo-Theophilus On the Cross and The Thief (2814 tokens)
Shenoute, Some Kinds of People Sift Dirt (888 tokens)
11 additional Apophthegmata Patrum, bringing the total released to 63 apophthegms (7077 tokens)

All texts are also linked to the Coptic Dictionary Online (https://corpling.uis.georgetown.edu/coptic-dictionary/), which has been updated with frequency information including these texts. We would like to thank the annotators and translators of these data sets, several of whom are new to the project, without whose work the corpora would not be online:

Alexander Turtureanu, Alin Suciu, Amir Zeldes, Caroline T. Schroeder, Christine Luckritz Marquis, Dana Robinson, David Brakke, David Sriboonreuang, Diliana Atanassova, Elizabeth Davidson, Elizabeth Platte, Gianna Zipp, J. Gregory Given, Janet Timbie, Jennifer Quigley, Laura Slaughter, Lauren McDermott, Marina Ghaly, Mitchell Abrams, Paul Lufter, Rebecca Krawiec, Saskia Franck and Tobias Paul

We hope everyone will find this release useful and look forward to releasing more data in the coming year!

New release of the Coptic Treebank

August 26, 2017 / Amir Zeldes

Coptic Treebank release 2.1, now with three Letters of Besa!

We are pleased to announce the release of the latest version of the Coptic Treebank, now containing three Letters of Besa:

On Lack of Food
To Aphthonia
To Thieving Nuns

This brings the total corpus size up to 10,499 tokens, thanks to annotation work by Elizabeth Davidson and Amir Zeldes, building on earlier transcription and tagging work by Coptic Scriptorium and KELLIA partners. Special thanks are due to So Miyagawa for providing the transcription for On Lack of Food. The corpus will continue to grow as we work to annotate more data and improve the accuracy of our automatic syntax parser for Coptic. You can search the current version of the corpus in ANNIS here:

https://corpling.uis.georgetown.edu/annis/scriptorium

Or download the latest raw annotated data from GitHub here:

https://github.com/universalDependencies/UD_Coptic/tree/dev

Please let us know if you find any errors or have any feedback on the treebank!
