citation – Coptic SCRIPTORIUM Blog

We are pleased to announce release 2.6 of our corpora! Some exciting new things:

Expanded Coptic Old Testament
More gold-standard treebanked texts
Updated files of Shenoute’s Abraham Our Father and Acephalous Work 22
New metadata fields to indicate whether documents have been machine annotated or if an editor has reviewed the machine annotations

Expanded Coptic Old Testament

Our Coptic Old Testament corpus is updated and expanded, with digital text from the our partners at the Digital Edition of the Coptic Old Testament project in Goettingen. All annotations in this corpora are fully machine-processed (no human editing, because it’s BIG). You can read through all the text in two different visualizations online and search it in the ANNIS database:

analytic: the normalized text segmented into words aligned with part of speech tags; each verse is aligned with Brenton’s English translation of the Septuagint
chapter: the normalized text presented as chapters and verses; each word links to the online Coptic dictionary
ANNIS search: full search of text, lemmas, parts of speech, syntactic annotations, etc. (see our ANNIS tips if you’re new to ANNIS)

Please keep in mind this corpus is fully machine-annotated, and we currently do not have the capacity to make manual changes to a corpus of this size. If you notice systemic errors (the same thing tagged incorrectly often, for example) please let us know. Otherwise, please be patient: as the tools improve, we will update this corpus.

We’ve also machine-aligned the text with Brenton’s English translation of the Septuagint. It’s possible there will be some misalignments. Thanks for your understanding!

Treebanks

We’ve added more documents to our separate gold-standard treebank corpus. (Want to learn more about treebanks?) In this corpus, the treebank/syntactic annotations have been manually corrected; the documents are part of the Universal Dependencies project for cross-language linguistics research. New treebanked documents include selections from 1 Corinthians, the Gospel of Mark, Shenoute’s Abraham Our Father, Shenoute’s Acephalous Work 22, and the Martyrdom of Victor. This means the self-standing treebank corpus is expanded, and any documents we’ve treebanked have updated word segmentation, part of speech tagging, etc., in their regular corpora.

Updated Shenoute Documents

Documents in the corpora for Shenoute’s Abraham Our Father and Acephalous Work 22 have several updates.

First, some documents are in our treebank corpus and are now significantly more accurate in terms of word segmentation, tagging, etc.

Second, we’ve added chapter and verse segmentation to these works. Since there are no comprehensive print editions of these works with versification, we’ve applied our own chapter and verse numbers. We recognize that versification is arbitrary, but nonetheless useful for citation. For texts transcribed from manuscripts, chapter divisions typically occur when an ekthesis occurs in the manuscript. (Ekthesis describes a letter hanging in the margin.) They do not necessarily occur with each ekthesis (if ekthesis is very frequent), but we try to make the divisions occur only with ekthesis. Verses typically equal one sentence, sometimes more than one sentence per verse for very short sentences or more than one verse per sentence for very long Shenoutean sentences.

Third, we’ve added “Order” metadata to make it easier to read a work in order if it’s broken into multiple documents. Check out Abraham Our Father, for example: the first document in the list is the beginning of the work.

Screen shot of list of documents in Abraham Our Father Corpus

When you’re reading through a document, click on “Next” to get the next document in reading order. (If there are multiple manuscript witnesses to a work, we’ll send you to the next document in order with the fewest lacunae, or missing segments.)

Screen shot of beginning of Abraham Our Father

Of course, you can always click on documents in any order you want to read however you like!

And everything is fully searchable across all documents in ANNIS.

New Metadata Fields Documenting Annotation

We sometimes get asked: which corpora do scholars annotate and which corpora are machine-annotated? The answer is complicated — almost everything is machine annotated, with different levels of scholarly review. So we’re adding three new metadata fields to help show users what kinds of annotation each document get:

Segmentation refers to word segmentation (or “tokenization”) performed by the tokenizer tool.
Tagging refers to part of speech, language of origin, and lemma tagging performed by our tagger
Parsing refers to dependency syntax annotations (which are part of our treebanking)

Each of these fields contains one of the following values:

automatic: fully machine annotated; no manual review or corrections to the tool output
checked: the tool has annotated the text, and a scholar has reviewed the annotations before publication
gold: the tools have been run and the annotations have received thorough review; this value usually applies only to documents that have been treebanked by a scholar (requiring rigorous review of word segmentation and tagging along the way)

For example, in the first image of document metadata visible in ANNIS, the document has automatic parsing; a scholar has checked the word segmentation and tagging.

Screenshot of document metadata showing checked word segmentation and tagging

In the next image of document metadata, a scholar has treebanked the text, making segmentation, tagging, and parsing all gold.

Screenshot of document metadata showing gold level annotations

We are rolling out these annotations with each new corpus and newly edited corpus; not every corpus has them, yet — only the ones in this release. Our New Testament and Old Testament corpora are machine-annotated (automatic) in all annotations.

We hope you enjoy!

We’re very excited to announce a new feature at Coptic SCRIPTORIUM. We’ve created a new online web application that we think will allow users to read and reference our material much more easily.

Users can read Coptic documents on HTML pages taken from the data visualizations. There are also easy links to our search tool ANNIS and to our GitHub repository for downloading files.

And we have a system of canonical URNS that provide persisent identifiers for documents, texts, authors, and text groups. This means you can cite our data in your scholarship, and then readers will be able to back to our site and find our most recent versions of the documents you have cited.

We’ve got a little video to introduce it, or dive right in at http://data.copticscriptorium.org.

This is a BETA release, which means you might see a few things that need to be ironed out. (For one thing, our small corpus of documentary papyri are not yet in the system — stay tuned, and in the meanwhile you can still read and query them in ANNIS.) We are pretty pleased with how it’s turning out and look forward to future developments.

Many thanks to Bridget Almas of the Perseus Digital Library for helping us develop a canonical referencing system, and to Archimedes Digital for implementing the application.

Coptic SCRIPTORIUM Blog

Tag: citation

Corpora release 2.6

Expanded Coptic Old Testament

Treebanks

Updated Shenoute Documents

New Metadata Fields Documenting Annotation

ANNIS embeds for websites and blogs

New web application to read documents, cite data, and access data (BETA release)

Recent Posts

Categories

Tags

Follow us on Twitter

Meta

Coptic SCRIPTORIUM Blog

Tag: citation

Corpora release 2.6

Expanded Coptic Old Testament

Treebanks

Updated Shenoute Documents

New Metadata Fields Documenting Annotation

Share this:

ANNIS embeds for websites and blogs

Share this:

New web application to read documents, cite data, and access data (BETA release)

Share this:

Recent Posts

Categories

Tags

Follow us on Twitter

Meta