ANNIS – Coptic SCRIPTORIUM Blog

Tag: ANNIS (page 1 of 3)

Latest Release of Coptic Corpora

We are pleased to announce the release of version 6.2.0 of the Coptic Scriptorium data! Our corpora now total 2,375,875 words. This release provides significant new annotated data in both the Bohairic and Sahidic dialects:

New parts of works in Bohairic, including:
- The Life of Shenoute, Parts 2 & 3 (part 1 was released in September)
- The Lausiac History, Parts 2 & 3 (part 1 was released in September)
New corpora
- The Gospel of Thomas, edited from the manuscript by Paul Dilley
- The Sahidic book of Jonah (with manual edits and corrections to NLP annotations by Stephan Claassen; the automatically processed Jonah is in the Coptic OT corpus)
New documents in the following existing corpora:
- Apophthegmata Patrum
- Shenoute’s work known as Acephalous Work 22
More Arabic and English translations for documents previously published

We are grateful to our collaborators and contributors who have made this release possible, particularly Caroline T. Schroeder, Amir Zeldes, Nicholas Wagner, and Paul Dilley as well as Nina Speranskaja, Rebecca Krawiec, Christine Luckritz Marquis, Stephan Claassen, Philippe Zaher, and Safaa Mahfouz. We also want to thank Hany Takla and the St. Shenouda the Archimandrite Coptic Society for their collaborations and support. Additionally, we thank our donors for contributions that made much of the work on this release possible. Please consider supporting Coptic Scriptorium as we navigate the new funding environment in the USA.

As with all our releases, the raw machine-readable data for all corpora—including morphological and syntactic annotations, as well as named entity recognition—are available in our GitHub repository. Data can be downloaded in a variety of popular formats to suit your research needs.

You can read and browse entire documents in an online portal. Our corpora are also linked in entries on the Coptic Dictionary Online.

For searching, including advanced linguistic queries, you can explore the data using our ANNIS server. To help you get started, check out our tutorial with query tips and a convenient cheat sheet. Currently, the Arabic translations are available only in ANNIS.

grey and white database interface with Coptic text

New Webinar Video on Searching Our Database Now Online

July 16, 2024 / ctschroeder / 0 Comments

Earlier today, the Coptic Scriptorium project hosted an online workshop/webinar on searching text and annotations in our database (ANNIS). The video is now on YouTube. The cheat-sheet with an online tutorial that Dr. Zeldes shows in this video is on our website.

Webinar/online workshop on how to search the Coptic Scriptorium database (ANNIS)

If you watch the video, we’d also appreciate your feedback in this brief survey.

We thank the National Endowment for the Humanities, the University of Oklahoma, and Georgetown University for supporting the project and this workshop.

New links for tools and services

August 5, 2022 / ctschroeder / 0 Comments

After our recent server outage, we’ve been re-installing our tools and software. Some of our services are now available at new URLs.

The ANNIS database is now at https://annis.copticscriptorium.org/annis/scriptorium

Our Sahidic Coptic natural language processing tools are at https://tools.copticscriptorium.org/coptic-nlp

Our GitDox annotation tool is at https://tools.copticscriptorium.org/gitdox/scriptorium

The Coptic Dictionary online is still at https://coptic-dictionary.org, and our tool for browsing and reading texts is still at https://data.copticscriptorium.org

Thanks for your patience!

Coptic Dictionary and ANNIS database down

June 19, 2022 / ctschroeder / 0 Comments

We are sorry to report that the server that hosts the Coptic Dictionary Online and Coptic Scriptorium’s ANNIS database is down. (Likewise, some of the NLP tools and internal tools like GitDox are down.)

We are working on fixing the problem, but for now we do not have a timeline for when they will be up and running.

In the meantime, reading and browsing texts at http://data.copticscriptorium.org still works.

Thank you for your patience! We will let you know when the systems are up again.

Screen Shot of Coptic Old Testament Documents

Corpora release 2.6

January 10, 2019 / ctschroeder / 0 Comments

We are pleased to announce release 2.6 of our corpora! Some exciting new things:

Expanded Coptic Old Testament
More gold-standard treebanked texts
Updated files of Shenoute’s Abraham Our Father and Acephalous Work 22
New metadata fields to indicate whether documents have been machine annotated or if an editor has reviewed the machine annotations

Expanded Coptic Old Testament

Our Coptic Old Testament corpus is updated and expanded, with digital text from our partners at the Digital Edition of the Coptic Old Testament project in Goettingen. All annotations in this corpus are fully machine-processed (no human editing, because it’s BIG). You can read through all the text in two different visualizations online and search it in the ANNIS database:

analytic: the normalized text segmented into words aligned with part of speech tags; each verse is aligned with Brenton’s English translation of the Septuagint
chapter: the normalized text presented as chapters and verses; each word links to the online Coptic dictionary
ANNIS search: full search of text, lemmas, parts of speech, syntactic annotations, etc. (see our ANNIS tips if you’re new to ANNIS)

Please keep in mind this corpus is fully machine-annotated, and we currently do not have the capacity to make manual changes to a corpus of this size. If you notice systemic errors (the same thing tagged incorrectly often, for example) please let us know. Otherwise, please be patient: as the tools improve, we will update this corpus.

We’ve also machine-aligned the text with Brenton’s English translation of the Septuagint. It’s possible there will be some misalignments. Thanks for your understanding!

Treebanks

We’ve added more documents to our separate gold-standard treebank corpus. (Want to learn more about treebanks?) In this corpus, the treebank/syntactic annotations have been manually corrected; the documents are part of the Universal Dependencies project for cross-language linguistics research. New treebanked documents include selections from 1 Corinthians, the Gospel of Mark, Shenoute’s Abraham Our Father, Shenoute’s Acephalous Work 22, and the Martyrdom of Victor. This means the self-standing treebank corpus is expanded, and any documents we’ve treebanked have updated word segmentation, part of speech tagging, etc., in their regular corpora.

Updated Shenoute Documents

Documents in the corpora for Shenoute’s Abraham Our Father and Acephalous Work 22 have several updates.

First, some documents are in our treebank corpus and are now significantly more accurate in terms of word segmentation, tagging, etc.

Second, we’ve added chapter and verse segmentation to these works. Since there are no comprehensive print editions of these works with versification, we’ve applied our own chapter and verse numbers. We recognize that versification is arbitrary, but nonetheless useful for citation. For texts transcribed from manuscripts, chapter divisions typically occur when an ekthesis occurs in the manuscript. (Ekthesis describes a letter hanging in the margin.) They do not necessarily occur with each ekthesis (if ekthesis is very frequent), but we try to make the divisions occur only with ekthesis. Verses typically equal one sentence, sometimes more than one sentence per verse for very short sentences or more than one verse per sentence for very long Shenoutean sentences.

Third, we’ve added “Order” metadata to make it easier to read a work in order if it’s broken into multiple documents. Check out Abraham Our Father, for example: the first document in the list is the beginning of the work.

Screen shot of list of documents in Abraham Our Father Corpus

When you’re reading through a document, click on “Next” to get the next document in reading order. (If there are multiple manuscript witnesses to a work, we’ll send you to the next document in order with the fewest lacunae, or missing segments.)

Screen shot of beginning of Abraham Our Father
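
For readers working with the downloaded data, the reading-order idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Doc` class, its `lacunae` count, and the sample titles are hypothetical and are not the names used in our metadata or in the code behind the “Next” button.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str    # hypothetical document name
    order: int    # position of this segment in the work (the "Order" idea)
    lacunae: int  # hypothetical count of missing segments in this witness


def reading_path(docs):
    """Return one document per order value, preferring the witness with the fewest lacunae."""
    best = {}
    for doc in docs:
        current = best.get(doc.order)
        if current is None or doc.lacunae < current.lacunae:
            best[doc.order] = doc
    # Walk the chosen documents in reading order.
    return [best[position] for position in sorted(best)]


# Toy data: two witnesses overlap at position 2.
docs = [
    Doc("witness-B part 2", 2, 5),
    Doc("witness-A part 1", 1, 0),
    Doc("witness-A part 2", 2, 1),
    Doc("witness-B part 3", 3, 2),
]

for doc in reading_path(docs):
    print(doc.order, doc.title)
```

Where two witnesses cover the same position, the sketch keeps the one with fewer gaps, which is the behavior described above for “Next”.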

Of course, you can always click on documents in any order you want to read however you like!

And everything is fully searchable across all documents in ANNIS.

New Metadata Fields Documenting Annotation

We sometimes get asked: which corpora do scholars annotate and which corpora are machine-annotated? The answer is complicated: almost everything is machine annotated, with different levels of scholarly review. So we’re adding three new metadata fields to help show users what kinds of annotation each document gets:

Segmentation refers to word segmentation (or “tokenization”) performed by the tokenizer tool.
Tagging refers to part of speech, language of origin, and lemma tagging performed by our tagger.
Parsing refers to dependency syntax annotations (which are part of our treebanking).

Each of these fields contains one of the following values:

automatic: fully machine annotated; no manual review or corrections to the tool output
checked: the tool has annotated the text, and a scholar has reviewed the annotations before publication
gold: the tools have been run and the annotations have received thorough review; this value usually applies only to documents that have been treebanked by a scholar (requiring rigorous review of word segmentation and tagging along the way)

For example, in the first image of document metadata visible in ANNIS, the document has automatic parsing; a scholar has checked the word segmentation and tagging.

Screenshot of document metadata showing checked word segmentation and tagging

In the next image of document metadata, a scholar has treebanked the text, making segmentation, tagging, and parsing all gold.

Screenshot of document metadata showing gold level annotations
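
If you export document metadata and want to check these fields programmatically, a small Python sketch along the following lines may help. It assumes the metadata is available as a plain dictionary and that the keys are spelled in lowercase; both are assumptions made for illustration, so adjust the names to match your export.

```python
# The three fields and three values described above; the lowercase key
# spelling is an assumption, not a guarantee about the exported metadata.
FIELDS = ("segmentation", "tagging", "parsing")
LEVELS = {"automatic", "checked", "gold"}


def validate(meta):
    """Raise ValueError if a field is missing or holds an unexpected value."""
    for field in FIELDS:
        value = meta.get(field)
        if value not in LEVELS:
            raise ValueError(f"{field}: unexpected value {value!r}")


def machine_only_fields(meta):
    """List the annotation layers that no scholar has reviewed."""
    validate(meta)
    return [field for field in FIELDS if meta[field] == "automatic"]


# The two example documents described above, written as toy dictionaries.
checked_doc = {"segmentation": "checked", "tagging": "checked", "parsing": "automatic"}
gold_doc = {"segmentation": "gold", "tagging": "gold", "parsing": "gold"}

print(machine_only_fields(checked_doc))
print(machine_only_fields(gold_doc))
```

For the first document only the parsing layer is reported as unreviewed; for the treebanked document the list is empty.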

We are rolling out these annotations with each new corpus and newly edited corpus; not every corpus has them, yet — only the ones in this release. Our New Testament and Old Testament corpora are machine-annotated (automatic) in all annotations.

We hope you enjoy!

New release of the Coptic Treebank

August 26, 2017 / Amir Zeldes / 0 Comments

Coptic Treebank release 2.1, now with three Letters of Besa!

We are pleased to announce the release of the latest version of the Coptic Treebank, now containing three Letters of Besa:

On Lack of Food
To Aphthonia
To Thieving Nuns

This brings the total corpus size up to 10,499 tokens, thanks to annotation work by Elizabeth Davidson and Amir Zeldes, building on earlier transcription and tagging work by Coptic Scriptorium and KELLIA partners. Special thanks are due to So Miyagawa for providing the transcription for On Lack of Food. The corpus will continue to grow as we work to annotate more data and improve the accuracy of our automatic syntax parser for Coptic. You can search the current version of the corpus in ANNIS here:

https://corpling.uis.georgetown.edu/annis/scriptorium

Or download the latest raw annotated data from GitHub here:

https://github.com/universalDependencies/UD_Coptic/tree/dev

Please let us know if you find any errors or have any feedback on the treebank!

Old Testament corpus release

June 12, 2017 / Amir Zeldes / 0 Comments

We are happy to announce the release of the automatically annotated Sahidic Old Testament corpus (corpus identifier: sahidic.ot), based on the version of the available texts kindly provided by the CrossWire Bible Society SWORD Project thanks to work by Christian Askeland, Matthias Schulz and Troy Griffitts.

The corpus is available for search in ANNIS, much like the Sahidica New Testament corpus, together with word segmentation, morphological analysis, language of origin for loanwords, part of speech tagging and automatically aligned verse translations (except for parts of Jeremiah). Please expect some errors, due to the fully automatic analysis in the corpus. The aligned translation is taken from the World English Bible. Here is an example search for the word ‘soul’:

norm="ⲯⲩⲭⲏ"

You can also read entire chapters in ANNIS or at our repository, which look like this:

urn:cts:copticLit:ot.gen.crosswire:09
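
If you handle identifiers like this one in your own scripts, the following Python sketch splits a CTS URN into its parts. It follows the general CTS URN layout (namespace, then a dotted work component, then an optional passage); the `CtsUrn` and `parse_cts_urn` names are our own for this illustration, and reading the three dotted levels as text group, work, and version is an assumption about how this particular URN is meant.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]
    passage: Optional[str]


def parse_cts_urn(urn):
    """Split 'urn:cts:NAMESPACE:WORK[:PASSAGE]' into named parts."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_component = parts[2], parts[3]
    passage = parts[4] if len(parts) == 5 else None
    # The work component is dotted: textgroup.work.version
    levels = work_component.split(".")
    levels += [None] * (3 - len(levels))  # pad when work or version is absent
    textgroup, work, version = levels[:3]
    return CtsUrn(namespace, textgroup, work, version, passage)


print(parse_cts_urn("urn:cts:copticLit:ot.gen.crosswire:09"))
```

Read this way, the example URN points to passage 09 of the work `gen` in the `ot` group, in the `crosswire` version.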

We hope that this resource will be helpful to Coptic scholars – please let us know if you have any questions or comments!

New Tutorials & Recent Workshop Wrap-up

June 8, 2017 / ctschroeder / 0 Comments

Coptic Scriptorium team members Caroline T. Schroeder and Rebecca Krawiec recently led a workshop on Digital Corpora and Digital Editions at the North American Patristics Society annual meeting. We created detailed tutorials useful to both beginners and more advanced users on our GitHub site. These tutorials cover:

an introduction to digital editions and corpora
working with the online Coptic Dictionary
simple and complex searching of Coptic literature in our database ANNIS
creating a digital corpus with Epidoc TEI-XML annotations and natural language processing

We invite everyone to use these tutorials on their own. They’re designed for self-paced work.

We were pleased to participate in the pre-conference Digital Humanities workshops that included another session on mapping led by Sarah Bond and Jennifer Barry. We had attendees from four countries, who ranged in their careers from graduate students to senior professors. Thanks to NAPS for hosting these workshops, and to the NEH and the DFG for making our work possible.

New Release of Corpora

April 25, 2017 / ctschroeder / 0 Comments

We’re pleased to announce that we’ve released more texts in our corpora.

The Sayings of the Desert Fathers (Apophthegmata Patrum) corpus now contains 52 sayings/apophthegms (>7100 words). We have edited previously published sayings for consistency in annotation, and we’ve released new sayings edited by Christine Luckritz Marquis, Elizabeth Platte, and our newest contributor, Dana Robinson. Read or browse the Sayings online. Click on the “Analytic” button to read a saying in Coptic with a parallel English translation and part of speech tags for each Coptic word.

Or click on the “Norm” button (short for “normalized”) to read the Coptic. Clicking on any Coptic word in the normalized visualization will take you to an online Coptic-English dictionary. Hovering your cursor over a passage in the normalized visualization will show the English translation in a pop up window.

AP 96 Normalized view screenshot

Shenoute’s I See Your Eagerness now has numerous new manuscript fragments published (over 16,000 words). We also have edited previously published witnesses for consistency in annotation. These documents were transcribed and collated from the manuscripts by David Brakke and annotated for digital publication by Rebecca Krawiec. Now you can read Shenoute’s I See Your Eagerness nearly in its entirety in Coptic. We provide several paths for you to explore this text:

Read the text from start to end, beginning with the first manuscript fragment. Click “NEXT” to keep reading.
MONB.GL fragment D diplomatic visualization

(No English translation is provided, but in the “Note” metadata field below the Coptic, you can find page numbers for David Brakke’s and Andrew Crislip’s translation in their book, Discourses of Shenoute.) “Next” and “Previous” buttons will take you through the path we consider optimal for reading the text. This path wanders through various manuscript witnesses, following the path with the fewest lacunae. Want to see parallel witnesses? Check out the “Witness” metadata field below the text.

MONB.GL 29-30 metadata screenshot
Read through all surviving pages in one codex/manuscript witness by filtering for a particular codex. For example, if you want to read through all the fragments of codex MONB.GL, go to data.copticscriptorium.org, use the menu to filter by Corpus for the shenoute.eagerness corpus, and then filter by manuscript name for the MONB.GL codex. Click through the documents in that codex.
Perform a search/query in our ANNIS database. For example, search for all occurrences of “wicked” (ⲡⲟⲛⲏⲣⲟⲛ) in the corpus. Or, search for occurrences of “wicked” controlling for duplicate hits in parallel manuscript witnesses. See our guide to queries in ANNIS for more tips.

You also can download the entire corpus in TEI XML, PAULA XML, and relANNIS formats from our GitHub site.

New release – Coptic Treebank V2

March 16, 2017 / Amir Zeldes / 0 Comments

We are happy to announce the release of version 2 of the Coptic Universal Dependency Treebank. With over 8,500 tokens from 14 documents, the Treebank is the largest syntactically annotated resource in Coptic. The annotation scheme follows the Universal Dependency Guidelines, version 2, and is therefore comparable with UD data from 70 treebanks in 50 languages, including English, Latin, Classical Greek, Arabic, Hebrew and more.

You can search in the Treebank using ANNIS. For example, the following query finds cases of verbs dominating a complement clause (e.g. “say …. that …”):

pos="V" ->dep[func="ccomp"] norm

[Link to this query]
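
If you prefer to work offline with the raw treebank files, which use the CoNLL-U format of the Universal Dependencies project, a rough Python counterpart to the query above might look like the sketch below. It is not our tooling: the English sample sentence is a toy stand-in rather than treebank data, and matching on the universal tag `VERB` instead of the Scriptorium tag `V` is an assumption you may need to adjust.

```python
def parse_conllu(text):
    """Parse one CoNLL-U sentence into a dict keyed by integer token id."""
    tokens = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        tokens[int(cols[0])] = {
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        }
    return tokens


def verbs_with_ccomp(tokens):
    """Return (verb, complement head) pairs linked by a ccomp relation."""
    pairs = []
    for tok in tokens.values():
        head = tokens.get(tok["head"])
        if tok["deprel"] == "ccomp" and head and head["upos"] == "VERB":
            pairs.append((head["form"], tok["form"]))
    return pairs


# Toy sentence: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
ROWS = [
    ("1", "she", "she", "PRON", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "said", "say", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "that", "that", "SCONJ", "_", "_", "5", "mark", "_", "_"),
    ("4", "he", "he", "PRON", "_", "_", "5", "nsubj", "_", "_"),
    ("5", "came", "come", "VERB", "_", "_", "2", "ccomp", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

print(verbs_with_ccomp(parse_conllu(SAMPLE)))
```

For real files you would split the input on blank lines and run the two functions on each sentence in turn.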

More about the Treebank:
https://corpling.uis.georgetown.edu/coptic-treebank/
More about Universal Dependencies:
http://universaldependencies.org/
