Tag: Shenoute (page 1 of 2)

New Corpora Release 6.0.0

Searching for Greek loanwords in Bohairic Habakkuk

We are pleased to announce the release of version 6.0.0 of Coptic Scriptorium! Our corpus has been dramatically expanded in this release, now exceeding 2.2 million tokens of searchable, linguistically annotated Coptic texts. Among the highlights of this update is the major growth of our Bohairic corpus, now comprising approximately 750,000 words and featuring translated texts such as the Bohairic Bible (Old and New Testaments), as well as original works such as the Life of Isaac. This milestone brings substantial enhancements to our collections, including modern editions processed with Optical Character Recognition (OCR) technology alongside both new and updated Coptic texts.

New OCR Material and Automatic Tagging

This release includes the addition of OCR-based editions. For the first time, fully automated tagging has been applied to a selection of OCR datasets:

  1. Budge Texts
  2. Lacau, Acts of Pilate
  3. Lacau, Lament of Mary
  4. Sobhy, Martyrdom and Encomium of Helias

Version 6.0.0 also includes several newly curated corpora, reflecting a diversity of dialects, genres, and textual traditions:

  1. More selections now with parallel Arabic translations
  2. Pseudo-Theophilus
  3. Mercurius
  4. Additions to Shenoute of Atripe’s Acephalous Work 22 (A22)
  5. Bohairic texts
  6. Bohairic Bible selection manually segmented and tagged

Collaborative Efforts and Future Directions

We are grateful to our collaborators and contributors who have made this release possible, particularly Caroline T. Schroeder and Amir Zeldes, as well as Randy Komforty, Lydia Bremer-McCollum, Lawrence Rafferty, Nina Speranskaja, and Nicholas Wagner. We also want to thank the National Endowment for the Humanities for their ongoing support. The integration of OCR materials and the expansion of our Bohairic collection reflect ongoing efforts to enhance accessibility and analytical tools for Coptic studies. These advances also pave the way for further development of NLP tools for our users.

Accessing the Data

As with all our releases, the raw machine-readable data for all corpora—including morphological and syntactic annotations, as well as named entity recognition—are available in our GitHub repository. Data can be downloaded in a variety of popular formats to suit your research needs.

For advanced linguistic queries, you can explore the data using our ANNIS server. To help you get started, check out our tutorial with query tips and a convenient cheat sheet.

We invite you to explore this latest release and we look forward to your feedback!

New Corpora Release 4.4.0

Searching for Greek words in Shenoute’s So Concerning the Little Place

We are pleased to announce release 4.4.0 of Coptic Scriptorium! Our data now includes over 1,267,000 tokens of searchable, linguistically analyzed Coptic data from dozens of ancient Coptic works (an increase of almost 100,000 tokens from the previous release). We are very grateful to all of our collaborators and contributors, without whom this project could not function.

This release corrects a large number of consistency errors identified in our existing data, and also adds some new documents:

We would like to thank the Marcion Project for making the underlying digitized text of Pistis Sophia available, and all of the annotators for their hard work. Tamara Siuda, Rebecca Krawiec, Philippe Zaher, and Lance Martin contributed, in addition to Amir and Carrie. As our current DHAG grant ends, we would like to give special thanks to Lance, who has been working as our DH specialist on the project since 2019, for doing an amazing job of keeping track of all the data and the various tasks he’s been in charge of over the past three years!

As with all releases, raw machine-readable data for all corpora, including morphological and syntactic analysis as well as named entity recognition and entity linking, can be found in our GitHub repository in a variety of popular formats:

https://github.com/copticscriptorium/corpora

You can also search for complex linguistic annotations in the data using our ANNIS server – please see our new tutorial here to get started with some query tips and a helpful cheat sheet:

https://copticscriptorium.org/ANNIS_tutorial

We hope this release will be useful and look forward to the next one as always!

New Corpora Release 4.2.0

Automatic linguistic analysis and Entity Linking from I Samuel 25

It is our pleasure to announce the latest data release from Coptic Scriptorium, version 4.2.0. This release contains both new Coptic material and additions to older datasets, and it extends our entity annotations and named-entity linking to all of our data, including the semi-automatically annotated Old Testament. This also means automatic updates to all of our interfaces, such as the recently added example usage functionality in the Coptic Dictionary Online, which is linked to the corpora.

The new material draws on more digitized data courtesy of the Marcion project, as well as manually digitized and corrected OCR data from out-of-print editions, and includes:

With this new release, the semi-automatically annotated data (excluding automatically processed Bible materials) in the project covers close to 300,000 words of Sahidic Coptic annotated for entities.

This release represents a tremendous amount of work over the past few months by the Coptic Scriptorium team. We would also like to thank individual contributors (whom you can always find in the ‘annotation’ metadata for each document), and specifically So Miyagawa for help with Coptic OCR models, as well as the Marcion and CoptOT projects for sharing their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources and new kinds of annotations and tools. Please let us know if you have any feedback!

Winter 2020 Corpora Release 3.1.0

It is our pleasure to announce a new data release, with a variety of new sources from our collaborators (including more digitized data courtesy of the Marcion and PAThs projects and other scholars). New in this release are:

All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold).
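As a rough illustration of how these three review levels could be used to select documents, here is a minimal Python sketch. The metadata keys ("segmentation", "tagging", "parsing") and the sample records are assumptions made for illustration; they are not the project's actual export format.

```python
# Hypothetical ranking of the three review levels described above.
LEVELS = {"automatic": 0, "checked": 1, "gold": 2}


def meets_level(doc_meta, minimum="checked",
                fields=("segmentation", "tagging", "parsing")):
    """Return True if every listed annotation layer is at least `minimum`."""
    threshold = LEVELS[minimum]
    return all(LEVELS[doc_meta[field]] >= threshold for field in fields)


# Made-up sample records, for illustration only.
docs = [
    {"title": "doc_a", "segmentation": "gold",
     "tagging": "gold", "parsing": "gold"},
    {"title": "doc_b", "segmentation": "checked",
     "tagging": "checked", "parsing": "automatic"},
]

# Keep only documents whose three layers were all reviewed by a person.
reviewed = [d["title"] for d in docs if meets_level(d)]
print(reviewed)
```

Passing a shorter `fields` tuple relaxes the filter, for example when only segmentation and tagging matter for a given study.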

You can search all corpora using ANNIS and download the data in 4 formats (relANNIS database files, PAULA XML files, TEI XML files, and SGML files in TreeTagger format): browse on GitHub. If you just want to read works, cite project data or browse metadata, you can use our updated repository browser, the Canonical Text Services browser and URN resolver:

http://data.copticscriptorium.org/

The new material in this release includes some 78,000 tokens in 33 documents and represents a tremendous amount of work by our project members and collaborators. We would like to thank the individual contributors (which you can find in the ‘annotation’ metadata), the Marcion and PAThs projects who shared their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources and new kinds of annotations and tools, which we plan to make available in the summer. Please let us know if you have any feedback!

On the Road Summer 2019

Coptic Scriptorium is busy this summer conference season.

I had the privilege of teaching one of the Sunoikisis Digital Classicist summer sessions earlier in July.

The UCLA-St Shenouda Society conference participants, 2019

I also presented some research on girls and girlhood using the Coptic Scriptorium corpora and the Coptic Dictionary Online at the annual UCLA-St. Shenouda Society Coptic Studies Conference. This year was the 20th anniversary conference, and the theme was Shenoute and the White Monastery.

C. Schroeder presenting at ACH 2019; photo courtesy Melissa Dollman via Twitter

This week, the Association for Computers and the Humanities (ACH), the American digital humanities organization, held a conference in Pittsburgh. There I talked about colonialism, Coptic manuscripts, and resisting continuing colonialist tendencies in digitizing these manuscripts.

Meanwhile we’ve also been working on digitizing and annotating more texts, which we hope to release in the fall.

Happy summer everyone!

New corpora – release 2.4.0 is out!

We are pleased to announce release version 2.4.0, which adds new corpora. The tagged and lemmatized corpora are available for reading and download at [1], and are fully searchable at [2]:

[1] http://data.copticscriptorium.org/

[2] https://corpling.uis.georgetown.edu/annis/scriptorium

This release contains new data contributed by Alin Suciu, David Brakke and Diliana Atanassova, as well as out of copyright edition material contributed by the Marcion project. New data in this release includes excerpts from:

  • The Martyrdom of Saint Victor the General (2033 tokens)
  • The Canons of Apa Johannes (438 tokens)
  • Pseudo-Theophilus On the Cross and The Thief (2814 tokens)
  • Shenoute, Some Kinds of People Sift Dirt (888 tokens)
  • 11 additional Apophthegmata Patrum, bringing the total released to 63 apophthegms (7077 tokens)

All texts are also linked to the Coptic Dictionary Online (https://corpling.uis.georgetown.edu/coptic-dictionary/), which has been updated with frequency information including these texts. We would like to thank the annotators and translators of these data sets, several of whom are new to the project, without whose work the corpora would not be online:

Alexander Turtureanu, Alin Suciu, Amir Zeldes, Caroline T. Schroeder, Christine Luckritz Marquis, Dana Robinson, David Brakke, David Sriboonreuang, Diliana Atanassova, Elizabeth Davidson, Elizabeth Platte, Gianna Zipp, J. Gregory Given, Janet Timbie, Jennifer Quigley, Laura Slaughter, Lauren McDermott, Marina Ghaly, Mitchell Abrams, Paul Lufter, Rebecca Krawiec, Saskia Franck and Tobias Paul

We hope everyone will find this release useful and look forward to releasing more data in the coming year!


New Release of Corpora

We’re pleased to announce that we’ve released more texts in our corpora.

The Sayings of the Desert Fathers (Apophthegmata Patrum) corpus now contains 52 sayings/apophthegms (>7100 words). We have edited previously published sayings for consistency in annotation, and we’ve released new sayings edited by Christine Luckritz Marquis, Elizabeth Platte, and our newest contributor, Dana Robinson. Read or browse the Sayings online. Click on the “Analytic” button to read a saying in Coptic with a parallel English translation and part of speech tags for each Coptic word.

Or click on the “Norm” button (short for “normalized”) to read the Coptic.  Clicking on any Coptic word in the normalized visualization will take you to an online Coptic-English dictionary.  Hovering your cursor over a passage in the normalized visualization will show the English translation in a pop up window.

AP 96 Normalized view screenshot

Shenoute’s I See Your Eagerness now has numerous new manuscript fragments published (over 16,000 words). We have also edited previously published witnesses for consistency in annotation. These documents were transcribed and collated from the manuscripts by David Brakke and annotated for digital publication by Rebecca Krawiec. Now you can read Shenoute’s I See Your Eagerness nearly in its entirety in Coptic. We provide several paths for you to explore this text:

  1. Read the text from start to end, beginning with the first manuscript fragment. Click “NEXT” to keep reading.

    MONB.GL fragment D diplomatic visualization

    (No English translation is provided, but in the “Note” metadata field below the Coptic, you can find page numbers for David Brakke’s and Andrew Crislip’s translation in their book, Discourses of Shenoute.)  “Next” and “Previous” buttons will take you through the path we consider optimal for reading the text. This path wanders through various manuscript witnesses, following the path with the fewest lacunae. Want to see parallel witnesses? Check out the “Witness” metadata field below the text.

    MONB.GL 29-30 metadata screenshot

  2. Read through all surviving pages in one codex/manuscript witness by filtering for a particular codex. For example, if you want to read through all the fragments of codex MONB.GL, go to data.copticscriptorium.org, use the menu to filter by Corpus for the shenoute.eagerness corpus, and then filter by manuscript name for the MONB.GL codex. Click through the documents in that codex.
  3. Perform a search/query in our ANNIS database. For example, search for all occurrences of “wicked” (ⲡⲟⲛⲏⲣⲟⲛ) in the corpus. Or, search for occurrences of “wicked” controlling for duplicate hits in parallel manuscript witnesses. See our guide to queries in ANNIS for more tips.

You also can download the entire corpus in TEI XML, PAULA XML, and relANNIS formats  from our GitHub site.

December 2016 corpus release (v 2.2.0)

We are happy to release the following new and revised documents to our corpora.  A copy of the official release notes is below.  The data is available for download from GitHub in TEI XML, PAULA XML, and relANNIS formats.  The corpora can be viewed and accessed at data.copticscriptorium.org, and they all can  be queried in ANNIS. We plan for another release with more documents in March 2017.

As always:  if you have comments or corrections, please submit a pull request on GitHub or send us an email at contact [at] copticscriptorium [dot] org.

____

This corpus release includes new or revised documents for:

  • 1 Corinthians: machine and manual annotations; new documents are chapters 13-16; edits to already published chapters include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines
  • Mark: machine and manual annotations; edits to already published chapters include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines
  • Not Because a Fox Barks (Shenoute): machine and manual annotations; edits to already published document include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines
  • Besa letters: machine and manual annotations; edits to already published documents include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines

All other documents in our corpora are unchanged from the last release.

New metadata and corpus feature: We are beginning to add to our documents a metadata field called “order” which will allow us to present documents in a logical order for browsing or reading. We’ve implemented it in the Besa letters corpus and will roll it out for other corpora in the future. Our Document Retrieval web application (data.copticscriptorium.org) now lists the documents in the order in which they appear in the manuscript tradition, when you filter for that corpus. Thus, users who wish to read or browse the documents in that order can do so easily.
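To make the idea concrete, here is a minimal Python sketch of listing documents by an “order” field. It assumes the field holds an integer and that documents are plain dictionaries; the titles are invented, and this is not the code behind the Document Retrieval application.

```python
def reading_order(documents):
    """Sort documents by their 'order' metadata.

    Documents without an 'order' value are placed last and keep their
    original relative sequence (Python's sort is stable).
    """
    return sorted(
        documents,
        key=lambda doc: (doc.get("order") is None, doc.get("order") or 0),
    )


# Made-up records, for illustration only.
letters = [
    {"title": "letter_c", "order": 3},
    {"title": "letter_a", "order": 1},
    {"title": "fragment_x"},  # no 'order' assigned yet
    {"title": "letter_b", "order": 2},
]

ordered_titles = [doc["title"] for doc in reading_order(letters)]
print(ordered_titles)
```

Placing unordered documents at the end, rather than raising an error, reflects the gradual rollout described above, where only some corpora carry the field so far.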

Version control: We have set the version number on our document metadata, corpus metadata (in ANNIS), and release information (in GitHub) all to match. Version #s and dates are only revised when a document is revised. So if no documents in our AP corpus have been revised and republished, and no new documents for that corpus have been published, then the version # on the documents and corpus does not change. Only new and newly edited documents (and their corpora) will have version 2.2.0 and date 08 December 2016 in their metadata.

New feature + texts in our corpora: Apophthegmata, I See Your Eagerness

We are very excited to release new versions of two of our corpora in time for the Coptic Congress.  And keep reading to learn about a new feature on our website.

As usual, we provide a diplomatic transcription of the texts’ manuscripts, normalized text for ease of reading, and an analytic visualization with the normalized text and part of speech tags in our web application. Plus you’ll see buttons to search the corpora in our database or download our digital files.

Apophthegmata Patrum

The Apophthegmata Patrum now contains 36 published Sayings.  New ones include

This release also marks the first contributions of our newest editor, Dr. Dana Lampe. Dana earned her Ph.D. at the Catholic University of America and is beginning a postdoc at Creighton in the fall.

I See Your Eagerness

We also are releasing a huge new chunk of Shenoute’s sermon, I See Your Eagerness.  These texts were transcribed and collated primarily by David Brakke (with some by Stephen Emmel).  We thank David for his  generous donation of his transcriptions to the project!  Senior Editor Rebecca Krawiec has digitized and annotated these transcriptions.

Please begin your read of I See Your Eagerness with the fragment from codex MONB.GL 9-10.   Or you can search it in our search & visualization tool ANNIS.

We now have over 9000 words of this text digitized and annotated!

New: “Next” & “Previous” Buttons on Document visualizations

We’ve got a new feature in our web application:  the “next” and “previous” buttons near the top of the text.

“Next” is the next document for this work; if there is a lacuna, you’ll be taken to the next extant witness we’ve digitized.  If there are multiple, parallel witnesses, you’ll be taken to the witness we’ve identified as the best or clearest witness (typically based on the amount of lacunae).

The same is true for the “Previous” button.
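The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the stated criterion (prefer the witness with the fewest lacunae); the actual choice of best witness is an editorial decision, and the record layout and witness names below are hypothetical.

```python
def next_document(candidates):
    """Pick the document to show next among parallel witnesses.

    `candidates` is a list of dicts with a 'lacunae' count (an assumed
    field). The witness with the fewest lacunae wins; ties go to the
    first one listed. Returns None when no witness is available.
    """
    if not candidates:
        return None
    return min(candidates, key=lambda witness: witness["lacunae"])


# Made-up parallel witnesses for the same passage.
parallel = [
    {"urn": "witness_A", "lacunae": 5},
    {"urn": "witness_B", "lacunae": 2},
]

chosen = next_document(parallel)
print(chosen["urn"])
```

The same function could serve a “Previous” button by passing it the witnesses for the preceding passage instead.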

If you want to review the parallel witness(es), check out the metadata field for each document called “witness.” If a parallel witness exists, it will be listed; if we have digitized the witness, the URN for the witness will be listed. You can enter the URN in the box at the top of our website to retrieve the document.

New born-digital edition of a Shenoute fragment

This winter we’ve released a new document we’ve been working on for a while.  It’s a born digital publication, in the sense that this document to our knowledge has never been published previously.  The edition and annotations here were produced by Elizabeth Platte (Reed College) and Rebecca S. Krawiec (Canisius College) directly from digital photographs of the manuscript for digital publication.

Read the manuscript transcription or the  normalized text, or query it in our database.

It’s a section of one of Shenoute’s texts for monks in volume three of his monastic Canons.  This 14-page (seven-folio) fragment now resides in the Bibliothèque Nationale in Paris and originally derives from the White Monastery codex known by the siglum MONB.YB.  We’ve released text and annotations for pages 307-320, which equate to the BN call number Ms Copte 130/2 ff. 51-57.  Digital photos are now available online at Gallica.

We’ve transcribed the text from images of the manuscript and then annotated it for manuscript information.  We’ve also broken the text down into the Coptic phrases known as “bound groups,” words, and morphs.  Then we’ve annotated it all for part of speech, loan words (Greek, Latin, etc.), and lemmas.

By “we” I mean primarily Platte and Krawiec. Schroeder and Zeldes provided editorial review, as per our policy of having every published digital document reviewed by at least one editor.

As far as we know, this fragment has never been published; nor has any translation ever been published.  We don’t have a translation yet, either.

As the first born-digital edition, this document is an experiment for us.  Everything else we’ve worked with has been published in an edition, and sometimes even has an English translation that another scholar has published.  Even though we digitize from the original manuscript, previous editions and translations make the transcription, annotation, and editing process much easier.  This document is an unknown quantity.

This means we expect to have errors and welcome feedback on the document.

We also have no translation as of yet.  Our goal is to translate the document and then edit the transcription and annotations again as we work.  We hope to publish an essay on how the digital annotation process affected the creation of an edition.

In the meantime, use it to practice your Coptic.  Let us know if you find errors.  We’ll credit you.
