Tag: ANNIS (page 1 of 2)

New Release of Corpora

We’re pleased to announce that we’ve released more texts in our corpora.

The Sayings of the Desert Fathers (Apophthegmata Patrum) corpus now contains 52 sayings/apophthegms (>7100 words). We have edited previously published sayings for consistency in annotation, and we’ve released new sayings edited by Christine Luckritz Marquis, Elizabeth Platte, and our newest contributor, Dana Robinson. Read or browse the Sayings online. Click on the “Analytic” button to read a saying in Coptic with a parallel English translation and part of speech tags for each Coptic word.

Or click on the “Norm” button (short for “normalized”) to read the Coptic. Clicking on any Coptic word in the normalized visualization will take you to an online Coptic-English dictionary. Hovering your cursor over a passage in the normalized visualization will show the English translation in a pop-up window.

AP 96 Normalized view screenshot

Shenoute’s I See Your Eagerness now has numerous new manuscript fragments published (over 16,000 words). We have also edited previously published witnesses for consistency in annotation. These documents were transcribed and collated from the manuscripts by David Brakke and annotated for digital publication by Rebecca Krawiec. Now you can read Shenoute’s I See Your Eagerness nearly in its entirety in Coptic. We provide several paths for you to explore this text:

  1. Read the text from start to end, beginning with the first manuscript fragment. Click “NEXT” to keep reading.
    MONB.GL fragment D diplomatic visualization

    (No English translation is provided, but in the “Note” metadata field below the Coptic, you can find page numbers for David Brakke’s and Andrew Crislip’s translation in their book, Discourses of Shenoute.)  “Next” and “Previous” buttons will take you through the path we consider optimal for reading the text. This path wanders through various manuscript witnesses, following the path with the fewest lacunae. Want to see parallel witnesses? Check out the “Witness” metadata field below the text.

    MONB.GL 29-30 metadata screenshot

  2. Read through all surviving pages in one codex/manuscript witness by filtering for a particular codex. For example, if you want to read through all the fragments of codex MONB.GL, go to data.copticscriptorium.org, use the menu to filter by Corpus for the shenoute.eagerness corpus, and then filter by manuscript name for the MONB.GL codex. Click through the documents in that codex.
  3. Perform a search/query in our ANNIS database.   For example, search for all occurrences of “wicked” (ⲡⲟⲛⲏⲣⲟⲛ) in the corpus.  Or, search for occurrences of “wicked” controlling for duplicate hits in parallel manuscript witnesses.  See our guide to queries in ANNIS  for more tips.

You also can download the entire corpus in TEI XML, PAULA XML, and relANNIS formats  from our GitHub site.

New release – Coptic Treebank V2

We are happy to announce the release of version 2 of the Coptic Universal Dependency Treebank. With over 8,500 tokens from 14 documents, the Treebank is the largest syntactically annotated resource in Coptic. The annotation scheme follows the Universal Dependency Guidelines, version 2, and is therefore comparable with UD data from 70 treebanks in 50 languages, including English, Latin, Classical Greek, Arabic, Hebrew and more.

You can search in the Treebank using ANNIS. For example, the following query finds cases of verbs dominating a complement clause (e.g. “say …. that …”):

pos="V" ->dep[func="ccomp"] norm

[Link to this query]
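
To make the shape of that query concrete, here is a minimal Python sketch of the same idea on toy data: find tokens tagged as verbs that have an outgoing dependency edge labelled ccomp. The Token and DepEdge structures, the tag names other than V, and the English sample sentence are all invented for illustration; this is not the ANNIS data model or API, only a way to picture what the query asks for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    idx: int    # position in the toy sentence
    norm: str   # normalized word form
    pos: str    # part of speech tag


@dataclass(frozen=True)
class DepEdge:
    head: int       # index of the governing token
    dependent: int  # index of the governed token
    func: str       # dependency label, e.g. "ccomp"


def find_verbs_with_ccomp(tokens, edges):
    """Return (head, dependent) pairs where a verb governs a ccomp edge."""
    by_idx = {t.idx: t for t in tokens}
    hits = []
    for edge in edges:
        head = by_idx[edge.head]
        if head.pos == "V" and edge.func == "ccomp":
            hits.append((head, by_idx[edge.dependent]))
    return hits


# Toy sentence (hypothetical data): "she said that he came"
tokens = [
    Token(0, "she", "PPER"),
    Token(1, "said", "V"),
    Token(2, "that", "CONJ"),
    Token(3, "he", "PPER"),
    Token(4, "came", "V"),
]
edges = [
    DepEdge(1, 0, "nsubj"),
    DepEdge(1, 4, "ccomp"),
    DepEdge(4, 2, "mark"),
    DepEdge(4, 3, "nsubj"),
]

for head, dependent in find_verbs_with_ccomp(tokens, edges):
    print(head.norm, "->", dependent.norm)
```

Only the edge from “said” to “came” qualifies here; the other edges either carry a different label or are not complement clauses.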

December 2016 corpus release (v 2.2.0)

We are happy to release the following new and revised documents to our corpora.  A copy of the official release notes is below.  The data is available for download from GitHub in TEI XML, PAULA XML, and relANNIS formats.  The corpora can be viewed and accessed at data.copticscriptorium.org, and they all can  be queried in ANNIS. We plan for another release with more documents in March 2017.

As always:  if you have comments or corrections, please submit a pull request on GitHub or send us an email at contact [at] copticscriptorium [dot] org.

____

This corpus release includes new or revised documents for:

  • 1 Corinthians: machine and manual annotations; new documents are chapters 13-16; edits to already published chapters include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines
  • Mark: machine and manual annotations; edits to already published chapters include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines
  • Not Because a Fox Barks (Shenoute): machine and manual annotations; edits to already published document include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines
  • Besa letters: machine and manual annotations; edits to already published documents include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines

All other documents in our corpora are unchanged from the last release.

New metadata and corpus feature: We are beginning to add to our documents a metadata field called “order” which will allow us to present documents in a logical order for browsing or reading. We’ve implemented it in the Besa letters corpus and will roll it out for other corpora in the future. Our Document Retrieval web application (data.copticscriptorium.org) now lists the documents in the order in which they appear in the manuscript tradition, when you filter for that corpus. Thus, users who wish to read or browse the documents in that order can do so easily.

Version control: We have set the version number on our document metadata, corpus metadata (in ANNIS), and release information (in GitHub) all to match. Version #s and dates are only revised when a document is revised. So if no documents in our AP corpus have been revised and republished, and no new documents for that corpus have been published, then the version # on the documents and corpus does not change. Only new and newly edited documents (and their corpora) will have version 2.2.0 and date 08 December 2016 in their metadata.

Server updates – part of site down tonight

We are making some updates to the document application at data.copticscriptorium.org tonight (13 December 2016) approximately 7:30-8:30 pm Pacific time/10:30-11:30 Eastern time.  The service may be down.

You can still query and access our corpora in the ANNIS database at https://corpling.uis.georgetown.edu/annis/scriptorium .  That service will not be affected.  Thanks!

Online lexicon linked to our corpora!

We have a great announcement today.  Along with our German research partners as part of the KELLIA project, we are releasing an online Coptic lexicon linked to our corpora.

For over three years, the Berlin-Brandenburg Academy of Sciences has been working on a digital lexicon for Coptic. Frank Feder began the work, encoding definitions for Coptic lemmas in three languages: English, French, and German. The final entries were completed by Maxim Kupreyev at the academy and Julien Delhez in Göttingen. The base lexicon file is encoded in TEI-XML. This summer Amir Zeldes and his student, Emma Manning, created a web interface. We will release the source code soon as part of the KELLIA project.

It may still need some refinements and updates, but we think it is a useful achievement that will help anyone interested in Coptic.

Entries have definitions in French, German, and English.

You can use the lexicon as a standalone website. For the pilot launch, it’s on the Georgetown server, but make no mistake, this is a major research outcome for the BBAW.

We’ve also linked the dictionary to our texts in Coptic SCRIPTORIUM.  You can click on the ANNIS icon in the dictionary entry to search all corpora in Coptic SCRIPTORIUM for that word.

The link also goes in the other direction. In the normalized visualization of our texts, you can click on a word and get taken to the entry for that word’s lemma in the dictionary. You can do this in the normalized visualization in our web application for reading and accessing texts (pictured below), or in the normalized visualization embedded in the ANNIS tool.

Screen Shot 2016-07-28 at 10.22.39 AM

Of course there will be refinements and developments to come.  We would love to hear your feedback on what works, what could work better, and where you find glitches.

On a more personal note, when Amir and I first came up with the idea for the project, we dreamed of creating a Perseus Digital Library for Coptic.  This dictionary is a huge step forward.  And honestly, I myself had almost nothing to do with this piece of the project.  It’s an example of the importance and power of collaboration.

New feature + texts in our corpora: Apophthegmata, I See Your Eagerness

We are very excited to release new versions of two of our corpora in time for the Coptic Congress.  And keep reading to learn about a new feature on our website.

As usual, we provide a diplomatic transcription of the texts’ manuscripts, normalized text for ease of reading, and an analytic visualization with the normalized text and part of speech tags in our web application. Plus you’ll see buttons to search the corpora in our database or download our digital files.

Apophthegmata Patrum

The Apophthegmata Patrum now contains 36 published Sayings.  New ones include

This release also marks the first contributions of our newest editor, Dr. Dana Lampe. Dana earned her Ph.D. at the Catholic University of America and is beginning a postdoc at Creighton in the fall.

I See Your Eagerness

We also are releasing a huge new chunk of Shenoute’s sermon, I See Your Eagerness.  These texts were transcribed and collated primarily by David Brakke (with some by Stephen Emmel).  We thank David for his  generous donation of his transcriptions to the project!  Senior Editor Rebecca Krawiec has digitized and annotated these transcriptions.

Please begin your read of I See Your Eagerness with the fragment from codex MONB.GL 9-10.   Or you can search it in our search & visualization tool ANNIS.

We now have over 9000 words of this text digitized and annotated!

New: “Next” & “Previous” Buttons on Document visualizations

We’ve got a new feature in our web application:  the “next” and “previous” buttons near the top of the text.

“Next” is the next document for this work; if there is a lacuna, you’ll be taken to the next extant witness we’ve digitized. If there are multiple, parallel witnesses, you’ll be taken to the witness we’ve identified as the best or clearest witness (typically based on the number of lacunae).

The same is true for the “Previous” button.

If you want to review the parallel witness(es), check out the metadata field for each document called “witness.” If a parallel witness exists, it will be listed; if we have digitized the witness, the URN for the witness will be listed. You can enter the URN in the box at the top of our website to retrieve the document.
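
As a rough sketch of the rule described above (not the web application’s actual code), the choice of the “next” and “previous” document can be pictured like this in Python. The Document fields position and lacunae, and the witness names, are hypothetical stand-ins: position marks where a passage falls in the work, and lacunae counts the gaps in that witness.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Document:
    name: str      # hypothetical label, not a real URN
    position: int  # hypothetical: where the passage falls in the work
    lacunae: int   # hypothetical: number of gaps in this witness


def _pick(current: Document, documents: List[Document], forward: bool) -> Optional[Document]:
    """Find the nearest extant passage in one direction, preferring fewer lacunae."""
    if forward:
        pool = [d for d in documents if d.position > current.position]
        target = min((d.position for d in pool), default=None)
    else:
        pool = [d for d in documents if d.position < current.position]
        target = max((d.position for d in pool), default=None)
    if target is None:
        return None  # nothing further in this direction
    # Parallel witnesses share a position; take the one with the fewest gaps.
    candidates = [d for d in pool if d.position == target]
    return min(candidates, key=lambda d: d.lacunae)


def next_document(current, documents):
    return _pick(current, documents, forward=True)


def previous_document(current, documents):
    return _pick(current, documents, forward=False)


# Toy data: position 2 is lost, and position 3 survives in two witnesses.
docs = [
    Document("witness-A-p1", position=1, lacunae=0),
    Document("witness-A-p3", position=3, lacunae=5),
    Document("witness-B-p3", position=3, lacunae=2),
    Document("witness-B-p4", position=4, lacunae=1),
]

print(next_document(docs[0], docs).name)
```

In this toy case the reader skips over the missing position 2 and lands on witness B at position 3, because it has fewer gaps than the parallel witness A.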

Coptic SCRIPTORIUM at the Coptic Congress

Much of the Coptic SCRIPTORIUM team is in Claremont this week for the Congress of the International Association of Coptic Studies.

We started out with a pre-conference, 2-day workshop with our KELLIA partners from Germany, where we worked on sharing data and technologies across digital Coptic projects.  Look here soon for an announcement about a really cool fruit of our labors.

Thursday there are two panels, and Friday there are two workshops.

Thursday 2-4 pm Coptic Digital Studies (Burkle 16)

David Brakke chair 

Prof. Dr. Caroline Schroeder, Coptic SCRIPTORIUM: A Digital Platform for Research in Coptic Language and Literature

Dr. Christine Luckritz Marquis, Reimagining the Apophthegmata Patrum in a Digital Culture

Prof. Amir Zeldes, A Quantitative Approach to Syntactic Alternations in Sahidic

Dr. Rebecca Krawiec, Charting Rhetorical Choices in Shenoute: Abraham our Father and I See Your Eagerness as case-studies

Thursday 4:30-6:30 Coptic Digital Humanities (Burkle 16)

Caroline T. Schroeder, Chair

Dr. Paul Dilley, Coptic Scriptorium beyond the Manuscript: Towards a Distant Reading of Coptic Texts

Mr. So Miyagawa and Dr. Marco Büchler, Computational Analysis of Text Reuse in Shenoute and Besa

Mr. Uwe Sikora, Text Encoding – Opportunities and Challenges

Ms. Eliese-Sophia Lincke, Optical Character Recognition (OCR) for Coptic. Testing Automated Digitization of Texts with OCRopy

 

Friday 11-12:30 Workshop on Coptic Fonts & Coptic Bible (AA)

Christian Askeland, Frank Feder

Friday 4:30-6 Digital Tools for Beginners (Workshop on Coptic SCRIPTORIUM)

Caroline T. Schroeder, Amir Zeldes, Rebecca S. Krawiec

Full, machine-annotated New Testament Corpus updated

We’ve updated and re-released our fully machine-annotated New Testament corpus. sahidica.nt V2.1.0 contains the Sahidica NT text from Warren Wells’s Sahidica online NT, with the following features:

  • Annotated with our latest NLP tools (part of speech tagger 1.9, tokenizer 4.1.0; the language tagger and lemmatizer include lexical entries from the Database and Dictionary of Greek Loanwords in Coptic (DDGLC))
  • Now contains the morph layer (annotating compound words and Coptic morphs such as ⲣⲉϥ- ⲙⲛⲧ- ⲁⲧ-)
  • Visualizations for linguistic analysis

Please keep in mind that this fully machine-annotated corpus is more accurate than previous versions but will nonetheless contain more errors than a corpus manually corrected by a human.

Search and queries

For searches and queries using our ANNIS database to find specific terms, for this corpus we recommend searching the normalized words using regular expressions (to capture instances of the desired word that may still be embedded in a Coptic bound group, instances that our tokenizer may have missed):

Lemma searches are now also possible.  You may wish to search for the lemma using regular expressions, as well, in order to find lemmas of some compound words.  For example, the following search will find entries containing ⲥⲱⲧⲙ in the lemma:

The results include various forms of ⲥⲱⲧⲙ (including ⲥⲟⲧⲙ) lemmatized to the lexical entry “ⲥⲱⲧⲙ”, compound words lemmatized to ⲥⲱⲧⲙ or to a lexical entry containing ⲥⲱⲧⲙ, and some bound groups containing the word form ⲥⲱⲧⲙ, which our tokenizer did not catch:

Frequency table of normalized words lemmatized to ⲥⲱⲧⲙ or a lemma form containing ⲥⲱⲧⲙ (May 2016 Sahidica corpus)

As you can see, most of the hits are accurate (e.g., ⲥⲟⲧⲙ, ⲁⲧⲥⲱⲧⲙ, ⲣⲁⲧⲥⲱⲧⲙ, ⲣⲉϥⲥⲱⲧⲙ); some of the Coptic bound groups did not tokenize properly (e.g., ⲉⲡⲥⲱⲧⲙ, ⲙⲁⲣⲟⲩⲥⲱⲧⲙ).  We expect accuracy to increase as we incorporate more texts into our corpora that have been machine annotated and then manually edited.
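
The difference between an exact search and a regular expression search can be tried out offline with a small Python sketch. It rests on one assumption: that a regular expression in ANNIS has to match the whole annotation value, which is why the pattern needs .* on both sides of the word. Python’s re.fullmatch stands in for that behaviour here, and the word list is simply the set of forms mentioned above, not real query output. In AQL such a pattern would be written between slashes, for instance norm=/.*ⲥⲱⲧⲙ.*/; treat that exact form as an illustration rather than the query used in the original post.

```python
import re


def whole_value_match(pattern: str, value: str) -> bool:
    """True if the pattern matches the entire value (assumed ANNIS-like behaviour)."""
    return re.fullmatch(pattern, value) is not None


# Normalized forms taken from the discussion above (illustrative only).
forms = ["ⲥⲱⲧⲙ", "ⲁⲧⲥⲱⲧⲙ", "ⲣⲉϥⲥⲱⲧⲙ", "ⲉⲡⲥⲱⲧⲙ", "ⲙⲁⲣⲟⲩⲥⲱⲧⲙ", "ⲥⲟⲧⲙ"]

# An exact search only finds the bare word.
exact = [f for f in forms if whole_value_match("ⲥⲱⲧⲙ", f)]

# Wildcards on both sides also find the word inside compounds and
# inside bound groups the tokenizer did not split.
embedded = [f for f in forms if whole_value_match(".*ⲥⲱⲧⲙ.*", f)]

print(len(exact), "exact,", len(embedded), "with wildcards")
```

Note that ⲥⲟⲧⲙ is not caught by the pattern on the word form itself, since it is spelled with a different vowel; it only turns up when the same pattern is applied to the lemma.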

Reading by individual chapter

You can also read these documents and see the linguistic analysis visualizations at data.copticscriptorium.org/urn:cts:copticLit:nt.  The first documents you will see (Gospel of Mark, 1 Corinthians) are manually annotated.  Scroll down for “New Testament,” which is the full, machine-annotated Sahidica New Testament.  Click on “Chapter” to read each chapter as normalized Coptic (with English translation as a pop-up when you hover your cursor).  Click on “Analytic” for the normalized Coptic, part of speech analysis, and English translation for each chapter.  Please keep in mind the English translation provided is a free, open-access New Testament translation from the World English Bible; it is not a direct translation from the Coptic.

Note:  we know that our server is slow generating the documents for this corpus.  It may take several minutes to load; please be patient.  For faster access, use ANNIS.  Visualizations to read the chapters are available by clicking on the corpus and the icon for visualizations.

Accessing document visualizations of the Sahidica corpus via ANNIS

We hope this corpus is useful to researchers.

ANNIS embeds for websites and blogs

Starting this week, there’s a new feature in our ANNIS web interface: ANNIS embeds.

The ANNIS interface can now give you an HTML snippet that you can embed on your webpage, blog post and more.

Here’s an example of an embedded visualizer for a passage from Besa’s letter to Aphthonia, in which he recounts Aphthonia’s threat to go to another monastery:

(MONB.BA 47, urn:cts:copticLit:besa.aphthonia.monbba)

The code snippet for this visualization is as follows:

<iframe src="https://corpling.uis.georgetown.edu/annis/?id=31e2a273-426f-4aaf-922a-7fa0f0b311e1" width="100%" height="500"></iframe>

To get an embeddable snippet, click the share icon at the top left of your search result and choose the visualization you want. To share the entire set of results, use the share button at the top right of the results page. Additionally, if you want to share an ANNIS search result via e-mail, you can still copy and paste the URL as before, but now you can also get a specific shareable link for individual hits using the same share button.

Let us know if you have any feedback!

New web application to read documents, cite data, and access data (BETA release)

We’re very excited to announce a new feature at Coptic SCRIPTORIUM.  We’ve created a new online web application that we think will allow users to read and reference our material much more easily.

Users can read Coptic documents on HTML pages taken from the data visualizations.  There are also easy links to our search tool ANNIS and to our GitHub repository for downloading files.

And we have a system of canonical URNs that provide persistent identifiers for documents, texts, authors, and text groups. This means you can cite our data in your scholarship, and then readers will be able to go back to our site and find our most recent versions of the documents you have cited.
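
For readers who want to handle these identifiers in their own scripts, here is a minimal Python sketch that splits a URN into its parts and builds a link back to the site. It assumes the general CTS layout of colon-separated fields (urn, cts, a namespace, a dotted work component, and an optional passage) and assumes that a document can be reached by appending the URN to the site address. The CtsUrn type and both function names are made up for this example and are not part of any Coptic SCRIPTORIUM tool.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str          # e.g. "copticLit"
    work: str               # dotted work component, e.g. "besa.aphthonia.monbba"
    passage: Optional[str]  # optional passage reference, if present


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS-style URN into namespace, work component, and passage."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    passage = parts[4] if len(parts) > 4 and parts[4] else None
    return CtsUrn(namespace=parts[2], work=parts[3], passage=passage)


def document_url(urn: str, base: str = "http://data.copticscriptorium.org/") -> str:
    """Build a link by appending the URN to the site address (assumed pattern)."""
    parse_cts_urn(urn)  # raises ValueError if the URN is malformed
    return base + urn


parsed = parse_cts_urn("urn:cts:copticLit:besa.aphthonia.monbba")
print(parsed.namespace, parsed.work.split("."))
print(document_url("urn:cts:copticLit:nt"))
```

Splitting the work component on dots gives the nested levels of the identifier, from the broadest grouping down to the individual manuscript witness.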

We’ve got a little video to introduce it, or dive right in at http://data.copticscriptorium.org.

This is a BETA release, which means you might see a few things that need to be ironed out. (For one thing, our small corpus of documentary papyri is not yet in the system; stay tuned, and in the meantime you can still read and query them in ANNIS.) We are pretty pleased with how it’s turning out and look forward to future developments.

Many thanks to Bridget Almas of the Perseus Digital Library for helping us develop a canonical referencing system, and to Archimedes Digital for implementing the application.

 

 
