Tag: data visualization

A bird’s eye view of Coptic entities

Coptic Scriptorium recently annotated its Treebank for entities and will soon use automated tools to annotate all corpora. Entity recognition provides a window into what a text discusses, allowing readers to discover information about people and places of interest found throughout a large number of texts that they could not possibly read exhaustively. The Coptic Scriptorium team has developed a number of tools to visualize and search for entities, which you can browse here:

https://copticscriptorium.org/entities/breakdown.html

Already, we are seeing some interesting trends. Let’s take a closer look!

TreeMapping

Entities are divided into two broad categories: named and non-named. Named entities are headed by a proper noun, e.g., “Apa Pamoun” or “Scetis.” Non-named entities, which constitute the majority of annotations, are headed by a common noun, e.g., “the monk” or “the monastery.” All entities, whether named or non-named, have one of ten entity types, such as ‘person’, ‘place’ and more (see our previous post). In the image below, we see the TreeMap of non-named places. With a nested view of data such as this, one can easily see patterns that may be missed when viewing the information in another format.

A TreeMap of unnamed places

Let’s look at the TreeMap data for non-named place entities. The desert holds an unparalleled place (no pun intended) in Coptic literature, but what exactly do Coptic texts say about it? One click of the mouse shows all eighteen mentions of entities headed by ϫⲁⲓⲉ ‘desert’ (see image below). We can see every instance of the word on the same screen and compare usages. Another search does the same for all entities headed by the Greek word ⲉⲣⲏⲙⲟⲥ ‘desert’ (that is, excluding adjectival usages). If you want to continue this line of inquiry and read every single instance of ‘desert’ in its larger context, a search for these entities in ANNIS (this function is coming soon) would display every mention in the Coptic corpora, allowing one to quickly see the texts in which these words appear and how they are used.

A TreeMap of ϫⲁⲓⲉ ‘desert’

Entity Term Networking

Entity Term Networks provide a graphic visualization of an entity’s relationships with other words in its span. For example, let’s look at ⲙⲁ ‘place.’ From the outset, we see that ⲙⲁ is used with a wide variety of determiners and is followed by an even wider variety of constructions, but we simultaneously see that attributive adjectives, such as ⲙⲁ ⲛϣ(ⲱ)ⲱⲡⲉ ‘dwelling place, monk’s cell,’ are more commonly used with ⲙⲁ than relative or genitive constructions. The entity network for ⲙⲁ gives us a clearer idea of its potential semantic relationships: it is almost always followed by ⲛ ‘of’, continuing to nouns indicating purpose (place of dwelling, lavatory with ⲣⲙⲏ ‘urination’), events (ϣⲉⲗⲉⲉⲧ ‘wedding’), directions (ϣⲁ ‘East’) and more. Try pulling up the network for other Coptic nouns yourself! As with the TreeMap, the network presents a large amount of data in a small space, revealing patterns and their relative frequency more readily.

Lexical network of words in entities headed by ⲙⲁ ‘place’

Entity Type Proportions

Entity proportions compare entity types among the corpora, visualizing them with a ratio. An average ratio is provided for all Coptic corpora and for a sample of English fiction, so viewers can see how far any given corpus departs from either baseline. The chart below sets the ratio of animals and people side by side. If you are interested in late-antique animals, you may be a little disappointed: they appear only sparsely in the corpora. Any other combination of entity types can be juxtaposed in the same way. Looking through the data, it is clear that the Coptic average has a consistently higher ratio of abstract entities than its English fiction counterpart, perhaps reflecting the monastic origin of much of the Coptic corpora.

Entity Type Proportion of Animals and People
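For readers who would like to try this kind of comparison on their own data, here is a minimal sketch of how entity type proportions and a ratio between two types might be computed. It is not the code behind our charts: the function names and the toy list of labels are invented for illustration, and only the type names (‘person’, ‘place’, ‘animal’, ‘abstract’) come from the discussion above.

```python
from collections import Counter


def type_proportions(entity_types):
    """Return each entity type's share of all entity mentions."""
    counts = Counter(entity_types)
    total = sum(counts.values())
    return {etype: n / total for etype, n in counts.items()}


def ratio(proportions, numerator, denominator):
    """Ratio of one entity type to another; None if the denominator type is absent."""
    denom = proportions.get(denominator, 0)
    if denom == 0:
        return None
    return proportions.get(numerator, 0) / denom


# Invented toy data: one label per annotated entity in a hypothetical corpus.
toy_corpus = ["person"] * 6 + ["place"] * 2 + ["abstract"] + ["animal"]

props = type_proportions(toy_corpus)
print(props)
print(ratio(props, "animal", "person"))
```

Running the same two functions over a second list of labels (say, a baseline sample) would give the pair of numbers that a chart like the one above sets side by side.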

Named/Non-Named by Corpus

The last visualization compares the ratio of named to non-named entities in each corpus. Once again, there is much variation between individual works, including those of the same genre (cf. The Life of Cyrus and The Life of Onnophrius). Such differences in ratio may indicate where differences in content lie, pointing the way toward further research: this surprising difference between saints’ lives may merit more attention.

What’s Next?

Entity annotation makes detailed philological, literary, and historical inquiries into a large number of documents possible by enabling analysis of texts based on the quantity, proportion and dispersion of entity types. Entity annotations allow us to describe texts at the level of ‘who did what to whom’ and to abstract away from individual ways of phrasing references to people and places. We’re looking forward to releasing more tools and data for working with Coptic entities!

Fall 2019 Corpora Release 3.0.0

Coptic Scriptorium is happy to announce our latest data release, including a variety of new sources thanks to our collaborators (digitized data courtesy of the Marcion and PAThs projects!). New in this release are:

  • Saints’ lives
    • Life of Cyrus
    • Life of Onnophrius
    • Lives of Longinus and Lucius
    • Martyrdom of Victor the General (part 2)
  • Miscellaneous
    • Dormition of John
    • Homilies of Proclus
    • Letter of Pseudo-Ephrem

We are also releasing expansions to some of our existing corpora, including:

  • Canons of Johannes (new material annotated by Elizabeth Platte and Caroline T. Schroeder, digital edition provided by Diliana Atanassova)
  • Apophthegmata Patrum
  • A large number of corrections to most of our existing corpora, which are being republished in this release.

All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold).

You can search all corpora using ANNIS and download the data in four formats (relANNIS database files, PAULA XML files, TEI XML files, and SGML files in TreeTagger format): browse on GitHub. If you just want to read works, cite project data or browse metadata, you can use our updated repository browser, the Canonical Text Services browser and URN resolver:

http://data.copticscriptorium.org/

Our total annotated corpora are now at over 850,000 words; corpora that have human editors who reviewed the machine annotations are now over 150,000!

We would like to thank Marcion, PAThs and the National Endowment for the Humanities for supporting us – we hope this release will be useful and are already working on more!

Online lexicon linked to our corpora!

We have a great announcement today.  Along with our German research partners as part of the KELLIA project, we are releasing an online Coptic lexicon linked to our corpora.

For over three years, the Berlin-Brandenburg Academy of Sciences has been working on a digital lexicon for Coptic.  Frank Feder began the work, encoding definitions for Coptic lemmas in three languages: English, French, and German. The final entries were completed by Maxim Kupreyev at the academy and Julien Delhez in Göttingen.  The base lexicon file is encoded in TEI-XML.  This summer Amir Zeldes and his student, Emma Manning, created a web interface.  We will release the source code soon as part of the KELLIA project.

It may still need some refinements and updates, but we think it is a useful achievement that will help anyone interested in Coptic.

Entries have definitions in French, German, and English.

You can use the lexicon as a standalone website.  For the pilot launch, it’s on the Georgetown server, but make no mistake, this is a major research outcome for the BBAW.

We’ve also linked the dictionary to our texts in Coptic SCRIPTORIUM.  You can click on the ANNIS icon in the dictionary entry to search all corpora in Coptic SCRIPTORIUM for that word.

The link also goes in the other direction.  In the normalized visualization of our texts, you can click on a word and get taken to the entry for that word’s lemma in the dictionary.  You can do this in the normalized visualization in our web application for reading and accessing texts (pictured below), or in the normalized visualization embedded in the ANNIS tool.


Of course there will be refinements and developments to come.  We would love to hear your feedback on what works, what could work better, and where you find glitches.

On a more personal note, when Amir and I first came up with the idea for the project, we dreamed of creating a Perseus Digital Library for Coptic.  This dictionary is a huge step forward.  And honestly, I myself had almost nothing to do with this piece of the project.  It’s an example of the importance and power of collaboration.

New feature + texts in our corpora: Apophthegmata, I See Your Eagerness

We are very excited to release new versions of two of our corpora in time for the Coptic Congress.  And keep reading to learn about a new feature on our website.

As usual, we provide a diplomatic transcription of the texts’ manuscripts, normalized text for ease of reading, and an analytic visualization with the normalized text and part of speech tags in our web application.  Plus you’ll see buttons to search the corpora in our database or download our digital files.

Apophthegmata Patrum

The Apophthegmata Patrum now contains 36 published Sayings.  New ones include

This release also marks the first contributions of our newest editor, Dr. Dana Lampe.  Dana earned her Ph.D. at the Catholic University of America and is beginning a postdoc at Creighton in the fall.

I See Your Eagerness

We also are releasing a huge new chunk of Shenoute’s sermon, I See Your Eagerness.  These texts were transcribed and collated primarily by David Brakke (with some by Stephen Emmel).  We thank David for his  generous donation of his transcriptions to the project!  Senior Editor Rebecca Krawiec has digitized and annotated these transcriptions.

Please begin your read of I See Your Eagerness with the fragment from codex MONB.GL 9-10.   Or you can search it in our search & visualization tool ANNIS.

We now have over 9000 words of this text digitized and annotated!

New: “Next” & “Previous” Buttons on Document visualizations

We’ve got a new feature in our web application:  the “next” and “previous” buttons near the top of the text.

“Next” takes you to the next document for this work; if there is a lacuna, you’ll be taken to the next extant witness we’ve digitized.  If there are multiple, parallel witnesses, you’ll be taken to the witness we’ve identified as the best or clearest witness (typically based on the number of lacunae).

The same is true for the “Previous” button.

If you want to review the parallel witness(es), check out the metadata field for each document called “witness.”  If a parallel witness exists, it will be listed; if we have digitized the witness, the URN for the witness will be listed.  You can enter the URN in the box at the top of our website to retrieve the document.
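To make the navigation rule concrete, here is a rough sketch of how a “Next” button could choose its target: skip over any lacuna to the nearest later passage that survives, and among parallel witnesses prefer the one with the fewest lacunae. This is only an illustration of the rule as described above, not our application’s code; the Witness class, its fields, and the placeholder identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Witness:
    urn: str       # hypothetical identifier for a digitized document
    position: int  # hypothetical order of the passage within the work
    lacunae: int   # hypothetical count of gaps in this witness


def next_document(current: Witness, witnesses: List[Witness]) -> Optional[Witness]:
    """Pick the target of a "Next" button under the assumptions above."""
    later = [w for w in witnesses if w.position > current.position]
    if not later:
        return None  # nothing extant after the current document
    # Jump over any gap to the nearest later position that survives.
    nearest = min(w.position for w in later)
    candidates = [w for w in later if w.position == nearest]
    # Among parallel witnesses, prefer the one with the fewest lacunae.
    return min(candidates, key=lambda w: w.lacunae)


# Placeholder data: position 2 is lost, position 3 survives in two witnesses.
doc_a = Witness("urn:example:witness-a", 1, 0)
doc_b = Witness("urn:example:witness-b", 3, 5)
doc_c = Witness("urn:example:witness-c", 3, 2)

print(next_document(doc_a, [doc_a, doc_b, doc_c]).urn)
```

A “Previous” button would mirror this, looking at earlier positions and taking the nearest one.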

ANNIS embeds for websites and blogs

Starting this week, there’s a new feature in our ANNIS web interface: ANNIS embeds.

The ANNIS interface can now give you an HTML snippet that you can embed on your webpage, blog post and more.

Here’s an example of an embedded visualizer for a passage from Besa’s letter to Aphthonia, in which he recounts Aphthonia’s threat to go to another monastery:

(MONB.BA 47, urn:cts:copticLit:besa.aphthonia.monbba)

The code snippet for this visualization is as follows:

<iframe src="https://corpling.uis.georgetown.edu/annis/?id=31e2a273-426f-4aaf-922a-7fa0f0b311e1" width="100%" height="500"></iframe>

To get an embeddable snippet, click the share icon at the top left of your search result and choose the visualization you want. To share the entire set of results, use the share button at the top right of the results page. Additionally, if you want to share an ANNIS search result via e-mail, you can still copy and paste the URL as before, but now you can also get a specific shareable link for individual hits using the same share button.
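If you embed many passages, you may prefer to generate the snippet yourself rather than copy it each time. Here is a small sketch that rebuilds the iframe markup from a share ID. It assumes the URL pattern seen in the example above (the base address followed by an id parameter) stays the same; the function name and its defaults are our own invention, and it writes an explicit closing tag because iframe is not a self-closing element in HTML.

```python
from html import escape
from urllib.parse import urlencode

# Base address taken from the example snippet above.
ANNIS_BASE = "https://corpling.uis.georgetown.edu/annis/"


def annis_iframe(share_id: str, width: str = "100%", height: str = "500") -> str:
    """Build an iframe snippet for an ANNIS share ID (hypothetical helper)."""
    src = ANNIS_BASE + "?" + urlencode({"id": share_id})
    return '<iframe src="{}" width="{}" height="{}"></iframe>'.format(
        escape(src, quote=True),
        escape(width, quote=True),
        escape(height, quote=True),
    )


snippet = annis_iframe("31e2a273-426f-4aaf-922a-7fa0f0b311e1")
print(snippet)
```

The share ID itself still has to come from the share button in ANNIS; this only wraps it in the markup.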

Let us know if you have any feedback!

New web application to read documents, cite data, and access data (BETA release)

We’re very excited to announce a new feature at Coptic SCRIPTORIUM.  We’ve created a new online web application that we think will allow users to read and reference our material much more easily.

Users can read Coptic documents on HTML pages taken from the data visualizations.  There are also easy links to our search tool ANNIS and to our GitHub repository for downloading files.

And we have a system of canonical URNs that provide persistent identifiers for documents, texts, authors, and text groups.   This means you can cite our data in your scholarship, and then readers will be able to go back to our site and find our most recent versions of the documents you have cited.
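For those who want to work with these identifiers in their own scripts, here is a minimal sketch of splitting a CTS URN into its parts. It follows the general CTS URN layout (urn:cts:namespace:textgroup.work.version, with an optional passage after a further colon) and is not taken from our application; the class and function names are invented for illustration, and any component beyond the version is simply ignored.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]
    passage: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into components (illustrative, not a full validator)."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError("not a CTS URN: " + urn)
    namespace, work_part = parts[2], parts[3]
    # The passage component is optional and may be empty.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    pieces = work_part.split(".")
    textgroup = pieces[0]
    work = pieces[1] if len(pieces) > 1 else None
    version = pieces[2] if len(pieces) > 2 else None
    return CtsUrn(namespace, textgroup, work, version, passage)


parsed = parse_cts_urn("urn:cts:copticLit:besa.aphthonia.monbba")
print(parsed)
```

A shorter URN that stops at the text group or the work parses the same way, with the missing levels returned as None.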

We’ve got a little video to introduce it, or dive right in at http://data.copticscriptorium.org.

This is a BETA release, which means you might see a few things that need to be ironed out.  (For one thing, our small corpus of documentary papyri is not yet in the system. Stay tuned, and in the meantime you can still read and query them in ANNIS.)  We are pretty pleased with how it’s turning out and look forward to future developments.

Many thanks to Bridget Almas of the Perseus Digital Library for helping us develop a canonical referencing system, and to Archimedes Digital for implementing the application.