Tag: corpora (page 2 of 3)

Coptic Treebank Released

Yesterday we published the first public version of the Coptic Universal Dependency Treebank. This resource is the first syntactically annotated corpus of Coptic, containing complete analyses of each sentence in over 4,300 words of Coptic excerpts from Shenoute, the New Testament and the Apophthegmata Patrum.

To get an idea of the kind of analysis that Treebank data gives us, compare the following examples of an English and a Coptic dependency syntax tree. In the English tree below, the subject and object of the verb ‘depend’ on the verb for their grammatical function: the nominal subject (nsubj) is “I”, and the direct object (dobj) is “cat”.

figure: English dependency tree

We can quickly find out what’s going on in a sentence or ‘who did what to whom’ by looking at the arrows emanating from each word. The same holds for this Coptic example, which uses the same Universal Dependencies annotation schema, allowing us to compare English and Coptic syntax.

He gave them to the poor
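
Reading ‘who did what to whom’ off the arrows can be sketched in a few lines of Python. This is only an illustration: the sentence, the `Token` class, and the helper names are invented for the example (they are not taken from the Treebank or its tools), and only the `nsubj` and `dobj` labels come from the discussion above.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    form: str     # the word as written
    head: int     # position of the governing word, 0 for the root
    deprel: str   # dependency label, e.g. nsubj or dobj


# Hypothetical English sentence, invented for this sketch.
sentence = [
    Token(1, "I", 2, "nsubj"),
    Token(2, "saw", 0, "root"),
    Token(3, "the", 4, "det"),
    Token(4, "cat", 2, "dobj"),
]


def dependents(tokens, head_idx, deprel):
    """Return the forms attached to head_idx with the given label."""
    return [t.form for t in tokens if t.head == head_idx and t.deprel == deprel]


def who_did_what(tokens):
    """Summarize the root verb with its subject(s) and direct object(s)."""
    root = next(t for t in tokens if t.head == 0)
    return {
        "verb": root.form,
        "nsubj": dependents(tokens, root.idx, "nsubj"),
        "dobj": dependents(tokens, root.idx, "dobj"),
    }


summary = who_did_what(sentence)
print(summary)
```

Because English and Coptic trees share the same label set, the same lookup would apply to a Coptic sentence encoded in this way.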

Treebanks are an essential component of linguistic research, but they also enable a variety of Natural Language Processing technologies to be used on a language. Beyond automatically parsing text to produce more analyzed data, we can use syntax trees for information extraction and entity recognition. For example, the first tree below shows us that “the Presbyter of Scetis” is a coherent entity (a subgraph, headed by a noun); the incorrect analysis following it would suggest Scetis is not part of the same unit as the Presbyter, meaning we could be dealing with a different person.

One time, the Presbyter of Scetis went…

One time, the Presbyter went from Scetis… (incorrect!)
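
The difference between the two analyses can be made concrete with a small Python sketch that collects everything hanging off a head noun. The token tuples below are a simplified English gloss invented for illustration, and the dependency labels on them are assumptions rather than the Treebank's actual annotation; the point is only that changing one head changes the extracted entity.

```python
# Each token is (position, form, head position, label); head 0 marks the root.
# Hypothetical, simplified gloss of the two analyses discussed above.
correct = [
    (1, "the", 2, "det"),
    (2, "Presbyter", 5, "nsubj"),
    (3, "of", 4, "case"),
    (4, "Scetis", 2, "nmod"),   # Scetis attaches to the noun
    (5, "went", 0, "root"),
]
incorrect = [
    (1, "the", 2, "det"),
    (2, "Presbyter", 5, "nsubj"),
    (3, "of", 4, "case"),
    (4, "Scetis", 5, "nmod"),   # Scetis attaches to the verb instead
    (5, "went", 0, "root"),
]


def subtree(tokens, head_idx):
    """Return the words of the subgraph headed by head_idx, in text order."""
    ids = {head_idx}
    changed = True
    while changed:  # keep adding dependents until nothing new is found
        changed = False
        for idx, _form, head, _label in tokens:
            if head in ids and idx not in ids:
                ids.add(idx)
                changed = True
    return " ".join(form for idx, form, _head, _label in tokens if idx in ids)


entity_correct = subtree(correct, 2)
entity_incorrect = subtree(incorrect, 2)
print(entity_correct)
print(entity_incorrect)
```

In the first analysis the entity headed by “Presbyter” includes Scetis; in the second it does not, which is why the attachment decision matters for entity recognition.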

To find out more about this resource, check out the new Coptic Treebank webpage. And to read where the Presbyter of Scetis went, go to this URN: urn:cts:copticLit:ap.19.monbeg.

New Besa fragment published

We’ve published another small fragment of Besa on Coptic SCRIPTORIUM.  So Miyagawa has edited and translated the letter fragment known as On Lack of Food.  Read it online or search the letters of Besa we have published.

New born-digital edition of a Shenoute fragment

This winter we’ve released a new document we’ve been working on for a while.  It’s a born-digital publication, in the sense that, to our knowledge, this document has never been published before.  The edition and annotations here were produced by Elizabeth Platte (Reed College) and Rebecca S. Krawiec (Canisius College) directly from digital photographs of the manuscript for digital publication.

Read the manuscript transcription or the normalized text, or query it in our database.

It’s a section of one of Shenoute’s texts for monks in volume three of his monastic Canons.  This 14-page (seven-folio) fragment now resides in the Bibliothèque Nationale in Paris and originally derives from the White Monastery codex known by the siglum MONB.YB.  We’ve released text and annotations for pages 307-320, which equate to the BN call number Ms Copte 130/2 ff. 51-57.  Digital photos are now available online at Gallica.

We’ve transcribed the text from images of the manuscript and then annotated it for manuscript information.  We’ve also broken the text down into the Coptic phrases known as “bound groups,” words, and morphs.  Then we’ve annotated it all for part of speech, loan words (Greek, Latin, etc.), and lemmas.

By “we” I mean primarily Platte and Krawiec.  Schroeder and Zeldes provided editorial review, as per our policy of having every published digital document reviewed by at least one editor.

As far as we know, this fragment has never been published; nor has any translation ever been published.  We don’t have a translation yet, either.

As the first born-digital edition, this document is an experiment for us.  Everything else we’ve worked with has been published in an edition, and sometimes even has an English translation that another scholar has published.  Even though we digitize from the original manuscript, previous editions and translations make the transcription, annotation, and editing process much easier.  This document is an unknown quantity.

This means we expect to have errors and welcome feedback on the document.

We also have no translation as of yet.  Our goal is to translate the document and then edit the transcription and annotations again as we work.  We hope to publish an essay on how the digital annotation process affected the creation of an edition.

In the meantime, use it to practice your Coptic.  Let us know if you find errors.  We’ll credit you.

Coptic NLP pipeline Part 2

With the creation of the Coptic NLP (Natural Language Processing) pipeline by Amir Zeldes, it is now possible to run all our NLP tools at once, without downloading and running each one individually. The web application tokenizes bound groups into words and normalizes the spelling of words and diacritics. It also tags for part of speech, lemmatizes, and tags borrowed (foreign) words for language of origin. The interface is XML tolerant (it preserves tags in the input), and the output is tagged in SGML. One option encodes the line breaks within a word or sentence, which is useful for encoding manuscripts. Please double-check the results, however, because the interface is still in beta.

As an example, the screenshot below is a snippet from I See Your Eagerness from manuscript MONB.GL29.

 

screenshot: input snippet from I See Your Eagerness

Notice it contains an XML tag to encode a letter as “large ekthetic”. The tag marks the letter alpha as a large character set in the left margin of the manuscript’s column of text.  This tag will be preserved in the output.

screenshot: pipeline output for the snippet

The results are shown above. Bound groups are shown along with the part-of-speech tag, abbreviated as “pos”. The snippet from I See Your Eagerness has also been lemmatized, shown as “lemma”. Near the bottom of the screenshot, the language of origin of borrowed (foreign) words in the snippet has been identified as “Greek”.  These tags correspond to the annotation layers you see in our multi-layer search and visualization tool ANNIS.

We hope the NLP service serves you well.

 

Introducing the Lemmatizer Tool

A new tool available at the Coptic SCRIPTORIUM webpage is the lemmatizer. The lemmatizer annotates words with their dictionary head word. The purpose of lemmatization is to group together the different inflected forms of a word so they can be analyzed as a single item.

For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, and ‘walking’. The base form, ‘walk’, might be the word to look up in the dictionary, and it would be called the lemma for the word.

In Coptic, some nouns have distinct plural forms, and verbs appear in several different forms.  A lemmatized corpus is useful for searching all the forms of a word at once, and for linking all the forms of a word to an online dictionary.
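
As a rough illustration of what grouping by lemma means, here is a toy Python sketch. The lookup table is invented for the example (it uses the English forms of ‘walk’ from above and two forms of the Coptic verb ⲟⲩⲱϩ mentioned in this post); it is not how the Coptic SCRIPTORIUM lemmatizer works internally.

```python
from collections import defaultdict

# Toy form-to-lemma table, assumed for illustration only.
LEMMAS = {
    "walk": "walk",
    "walked": "walk",
    "walks": "walk",
    "walking": "walk",
    "ⲟⲩⲱϩ": "ⲟⲩⲱϩ",
    "ⲟⲩⲏϩ": "ⲟⲩⲱϩ",
}


def group_by_lemma(words):
    """Group word forms under their lemma; unknown forms act as their own lemma."""
    groups = defaultdict(list)
    for word in words:
        groups[LEMMAS.get(word, word)].append(word)
    return dict(groups)


grouped = group_by_lemma(["walked", "ⲟⲩⲏϩ", "walks", "ⲟⲩⲱϩ", "walking"])
print(grouped)
```

Searching for a lemma then amounts to retrieving every form stored under that key.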

Two of the corpora we have are annotated with lemmas: Not because a fox barks (Shenoute) and the Apophthegmata. As illustrated in the image below, I have searched for ⲟⲩⲱϩ, to live or dwell.

screenshot: ANNIS search for the lemma ⲟⲩⲱϩ

Also note that in the corpus list, I have chosen to look in the corpus ‘Not Because a Fox Barks’, as indicated by the highlighted blue selection.

scriptorium ANNIS Corpus Search

Notice that the word forms corresponding to the lemma I have searched for become highlighted in the corpus that was chosen.  Two forms of the verb ⲟⲩⲱϩ appear in the results:  ⲟⲩⲱϩ and ⲟⲩⲏϩ.  In addition, there is also an annotation grid.

Desktop screenshot

Clicking on the annotation grid reveals a wealth of information, including the translation of the text along with its parts of speech. Hovering over the text with your mouse also lets you find related parts. For example, below the POS (part of speech) is V (verb), and when the mouse hovers over V, a highlight indicates which word in the text the tag refers to.

screenshots: annotation grid with a part-of-speech tag highlighted

The tool is a feature in our part-of-speech tagger, so you can lemmatize at the same time you annotate a corpus for parts of speech.  See https://github.com/CopticScriptorium/tagger-part-of-speech/.

Additional guidelines are available here:  https://github.com/CopticScriptorium/tagger-part-of-speech/blob/master/Coptic%20SCRIPTORIUM%20lemmatization%20guidelines.pdf

New web application to read documents, cite data, and access data (BETA release)

We’re very excited to announce a new feature at Coptic SCRIPTORIUM.  We’ve created a new online web application that we think will allow users to read and reference our material much more easily.

Users can read Coptic documents on HTML pages taken from the data visualizations.  There are also easy links to our search tool ANNIS and to our GitHub repository for downloading files.

And we have a system of canonical URNs that provide persistent identifiers for documents, texts, authors, and text groups.   This means you can cite our data in your scholarship, and then readers will be able to go back to our site and find our most recent versions of the documents you have cited.
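
To show how such an identifier breaks down, here is a minimal Python sketch that splits a URN cited elsewhere on this page, urn:cts:copticLit:ap.19.monbeg, into its parts. The field names (`textgroup`, `work`, `version`, `exemplar`, `passage`) follow the general CTS URN convention as we understand it; the function itself is hypothetical and is not part of the web application.

```python
def parse_cts_urn(urn):
    """Split a CTS-style URN into a namespace, work hierarchy, and optional passage."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")

    parsed = {"namespace": parts[2]}
    # The work component is dot-separated, from the most general level down.
    level_names = ["textgroup", "work", "version", "exemplar"]
    parsed.update(zip(level_names, parts[3].split(".")))
    # A fifth colon-separated field, when present, points to a passage.
    parsed["passage"] = parts[4] if len(parts) > 4 else None
    return parsed


parsed = parse_cts_urn("urn:cts:copticLit:ap.19.monbeg")
print(parsed)
```

Because the identifier names the text rather than a file location, a citation built on it can keep resolving to the latest version of the document.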

We’ve got a little video to introduce it, or dive right in at http://data.copticscriptorium.org.

This is a BETA release, which means you might see a few things that need to be ironed out.  (For one thing, our small corpus of documentary papyri is not yet in the system; stay tuned, and in the meantime you can still read and query them in ANNIS.)  We are pretty pleased with how it’s turning out and look forward to future developments.

Many thanks to Bridget Almas of the Perseus Digital Library for helping us develop a canonical referencing system, and to Archimedes Digital for implementing the application.

 

 

Download release of all corpora in TEI XML, PAULA XML, relANNIS

We’ve released some new corpora (the papyri.info texts, for example) and added some new documents to our existing corpora.  You can download everything from our GitHub repository in three different formats: TEI XML, PAULA XML, and relANNIS.

Releasing new translation of section of Shenoute’s Acephalous Work 22

An English translation of part of Shenoute’s Acephalous Work 22 is available.  Anthony Alcock of the University of Kassel has contributed a translation of White Monastery Manuscript YA (MONB.YA) pages 421-28. This section corresponds to Leipoldt’s vol. 4, pp. 124-29. Coptic, English, and various annotations are available. Many thanks to Dr. Alcock for the contribution! We are in the process of a major addition to our website functionality, to enable you to read and find these texts more easily. In the meantime, you can access the text via our ANNIS search and visualization tool.  Click on the little page icon next to the shenoute.a22 corpus listing to see the visualizations.

List of corpora in ANNIS

Read the English translation directly in the linguistic analysis view; read it as a pop-up when you hover over the Coptic in the normalized view.

screenshot: list of visualizations in ANNIS

Or search the English in ANNIS using a search string; to search for the word “work” in the English translations of Acephalous Work 22, use translation=/.*work.*/.
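
The leading and trailing `.*` matter if, as we understand ANNIS regular-expression searches, the pattern has to match the entire annotation value rather than a substring. The Python sketch below imitates that behavior with `re.fullmatch`; the sample translation values are invented for illustration, and the snippet does not query ANNIS itself.

```python
import re

# Hypothetical translation values, invented for this sketch.
values = ["the work of the Lord", "workers", "he labored"]

# Whole-value matching: the pattern must cover the full string.
with_wildcards = re.compile(r".*work.*")
without_wildcards = re.compile(r"work")

matches_with = [v for v in values if with_wildcards.fullmatch(v)]
matches_without = [v for v in values if without_wildcards.fullmatch(v)]

print(matches_with)
print(matches_without)
```

Note that the wildcard pattern also matches longer words that contain “work”, such as “workers”.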

(Originally posted in March 2015 at http://copticscriptorium.org/)

Entire Sahidica New Testament now available

The entire Sahidica New Testament (machine-annotated) is now available. It has been tokenized and tagged for part of speech entirely automatically, using our tools. There has been no manual editing or correction. Visit our corpora for more information, or just jump in and search it in ANNIS.

 

(Originally posted in March 2015 at http://copticscriptorium.org/)

Corpora and how to use ANNIS

Coptic SCRIPTORIUM provides Coptic texts for reading, analysis, and complex searches. For a full list of our text corpora, please click here. We have also added explanations of the people and terms we mention to our main site. A video tutorial by Amir Zeldes and Caroline T. Schroeder on how to search our database using the tool ANNIS is also available.

 

(Originally posted in December of 2014 at http://copticscriptorium.org/)


© 2018
