Tag: NLP

Coptic Treebank Released

Yesterday we published the first public version of the Coptic Universal Dependency Treebank. This resource is the first syntactically annotated corpus of Coptic, containing complete analyses of each sentence in over 4,300 words of Coptic excerpts from Shenoute, the New Testament and the Apophthegmata Patrum.

To get an idea of the kind of analysis that Treebank data gives us, compare the following examples of an English and a Coptic dependency syntax tree. In the English tree below, the subject and object of the verb ‘depend’ on the verb for their grammatical function – the nominal subject (nsubj) is “I”, and the direct object (dobj) is “cat”.

[Figure: English dependency tree (image: cat_mat)]

We can quickly find out what’s going on in a sentence or ‘who did what to whom’ by looking at the arrows emanating from each word. The same holds for this Coptic example, which uses the same Universal Dependencies annotation schema, allowing us to compare English and Coptic syntax.

He gave them to the poor
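
As a rough illustration of reading ‘who did what to whom’ off a tree, here is a minimal Python sketch. It assumes a simplified token record (position, form, head, relation) and a hypothetical English sentence: only the labels nsubj for “I” and dobj for “cat” come from the description above, while the verb “saw” and the `Token` class are invented for the example and are not taken from the treebank.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    form: str     # word form
    head: int     # position of the governing token (0 marks the root)
    deprel: str   # dependency relation to the head


# Hypothetical sentence: the verb "saw" is an assumption made for this sketch.
sentence = [
    Token(1, "I", 2, "nsubj"),
    Token(2, "saw", 0, "root"),
    Token(3, "the", 4, "det"),
    Token(4, "cat", 2, "dobj"),
]


def who_did_what(tokens):
    """Return (subject, verb, object) for the root verb; missing parts are None."""
    root = next(t for t in tokens if t.head == 0)

    def dependent(rel):
        # First direct dependent of the root carrying the given relation label.
        return next(
            (t.form for t in tokens if t.head == root.idx and t.deprel == rel),
            None,
        )

    return dependent("nsubj"), root.form, dependent("dobj")


print(who_did_what(sentence))
```

Because both languages use the same relation labels, the same function would apply unchanged to a Coptic sentence encoded in this way.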

Treebanks are an essential component of linguistic research, but they also enable a variety of Natural Language Processing technologies to be used on a language. Beyond automatically parsing text to produce more analyzed data, we can use syntax trees for information extraction and entity recognition. For example, the first tree below shows us that “the Presbyter of Scetis” is a coherent entity (a subgraph, headed by a noun); the incorrect analysis following it would suggest Scetis is not part of the same unit as the Presbyter, meaning we could be dealing with a different person.

One time, the Presbyter of Scetis went...

One time, the Presbyter went from Scetis... (incorrect!)
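
The difference between the two analyses can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the English gloss from the captions rather than the Coptic tokens, and the head assignments are our own guess at how the two trees attach “Scetis”; the `subtree` and `entity` helpers are hypothetical names, not part of any treebank tool.

```python
def subtree(heads, root):
    """Collect `root` and all of its transitive dependents."""
    members = {root}
    changed = True
    while changed:
        changed = False
        for tok, head in heads.items():
            if head in members and tok not in members:
                members.add(tok)
                changed = True
    return members


def entity(words, heads, root):
    """Spell out the subgraph headed by `root` in sentence order."""
    return " ".join(words[i - 1] for i in sorted(subtree(heads, root)))


# English gloss of the example; positions are 1-based, head 0 marks the root.
words = ["One", "time", "the", "Presbyter", "of", "Scetis", "went"]

# Assumed attachments: in the correct tree "Scetis" depends on "Presbyter" (4),
# in the incorrect tree it depends on the verb "went" (7).
correct = {1: 2, 2: 7, 3: 4, 4: 7, 5: 6, 6: 4, 7: 0}
incorrect = {1: 2, 2: 7, 3: 4, 4: 7, 5: 6, 6: 7, 7: 0}

print(entity(words, correct, 4))    # subgraph headed by "Presbyter"
print(entity(words, incorrect, 4))
```

With the first set of heads the noun's subgraph covers the whole phrase; with the second, “of Scetis” falls outside it, which is exactly the entity-recognition problem described above.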

To find out more about this resource, check out the new Coptic Treebank webpage. And to read where the Presbyter of Scetis went, go to this URN: urn:cts:copticLit:ap.19.monbeg.

Full, machine-annotated New Testament Corpus updated

We’ve updated and re-released our fully machine-annotated New Testament corpus. sahidica.nt V2.1.0 contains the Sahidica NT text from Warren Wells’s Sahidica online NT, with the following features:

  • Annotated with our latest NLP tools (part-of-speech tagger 1.9, tokenizer 4.1.0, language tagger and lemmatizer); the language tagger and lemmatizer include lexical entries from the Database and Dictionary of Greek Loanwords in Coptic (DDGLC)
  • Now contains the morph layer (annotating compound words and Coptic morphs such as ⲣⲉϥ-, ⲙⲛⲧ-, ⲁⲧ-)
  • Visualizations for linguistic analysis

Please keep in mind that this fully machine-annotated corpus is more accurate than previous versions but will nonetheless contain more errors than a corpus manually corrected by a human.

Search and queries

When querying this corpus in our ANNIS database for specific terms, we recommend searching the normalized words using regular expressions (to capture instances of the desired word that are still embedded in a Coptic bound group because our tokenizer missed them):

Lemma searches are now also possible.  You may wish to search for the lemma using regular expressions, as well, in order to find lemmas of some compound words.  For example, the following search will find entries containing ⲥⲱⲧⲙ in the lemma:

The results include various forms of ⲥⲱⲧⲙ (including ⲥⲟⲧⲙ) lemmatized to the lexical entry “ⲥⲱⲧⲙ”, compound words lemmatized to ⲥⲱⲧⲙ or to a lexical entry containing ⲥⲱⲧⲙ, and some bound groups containing the word form ⲥⲱⲧⲙ, which our tokenizer did not catch:

Frequency table of normalized words lemmatized to ⲥⲱⲧⲙ or a lemma form containing ⲥⲱⲧⲙ (May 2016 Sahidica corpus)

As you can see, most of the hits are accurate (e.g., ⲥⲟⲧⲙ, ⲁⲧⲥⲱⲧⲙ, ⲣⲁⲧⲥⲱⲧⲙ, ⲣⲉϥⲥⲱⲧⲙ); some of the Coptic bound groups did not tokenize properly (e.g., ⲉⲡⲥⲱⲧⲙ, ⲙⲁⲣⲟⲩⲥⲱⲧⲙ).  We expect accuracy to increase as we incorporate more texts into our corpora that have been machine annotated and then manually edited.
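
To show why a regular expression finds more than an exact lemma match, here is a small Python sketch. It is plain Python, not ANNIS query syntax, and the list of (normalized form, lemma) pairs is a hand-made sample: the word forms are those mentioned above plus one unrelated word, while the lemma values for the compounds and the untokenized bound group are assumptions made for illustration.

```python
import re

# (normalized form, lemma) pairs; lemma values are illustrative assumptions.
sample = [
    ("ⲥⲱⲧⲙ", "ⲥⲱⲧⲙ"),
    ("ⲥⲟⲧⲙ", "ⲥⲱⲧⲙ"),          # variant form lemmatized to the lexical entry
    ("ⲁⲧⲥⲱⲧⲙ", "ⲁⲧⲥⲱⲧⲙ"),      # compound with its own lemma
    ("ⲣⲉϥⲥⲱⲧⲙ", "ⲣⲉϥⲥⲱⲧⲙ"),    # compound with its own lemma
    ("ⲉⲡⲥⲱⲧⲙ", "ⲉⲡⲥⲱⲧⲙ"),      # bound group the tokenizer did not split
    ("ⲛⲟⲩⲧⲉ", "ⲛⲟⲩⲧⲉ"),        # unrelated word, should never match
]

# Match any lemma that contains the target string anywhere inside it.
pattern = re.compile(".*ⲥⲱⲧⲙ.*")

exact = [norm for norm, lemma in sample if lemma == "ⲥⲱⲧⲙ"]
fuzzy = [norm for norm, lemma in sample if pattern.fullmatch(lemma)]

print("exact lemma match:", exact)
print("regex lemma match:", fuzzy)
```

The regex search returns the compounds and the untokenized bound group as well, which is the trade-off described above: better recall at the cost of a few hits that need a second look.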

Reading by individual chapter

You can also read these documents and see the linguistic analysis visualizations at data.copticscriptorium.org/urn:cts:copticLit:nt.  The first documents you will see (Gospel of Mark, 1 Corinthians) are manually annotated.  Scroll down for “New Testament,” which is the full, machine-annotated Sahidica New Testament.  Click on “Chapter” to read each chapter as normalized Coptic (with English translation as a pop-up when you hover your cursor).  Click on “Analytic” for the normalized Coptic, part of speech analysis, and English translation for each chapter.  Please keep in mind the English translation provided is a free, open-access New Testament translation from the World English Bible; it is not a direct translation from the Coptic.

Note: we know that our server is slow to generate the documents for this corpus. They may take several minutes to load; please be patient. For faster access, use ANNIS. Visualizations to read the chapters are available by clicking on the corpus and the icon for visualizations.

Accessing document visualizations of the Sahidica corpus via ANNIS

We hope this corpus is useful to researchers.

New Coptic morphological analysis

A new component has been added to the Coptic NLP pipeline at:

https://corpling.uis.georgetown.edu/coptic-nlp/

This adds morphological analysis of complex word forms, including multiple affixes (e.g. derived nouns with affixes such as Coptic ‘mnt’, equivalent to English ‘-ness’), compounds (noun-noun combinations) and complex verbs. Using the automatic morphological analysis will substantially reduce the amount of manual work involved in putting new texts online, meaning we will be able to concentrate on getting more texts out there faster, as well as developing new tools and ways of interacting with the data.
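
To give a feel for what splitting a complex form into morphs involves, here is a deliberately naive Python sketch. It is not how the pipeline component works: it simply strips a fixed, assumed list of derivational prefixes from the front of a word, and the `segment` function and its prefix list are hypothetical. A real analyzer needs a lexicon to avoid splitting words that merely happen to begin with the same letters.

```python
# Assumed prefix inventory for this sketch only.
PREFIXES = ("ⲙⲛⲧ", "ⲣⲉϥ", "ⲁⲧ")


def segment(word, prefixes=PREFIXES):
    """Greedily strip known prefixes from the left; return morphs plus the stem."""
    morphs = []
    rest = word
    stripped = True
    while stripped:
        stripped = False
        for prefix in prefixes:
            # Always leave at least one character for the stem.
            if rest.startswith(prefix) and len(rest) > len(prefix):
                morphs.append(prefix + "-")
                rest = rest[len(prefix):]
                stripped = True
                break
    return morphs + [rest]


print(segment("ⲣⲉϥⲥⲱⲧⲙ"))
print(segment("ⲙⲛⲧⲁⲧⲥⲱⲧⲙ"))
print(segment("ⲥⲱⲧⲙ"))
```

The second call shows multiple affixes being peeled off in sequence, the kind of stacked derivation mentioned above.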

© 2017
