Author: Amir Zeldes

New release of the Coptic Treebank

Coptic Treebank release 2.1, now with three Letters of Besa!

We are pleased to announce the release of the latest version of the Coptic Treebank, now containing three Letters of Besa:

  • On Lack of Food
  • To Aphthonia
  • To Thieving Nuns

This brings the total corpus size up to 10,499 tokens, thanks to annotation work by Elizabeth Davidson and Amir Zeldes, building on earlier transcription and tagging work by Coptic Scriptorium and KELLIA partners. Special thanks are due to So Miyagawa for providing the transcription for On Lack of Food. The corpus will continue to grow as we work to annotate more data and improve the accuracy of our automatic syntax parser for Coptic. You can search the current version of the corpus in ANNIS here:

https://corpling.uis.georgetown.edu/annis/scriptorium

Or download the latest raw annotated data from GitHub here:

https://github.com/universalDependencies/UD_Coptic/tree/dev
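
Universal Dependencies repositories distribute their data as CoNLL-U files, so once you have downloaded a file you can count its tokens with a few lines of Python. The following is a minimal sketch, not the project's own tooling: the function name and the toy English sentence are illustrative assumptions, and the counting convention (one token per syntactic word, skipping multiword-token ranges and empty nodes) may differ from the one behind the figure quoted above.

```python
def count_tokens(conllu_text):
    """Count syntactic words in CoNLL-U text.

    Skips comment lines, multiword-token ranges (IDs like 2-3)
    and empty nodes (IDs like 2.1).
    """
    count = 0
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        token_id = line.split("\t")[0]
        if token_id.isdigit():
            count += 1
    return count


# Toy English sentence for illustration only (not taken from the treebank).
SAMPLE = (
    "# text = I cannot go\n"
    "1\tI\tI\tPRON\t_\t_\t4\tnsubj\t_\t_\n"
    "2-3\tcannot\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tcan\tcan\tAUX\t_\t_\t4\taux\t_\t_\n"
    "3\tnot\tnot\tPART\t_\t_\t4\tadvmod\t_\t_\n"
    "4\tgo\tgo\tVERB\t_\t_\t0\troot\t_\t_\n"
)

print(count_tokens(SAMPLE))
```

To use it on real data, read a downloaded .conllu file into a string and pass it to count_tokens.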

Please let us know if you find any errors or have any feedback on the treebank!

Old Testament corpus release

We are happy to announce the release of the automatically annotated Sahidic Old Testament corpus (corpus identifier: sahidic.ot), based on the version of the available texts kindly provided by the CrossWire Bible Society SWORD Project thanks to work by Christian Askeland, Matthias Schulz and Troy Griffitts.

The corpus is available for search in ANNIS, much like the Sahidica New Testament corpus, together with word segmentation, morphological analysis, language of origin for loanwords, part of speech tagging and automatically aligned verse translations (except for parts of Jeremiah). Please expect some errors, due to the fully automatic analysis in the corpus. The aligned translation is taken from the World English Bible. Here is an example search for the word ‘soul’:

norm="ⲯⲩⲭⲏ"

You can also read entire chapters in ANNIS or at our repository, which look like this:

urn:cts:copticLit:ot.gen.crosswire:09

 

We hope that this resource will be helpful to Coptic scholars – please let us know if you have any questions or comments!

 

New release – Coptic Treebank V2

We are happy to announce the release of version 2 of the Coptic Universal Dependency Treebank. With over 8,500 tokens from 14 documents, the Treebank is the largest syntactically annotated resource in Coptic. The annotation scheme follows the Universal Dependency Guidelines, version 2, and is therefore comparable with UD data from 70 treebanks in 50 languages, including English, Latin, Classical Greek, Arabic, Hebrew and more.

You can search the Treebank using ANNIS. For example, the following query finds cases of verbs dominating a complement clause (e.g. “say … that …”):

pos="V" ->dep[func="ccomp"] norm

[Link to this query]
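
If you prefer to work with the downloaded treebank files rather than ANNIS, roughly the same search can be sketched in Python over CoNLL-U data. This is an illustration under stated assumptions: the function name is hypothetical, the example sentence is a toy English one rather than treebank data, and it assumes that the ANNIS pos annotation corresponds to the language-specific tag column (XPOS) with the value "V".

```python
def find_ccomp_verbs(conllu_text, verb_tag="V"):
    """Return (verb form, complement head form) pairs where a token
    tagged verb_tag governs another token via the 'ccomp' relation."""
    hits = []
    sentence = {}
    # The extra empty string flushes the last sentence.
    for line in conllu_text.splitlines() + [""]:
        line = line.strip()
        if line.startswith("#"):
            continue
        if not line:
            for tok in sentence.values():
                head = sentence.get(tok["head"])
                if tok["deprel"] == "ccomp" and head and head["xpos"] == verb_tag:
                    hits.append((head["form"], tok["form"]))
            sentence = {}
            continue
        cols = line.split("\t")
        if cols[0].isdigit():  # skip multiword ranges and empty nodes
            sentence[cols[0]] = {
                "form": cols[1],
                "xpos": cols[4],
                "head": cols[6],
                "deprel": cols[7],
            }
    return hits


# Toy English sentence; the "V" tags are an assumption mirroring the query.
SAMPLE = (
    "# text = She said that he left\n"
    "1\tShe\tshe\tPRON\tPRON\t_\t2\tnsubj\t_\t_\n"
    "2\tsaid\tsay\tVERB\tV\t_\t0\troot\t_\t_\n"
    "3\tthat\tthat\tSCONJ\tCONJ\t_\t5\tmark\t_\t_\n"
    "4\the\the\tPRON\tPRON\t_\t5\tnsubj\t_\t_\n"
    "5\tleft\tleave\tVERB\tV\t_\t2\tccomp\t_\t_\n"
)

print(find_ccomp_verbs(SAMPLE))
```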

Coptic Treebank Released

Yesterday we published the first public version of the Coptic Universal Dependency Treebank. This resource is the first syntactically annotated corpus of Coptic, containing complete analyses of each sentence in over 4,300 words of Coptic excerpts from Shenoute, the New Testament and the Apophthegmata Patrum.

To get an idea of the kind of analysis that Treebank data gives us, compare the following examples of an English and a Coptic dependency syntax tree. In the English tree below, the subject and object of the verb ‘depend’ on the verb for their grammatical function – the nominal subject (nsubj) is “I”, and the direct object (dobj) is “cat”.

[Figure: dependency tree of the English example sentence]

We can quickly find out what’s going on in a sentence or ‘who did what to whom’ by looking at the arrows emanating from each word. The same holds for this Coptic example, which uses the same Universal Dependencies annotation schema, allowing us to compare English and Coptic syntax.

He gave them to the poor

Treebanks are an essential component for linguistic research, but they also enable a variety of Natural Language Processing technologies to be used on a language. Beyond automatically parsing text to make some more analyzed data, we can use syntax trees for information extraction and entity recognition. For example, the first tree below shows us that “the Presbyter of Scetis” is a coherent entity (a subgraph, headed by a noun); the incorrect analysis following it would suggest Scetis is not part of the same unit as the Presbyter, meaning we could be dealing with a different person.

One time, the Presbyter of Scetis went...

One time, the Presbyter went from Scetis... (incorrect!)
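
The idea of an entity as a subgraph can be sketched in Python: collect a head word together with everything that depends on it, directly or indirectly. The two analyses below are simplified English glosses written for illustration (punctuation omitted, attachments assumed), not the Coptic annotation itself, and the function name is hypothetical.

```python
def subtree_yield(tokens, head_id):
    """Return the words of the subtree rooted at head_id, in sentence order.

    tokens is a list of (id, form, head) tuples; head 0 marks the root.
    """
    members = {head_id}
    changed = True
    while changed:  # keep adding dependents until nothing new is found
        changed = False
        for tok_id, _, head in tokens:
            if head in members and tok_id not in members:
                members.add(tok_id)
                changed = True
    return " ".join(form for tok_id, form, _ in tokens if tok_id in members)


# "Scetis" attached to "Presbyter": one coherent entity.
CORRECT = [
    (1, "One", 2), (2, "time", 7), (3, "the", 4), (4, "Presbyter", 7),
    (5, "of", 6), (6, "Scetis", 4), (7, "went", 0),
]
# "Scetis" attached to the verb instead: the entity falls apart.
INCORRECT = [
    (1, "One", 2), (2, "time", 7), (3, "the", 4), (4, "Presbyter", 7),
    (5, "of", 6), (6, "Scetis", 7), (7, "went", 0),
]

print(subtree_yield(CORRECT, 4))    # the Presbyter of Scetis
print(subtree_yield(INCORRECT, 4))  # the Presbyter
```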

To find out more about this resource, check out the new Coptic Treebank webpage. And to read where the Presbyter of Scetis went, go to this URN: urn:cts:copticLit:ap.19.monbeg.

ANNIS embeds for websites and blogs

Starting this week, there’s a new feature in our ANNIS web interface: ANNIS embeds.

The ANNIS interface can now give you an HTML snippet that you can embed on your webpage, blog post and more.

Here’s an example of an embedded visualizer for a passage from Besa’s letter to Aphthonia, in which he recounts Aphthonia’s threat to go to another monastery:

(MONB.BA 47, urn:cts:copticLit:besa.aphthonia.monbba)

The code snippet for this visualization is as follows:

<iframe src="https://corpling.uis.georgetown.edu/annis/?id=31e2a273-426f-4aaf-922a-7fa0f0b311e1" width="100%" height="500"></iframe>

To get an embeddable snippet, click the share icon at the top left of your search result and choose the visualization you want. To share the entire set of results, use the share button at the top right of the results page. Additionally, if you want to share an ANNIS search result via e-mail, you can still copy and paste the URL as before, but now you can also get a specific shareable link for individual hits using the same share button.
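
If you embed many visualizations, you may want to generate the snippets from the IDs that ANNIS gives you. The helper below is a small sketch, not part of ANNIS: the function name and defaults are assumptions, and the URL pattern is simply the one seen in the example above. It writes an explicit closing tag, since iframe is not a void element in HTML.

```python
from html import escape
from urllib.parse import urlencode

# Base URL as it appears in the example snippet above.
ANNIS_BASE = "https://corpling.uis.georgetown.edu/annis/"


def annis_iframe(embed_id, width="100%", height="500"):
    """Build an iframe snippet for an ANNIS embed ID (hypothetical helper)."""
    src = ANNIS_BASE + "?" + urlencode({"id": embed_id})
    return '<iframe src="{}" width="{}" height="{}"></iframe>'.format(
        escape(src, quote=True),
        escape(str(width), quote=True),
        escape(str(height), quote=True),
    )


print(annis_iframe("31e2a273-426f-4aaf-922a-7fa0f0b311e1"))
```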

Let us know if you have any feedback!

New Coptic morphological analysis

A new component has been added to the Coptic NLP pipeline at:

https://corpling.uis.georgetown.edu/coptic-nlp/

This adds morphological analysis of complex word forms, including multiple affixes (e.g. derived nouns with affixes such as Coptic ‘mnt’, equivalent to English ‘-ness’), compounds (noun-noun combinations) and complex verbs. Using the automatic morphological analysis will substantially reduce the amount of manual work involved in putting new texts online, meaning we will be able to concentrate on getting more texts out there faster, as well as developing new tools and ways of interacting with the data.

New Coptic NLP pipeline

The entire tool chain of Coptic Natural Language Processing has been difficult to get running for some: it involves a bunch of command line tools, and special attention needed to be paid to coordinating word division expectations between the tools (normalization, tagging, language of origin detection). In order to make this process simpler, we now offer a Web interface that lets you paste in Coptic text and run all tools on the input automatically, without installing anything. You can find the interface here:

https://corpling.uis.georgetown.edu/coptic-nlp/

The pipeline is XML tolerant (preserves tags in the input) and there’s also a machine actionable API version for external software to use these resources. Please let the Scriptorium team know if you’re using the pipeline and/or run into any problems.

 

Happy processing!

© 2017
