A new tool available at the Coptic SCRIPTORIUM webpage is the lemmatizer. The lemmatizer annotates words with their dictionary head word. The purpose of lemmatization is to group together the different inflected forms of a word so they can be analyzed as a single item.
For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, and ‘walking’. The base form, ‘walk’, might be the word to look up in the dictionary, and it would be called the lemma for the word.
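The grouping idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FORM_TO_LEMMA` table and the `group_by_lemma` helper are hypothetical names invented for this example, and a plain lookup table is a stand-in, not a description of how the Coptic SCRIPTORIUM lemmatizer works.

```python
from collections import defaultdict

# Hypothetical toy table mapping inflected forms to their lemma.
# It only covers the English example from the text above.
FORM_TO_LEMMA = {
    "walk": "walk",
    "walked": "walk",
    "walks": "walk",
    "walking": "walk",
}


def group_by_lemma(tokens):
    """Group tokens under their lemma; unknown words act as their own lemma."""
    groups = defaultdict(list)
    for tok in tokens:
        lemma = FORM_TO_LEMMA.get(tok.lower(), tok.lower())
        groups[lemma].append(tok)
    return dict(groups)


groups = group_by_lemma(["She", "walks", "while", "he", "walked"])
print(groups["walk"])
```

Once the forms are grouped this way, counting or searching for ‘walk’ covers every inflected form in a single step.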
In Coptic, some nouns have distinct plural forms, and verbs also appear in several different forms. A lemmatized corpus lets you search for all the forms of a word at once. It is also useful if you want to link all the forms of a word to an online dictionary later on.
Two of our corpora are annotated with lemmas: Not Because a Fox Barks (Shenoute) and the Apophthegmata. In the image below, I have searched for the lemma ⲟⲩⲱϩ, ‘to live or dwell’.
Also note that in the corpus list, I have chosen to look in the corpus ‘Not Because a Fox Barks’, as indicated by the highlighted blue selection.
Notice that the word forms corresponding to the lemma I searched for are highlighted in the chosen corpus. Two forms of the verb ⲟⲩⲱϩ appear in the results: ⲟⲩⲱϩ and ⲟⲩⲏϩ. There is also an annotation grid.
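What a lemma search does can be sketched roughly as follows. The `Token` type, the `search_by_lemma` helper, and the four-token sample are all assumptions made up for illustration; the sample is not drawn from the corpus, and this is not the search interface's own query mechanism.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # the word as it appears in the text
    lemma: str  # its dictionary head word


# Made-up sample for illustration only; not a passage from the corpus.
tokens = [
    Token("ⲣⲱⲙⲉ", "ⲣⲱⲙⲉ"),
    Token("ⲟⲩⲏϩ", "ⲟⲩⲱϩ"),
    Token("ⲏⲓ", "ⲏⲓ"),
    Token("ⲟⲩⲱϩ", "ⲟⲩⲱϩ"),
]


def search_by_lemma(tokens, lemma):
    """Return (position, form) for every token annotated with the given lemma."""
    return [(i, t.form) for i, t in enumerate(tokens) if t.lemma == lemma]


hits = search_by_lemma(tokens, "ⲟⲩⲱϩ")
for position, form in hits:
    print(position, form)
```

Because the match is made on the lemma annotation and not on the surface form, both ⲟⲩⲱϩ and ⲟⲩⲏϩ are returned by the same query.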
Clicking on the annotation grid reveals further information, including a translation of the text and its part-of-speech tags. Hovering over an annotation with the mouse highlights the parts of the text related to it. For example, below, the POS (part of speech) is V (verb), and when the mouse hovers over V, a highlight indicates which word in the text the tag refers to.
The lemmatizer is a feature of our part-of-speech tagger, so you can lemmatize a corpus at the same time as you annotate it for parts of speech. See https://github.com/CopticScriptorium/tagger-part-of-speech/.
Additional guidelines are available here: https://github.com/CopticScriptorium/tagger-part-of-speech/blob/master/Coptic%20SCRIPTORIUM%20lemmatization%20guidelines.pdf
October 5, 2015 at 3:35 pm
Just chiming in to add: The Gospel of Mark corpus is now also lemmatized. The rest will follow soon!
Amir