(This post is part of a series on our 2019 summer’s work improving processing for non-standardized Coptic resources)
The first step in processing heterogeneous data in Coptic is deciding when words should be spelled together. As we described in part I, this is a problem because there are no spaces in original Coptic manuscripts, and editorial standards for how to segment words have varied through the centuries.
Moreover, even within a single edition, there may be inconsistencies in word segmentation. Take, for instance, the word ⲉⲃⲟⲗ (ebol), ‘out, outward’. This word is historically a combination of the preposition ⲉ (e), ‘to, towards’, and the noun ⲃⲟⲗ (bol), ‘outside, exterior’. In some editions, such as those by W. Budge, it is variously spelled as either one (ebol) or two (e bol) words, as in this example from the Asketikon of Apa Ephraim:
ⲡⲃⲱⲗ ⲉⲃⲟⲗ ⲉ ⲡⲉϩⲟⲩⲟ ϩⲛ̅ ⲟⲩϩⲓⲛⲏⲃ ⲉϥϩⲟⲣϣ̅
pbōl ebol e pehouo hn ouhineb efhorš
`<translation>`ⲙⲏ ⲙ̅ⲡ ϥ̅ⲃⲱⲗ ⲉ ⲃⲟⲗ ⲛ̅ ⲛⲉⲥⲛⲁⲩϩ
mē mp fbōl e bol n nesnauh
`<translation>`(Lines 12.2, 15.25 in Asketikon of Apa Ephraim. Transcribed by the Marcion project, based on W. Budge’s (1914) edition.)
Up until 2017, we had no automatic tools to ensure consistent word separation, and up until recently we used only a simple approach based on relative probability of being spelled apart: words spelled apart less than 90% of the time in our existing data were attached to the following word. For instance, across 4470 occurrences, the word ⲉ (e) was attached to the following word ~92% of the time. That is above 90%, so our simple system would always attach an ⲉ (e) to the following word, regardless of context. This approach is capable of effectively dealing with common cases such as prepositions, but it is incapable of handling more complex cases, e.g. where identically-spelled words exhibit different behaviors, or a word has never been seen before.
In summer of 2019, we set out to develop new machine learning tools for solving this whitespace normalization problem. We first considered the most obvious way to frame this problem, as a sequence to sequence (seq2seq) prediction problem: given a sequence of Coptic characters, predict another sequence of Coptic characters, hopefully with spaces inserted in the right places.
The problem is that seq2seq models require a lot of annotated data, much more than we had on hand. At the time, we only had on the order of tens of thousands of words’ worth of hand-normalized text from the type of edition shown in the example. We found that this was much too little data for any usual seq2seq model, like an LSTM (a Long-Short Term Memory neural network).
The key to progress was to observe that in most editions, the difference in whitespace was that there were too many spaces: it almost never happened that two words were spelled together that should have been apart. That left just the case where two words were spelled apart that should have been put together.
The question was now, simply, “for each whitespace that did occur in the edition, should we delete it, or should we keep it?” This is a simple binary classification task, which makes the task we’re asking of the computer much less demanding: instead of asking it to produce a stream of characters, we are asking it for a simple yes/no judgment.
But what kind of information goes into a yes/no decision like this? After a lot of experimentation, we found that the answers to these questions (in addition to others) were most helpful in deciding whether to keep or delete a space in between two words:
- How common are the words on either side of the space? (Our proxy for “common”ness: how often it appears in our annotated corpora)
- How common is the word I’d get if I deleted the space between the two words?
- How long are the two words and the words around them? (This might be a hint—it’s very unlikely, for instance, that a preposition would be more than several characters long.)
- What are the parts of speech of the words around the space?
- Does the word to the right consist solely of punctuation?
We tried several machine learning algorithms using this approach. To begin with, we only had ~10,000 words of training data, which is too little for many algorithms to effectively learn. In the end, our XGBoost model ended up performing the best, with a “correctness rate” (F1) of ~99%, against the naïve baseline (keep all spaces all the time), which hovered around 78%.