
New Coptic morphological analysis

A new component has been added to the Coptic NLP pipeline at:

https://corpling.uis.georgetown.edu/coptic-nlp/

This adds morphological analysis of complex word forms, including multiple affixes (e.g. derived nouns with affixes such as Coptic ‘mnt’, equivalent to English ‘-ness’), compounds (noun-noun combinations) and complex verbs. Using the automatic morphological analysis will substantially reduce the amount of manual work involved in putting new texts online, meaning we will be able to concentrate on getting more texts out there faster, as well as developing new tools and ways of interacting with the data.

Coptic NLP pipeline Part 2

With the creation of the Coptic NLP (Natural Language Processor) pipeline by Amir Zeldes, it is now possible to run all our NLP tools at once, without downloading and running each one individually. The web application will tokenize bound groups into words, and will normalize the spelling of words and diacritics. It will also tag for part of speech, lemmatize, and tag the language of origin of borrowed (foreign) words. The interface is XML tolerant (preserves tags in the input) and the output is tagged in SGML. One of the options is to encode the line breaks in a word or sentence, which is useful for encoding manuscripts. However, the interface is still in the beta stage, so be sure to double-check the results.

As an example, the screenshot below is a snippet from I See Your Eagerness from manuscript MONB.GL29.

 

screenshot: input snippet from I See Your Eagerness (MONB.GL29)

Notice that it contains an XML tag encoding a letter as “large ekthetic”. The tag marks the alpha as a large character set in the left margin of the manuscript’s column of text. This tag will be preserved in the output.

screenshot: pipeline output for the snippet

The results are shown above. Bound groups are shown along with the part-of-speech tag, abbreviated as “pos”. The snippet from I See Your Eagerness has also been lemmatized, shown as “lemma”. Near the bottom of the screenshot, the language of origin of borrowed (foreign) words in the snippet has been identified as “Greek”. These tags correspond to the annotation layers you see in our multi-layer search and visualization tool ANNIS.

We hope the NLP service serves you well.

 

New Coptic NLP pipeline

The entire tool chain of Coptic Natural Language Processing has been difficult to get running for some: it involves several command line tools, and special attention needs to be paid to coordinating word division expectations between the tools (normalization, tagging, language of origin detection). To make this process simpler, we now offer a Web interface that lets you paste in Coptic text and run all tools on the input automatically, without installing anything. You can find the interface here:

https://corpling.uis.georgetown.edu/coptic-nlp/

The pipeline is XML tolerant (preserves tags in the input) and there’s also a machine-actionable API version for external software to use these resources. Please let the Scriptorium team know if you’re using the pipeline or run into any problems.

 

Happy processing!

Introducing the Lemmatizer Tool

A new tool available at the Coptic SCRIPTORIUM webpage is the lemmatizer. The lemmatizer annotates words with their dictionary head word. The purpose of lemmatization is to group together the different inflected forms of a word so they can be analyzed as a single item.

For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, and ‘walking’. The base form, ‘walk’, might be the word to look up in the dictionary, and it would be called the lemma for the word.

In Coptic, plural nouns sometimes have different forms, and verbs have different forms.  A lemmatized corpus is useful for searching all the forms of a word and also if you want to link all the forms of a word to an online dictionary for future use.
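
The grouping idea can be sketched in a few lines of Python. This is only an illustration, not the lemmatizer itself: the lookup table `LEMMA_TABLE` and the function `group_by_lemma` are hypothetical names made up for this example, and the table holds just the English forms of ‘walk’ mentioned above.

```python
from collections import defaultdict

# Hypothetical lookup table from inflected form to lemma (illustration only;
# the real lemmatizer is part of the part-of-speech tagger).
LEMMA_TABLE = {
    "walk": "walk",
    "walked": "walk",
    "walks": "walk",
    "walking": "walk",
}


def group_by_lemma(tokens, table):
    """Map each lemma to the positions of the tokens that belong to it."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        form = token.lower()
        # Forms missing from the table fall back to themselves as lemma.
        lemma = table.get(form, form)
        positions[lemma].append(index)
    return dict(positions)


tokens = "She walks where he walked and they are walking".split()
groups = group_by_lemma(tokens, LEMMA_TABLE)

# All three inflected forms are found through the single lemma 'walk'.
print(groups["walk"])
```

Searching by lemma then means looking up one key instead of listing every inflected form by hand.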

Two of our corpora are annotated with lemmas: Not Because a Fox Barks (Shenoute) and the Apophthegmata. As illustrated in the image below, I have searched for ⲟⲩⲱϩ, ‘to live’ or ‘to dwell’.

screenshot: ANNIS search for the lemma ⲟⲩⲱϩ

Also note that in the corpus list, I have chosen to look in the corpus ‘Not Because a Fox Barks’, as indicated by the highlighted blue selection.

scriptorium ANNIS Corpus Search

Notice that the word forms corresponding to the lemma I have searched for become highlighted in the corpus that was chosen.  Two forms of the verb ⲟⲩⲱϩ appear in the results:  ⲟⲩⲱϩ and ⲟⲩⲏϩ.  In addition, there is an annotation grid.

Desktop screenshot

Clicking on the annotation grid reveals a wealth of information, including the translation of the text and its parts of speech. Hovering over the text with your mouse also lets you find related parts. For example, below, the POS (part of speech) is V (verb), and when the mouse hovers over V, a highlight indicates which word in the text the tag refers to.

screenshots: annotation grid with the POS tag V highlighted

The tool is a feature in our part-of-speech tagger, so you can lemmatize at the same time you annotate a corpus for parts of speech.  See https://github.com/CopticScriptorium/tagger-part-of-speech/.

Additional guidelines are available here:  https://github.com/CopticScriptorium/tagger-part-of-speech/blob/master/Coptic%20SCRIPTORIUM%20lemmatization%20guidelines.pdf

Wishing the NEH a happy 50th Anniversary!

The Coptic SCRIPTORIUM team would like to wish the National Endowment for the Humanities (NEH) a happy 50th anniversary! We would like to thank the NEH for supporting Coptic SCRIPTORIUM. Cheers to the NEH!

Hiring: Digital Humanities Specialist for KELLIA and U Pacific Library

Digital Humanities Specialist at the University of the Pacific

The University of the Pacific seeks to hire a creative and collaborative Digital Humanities Specialist (DHS) to develop and manage strategies and infrastructure for curating digital and pre-digital content and data; provide computer programming support for projects; and author and/or co-author new digital humanities resources or scholarship.  This is a full-time 20-24 month pilot staff position. The DHS will work half-time contributing to the University Library’s archival and digital initiatives and half-time on an interdisciplinary NEH-funded Digital Humanities research project, KELLIA.  The DHS will report to Prof. Caroline T. Schroeder in the Department of Religious Studies and Michael Wurtz, the Head of Special Collections.

[Apply for this position at the University of the Pacific website]

KELLIA (Koptische/Coptic Electronic Language and Literature International Alliance) is an international DH project funded by the NEH and the DFG (Germany) to develop international standards and promote digital scholarship in the language and literature of ancient Egypt.  Researchers at the University of the Pacific, Georgetown University, Goettingen University, and Muenster University will be collaborating on digital methods in textual studies, linguistics, history, and manuscript studies.

The William Knox Holt Memorial Library on the Stockton campus serves a diverse community of liberal arts and professional faculty.  The Holt-Atherton Special Collections is home to several important American cultural heritage collections:  the multimedia archives of jazz legend Dave Brubeck; primary source documents from World War II Japanese-American Internment Camps; the papers of renowned naturalist and conservationist John Muir; and the papers and video archive of former San Francisco Mayor George R. Moscone.

Duties

The Digital Humanities Specialist may perform some but not all of the following duties and/or may be assigned additional duties:

  1. Develops and manages strategies and infrastructure for curating digital humanities content and data.
  2. Authors/co-authors new digital humanities resources or scholarship.
  3. Provides web development and programming for humanities research.
  4. Contributes to original research in digital humanities.
  5. Contributes to planning and decision-making about KELLIA’s technological development and long-term sustainability.
  6. Identifies, recommends, and implements linked open data technologies for humanities research.
  7. Identifies, recommends, and implements digital asset management and digital archiving in the Library.
  8. Participates in archival processing and reference duties in a special collections environment.  
  9. Designs forward-facing, interactive digital initiatives, websites, and/or exhibits.
  10. Provides library and special collections instruction.

 

QUALIFICATIONS:

Education/Work Experience/Certifications:

  • 1) MA in Digital Humanities OR 2) MLIS from an accredited ALA program or MA in Archival Studies with demonstrated digital/technological training/certification OR 3) MA in a Humanities discipline or related field with demonstrated digital/technological training or certification
  • Documented research and/or teaching experience in digital scholarship or pedagogy in a humanities discipline or related field
  • Demonstrated experience in web development and programming for research and/or teaching in the humanities or a related field (including archival studies and library and information science)

Skills/Knowledge and Expertise:

Required skills/knowledge and expertise

  • Excellent interpersonal, presentation, and communication skills
  • Demonstrated expertise in digital humanities technologies of web development (HTML, CSS, PHP, JavaScript), text encoding (XML), and programming (Python, Java)
  • Commitment to open access technologies and data for the humanities or a related field
  • Proven ability to work collaboratively in team-based initiatives
  • Proven ability to contribute to original scholarship in the humanities or a related field
  • Enthusiasm to build international and interdisciplinary research partnerships
  • Proven ability to work successfully with diverse populations and demonstrated commitment to promote and enhance diversity and inclusion
  • Knowledge of ancient languages, while welcome, is not a requirement for this position.

Preferred skills/knowledge and expertise

  • Demonstrated expertise with data curation techniques for a variety of digitized and born-digital media (text, code, images, music, etc.) and tools (e.g., DSpace, EPrints, Fedora, contentDM, etc.)
  • Demonstrated experience with linked data technologies and methodologies (e.g., JSON, RDF)
  • Experience managing CMS and LMS systems
  • Command of archival theory and best practices, especially as they relate to the particular issues posed by born-digital content.  

APPLICATION:

To apply for this position visit https://pacific.peopleadmin.com/postings/5822 and submit:

  • Letter of interest
  • CV
  • Names and contact information for 3 references

Review of applications will begin on September 1.

Questions about the position may be directed to cschroeder@pacific.edu and mwurtz@pacific.edu.  For questions about the online application process, please consult the online help system.

This position is funded by the University of the Pacific Library and the National Endowment for the Humanities (through the joint NEH-DFG bilateral Digital Humanities grant program).

New web application to read documents, cite data, and access data (BETA release)

We’re very excited to announce a new feature at Coptic SCRIPTORIUM.  We’ve created a new online web application that we think will allow users to read and reference our material much more easily.

Users can read Coptic documents on HTML pages taken from the data visualizations.  There are also easy links to our search tool ANNIS and to our GitHub repository for downloading files.

And we have a system of canonical URNs that provide persistent identifiers for documents, texts, authors, and text groups.  This means you can cite our data in your scholarship, and readers will be able to go back to our site and find our most recent versions of the documents you have cited.

We’ve got a little video to introduce it, or dive right in at http://data.copticscriptorium.org.

This is a BETA release, which means you might see a few things that need to be ironed out.  (For one thing, our small corpus of documentary papyri is not yet in the system; stay tuned, and in the meantime you can still read and query the papyri in ANNIS.)  We are pretty pleased with how it’s turning out and look forward to future developments.

Many thanks to Bridget Almas of the Perseus Digital Library for helping us develop a canonical referencing system, and to Archimedes Digital for implementing the application.

 

 

Download release of all corpora in TEI XML, PAULA XML, relANNIS

We’ve released some new corpora (the papyri.info texts, for example) and added some new documents to our existing corpora.  You can download everything from our GitHub repository in three different formats: TEI XML, PAULA XML, and relANNIS.

Releasing new translation of section of Shenoute’s Acephalous Work 22

An English Translation (by Anthony Alcock) of part of Shenoute’s Acephalous Work 22 is available.  Anthony Alcock of the University of Kassel has contributed a translation of White Monastery Manuscript YA (MONB.YA) pages 421-28. This section corresponds to Leipoldt’s vol. 4, pp. 124-29. Coptic, English, and various annotations are available. Many thanks to Dr. Alcock for the contribution! We are in the process of a major addition to our website functionality, to enable you to read and find these texts more easily. In the meantime, you can access the text via our ANNIS search and visualization tool.  Click on the little page icon next to the shenoute.a22 corpus listing to see the visualizations.


List of corpora in ANNIS

Read the English translation directly in the linguistic analysis view; read it as a pop-up when you hover over the Coptic in the normalized view.

screenshot: list of visualizations in ANNIS

Or search the English in ANNIS using a search string; to search for the word “work” in the English translations of Acephalous Work 22, use translation=/.*work.*/.
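
To see what that search string matches, here is a small Python sketch of the same regular expression. It assumes, as the surrounding `.*` suggests, that the pattern has to match the whole annotation value; the sample translations are made up for illustration and are not taken from the corpus.

```python
import re

# The pattern from the query translation=/.*work.*/
PATTERN = re.compile(r".*work.*")


def matches_translation(value):
    """True if the whole translation value matches the pattern.

    Assumption: the search anchors the pattern to the full value,
    which is modelled here with re.fullmatch.
    """
    return PATTERN.fullmatch(value) is not None


# Made-up translation values, for illustration only.
samples = [
    "the work of the Lord",
    "they were working in the field",
    "he walked away",
]

hits = [s for s in samples if matches_translation(s)]
for hit in hits:
    print(hit)
```

Note that the pattern also catches longer words such as “working”, because it only asks for the letters “work” somewhere in the value. Under the same assumption, the bare pattern `work` without the `.*` on both sides would only match a translation consisting of that single word.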

(Originally posted in March 2015 at http://copticscriptorium.org/)

Entire Sahidica New Testament now available

The entire Sahidica New Testament (machine-annotated) is now available. It has been tokenized and tagged for part of speech entirely automatically, using our tools. There has been no manual editing or correction. Visit our corpora for more information, or just jump in and search it in ANNIS.

 

(Originally posted in March 2015 at http://copticscriptorium.org/)
