Full, machine-annotated New Testament Corpus updated

May 16, 2016 / ctschroeder

We’ve updated and re-released our fully machine-annotated New Testament corpus. sahidica.nt V2.1.0 contains the Sahidica NT text from Warren Wells’s Sahidica online NT, with the following features:

Annotated with our latest NLP tools: part-of-speech tagger 1.9, tokenizer 4.1.0, and a language tagger and lemmatizer that include lexical entries from the Database and Dictionary of Greek Loanwords in Coptic (DDGLC)
Now contains the morph layer (annotating compound words and Coptic morphs such as ⲣⲉϥ-, ⲙⲛⲧ-, and ⲁⲧ-)
Visualizations for linguistic analysis

Please keep in mind that this fully machine-annotated corpus is more accurate than previous versions but will nonetheless contain more errors than a corpus manually corrected by a human.

Search and queries

To find specific terms in this corpus using our ANNIS database, we recommend searching the normalized words with regular expressions. This captures instances of the desired word that may still be embedded in a Coptic bound group because our tokenizer missed them:

Search for ⲥⲱⲧⲙ: norm=/.*ⲥⲱⲧⲙ.*/
Search for ⲉⲛⲧⲟⲗⲏ: norm=/.*ⲉⲛⲧⲟⲗⲏ.*/
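
To illustrate why the padded pattern helps, here is a minimal Python sketch. It assumes, as we understand the ANNIS query language, that a regular expression has to match the entire annotation value; Python's re.fullmatch stands in for that behavior. The helper names and the short list of normalized values are hypothetical and only for illustration; this is not ANNIS code.

```python
import re

TERM = "ⲥⲱⲧⲙ"

def containing(term: str) -> str:
    """Build a 'contains term' pattern such as .*ⲥⲱⲧⲙ.* (term is escaped)."""
    return ".*" + re.escape(term) + ".*"

def whole_value_match(pattern: str, value: str) -> bool:
    """Assumption: mimic whole-value regex matching with re.fullmatch."""
    return re.fullmatch(pattern, value) is not None

# Hypothetical normalized values: the bare word, two bound groups that
# were left untokenized, and an unrelated word.
norm_values = [TERM, "ⲉⲡ" + TERM, "ⲙⲁⲣⲟⲩ" + TERM, "ⲉⲛⲧⲟⲗⲏ"]

# An unpadded pattern only finds values that are exactly the term.
exact_hits = [v for v in norm_values if whole_value_match(re.escape(TERM), v)]

# The padded pattern also finds the term inside longer bound groups.
padded_hits = [v for v in norm_values if whole_value_match(containing(TERM), v)]

print(exact_hits)
print(padded_hits)
```

Under that assumption, the unpadded pattern returns only the bare word, while the padded one also returns the untokenized bound groups.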

Lemma searches are now also possible. You may wish to search for the lemma using regular expressions, as well, in order to find lemmas of some compound words. For example, the following search will find entries containing ⲥⲱⲧⲙ in the lemma:

lemma=/.*ⲥⲱⲧⲙ.*/

The results include various forms of ⲥⲱⲧⲙ (including ⲥⲟⲧⲙ) lemmatized to the lexical entry “ⲥⲱⲧⲙ”, compound words lemmatized to ⲥⲱⲧⲙ or to a lexical entry containing ⲥⲱⲧⲙ, and some bound groups containing the word form ⲥⲱⲧⲙ, which our tokenizer did not catch:


Frequency table of normalized words lemmatized to ⲥⲱⲧⲙ or a lemma form containing ⲥⲱⲧⲙ (May 2016 Sahidica corpus)

As you can see, most of the hits are accurate (e.g., ⲥⲟⲧⲙ, ⲁⲧⲥⲱⲧⲙ, ⲣⲁⲧⲥⲱⲧⲙ, ⲣⲉϥⲥⲱⲧⲙ); some of the Coptic bound groups did not tokenize properly (e.g., ⲉⲡⲥⲱⲧⲙ, ⲙⲁⲣⲟⲩⲥⲱⲧⲙ). We expect accuracy to increase as we incorporate more texts into our corpora that have been machine annotated and then manually edited.
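
A frequency table of this kind can also be tallied offline once hits have been exported. The following Python sketch shows one way to do it. The (norm, lemma) pairs, their counts, and the function name are hypothetical placeholders, not real corpus data or part of our tools.

```python
import re
from collections import Counter

TERM = "ⲥⲱⲧⲙ"

# Hypothetical (norm, lemma) pairs standing in for exported search hits.
# These are illustrative placeholders, NOT real counts from the corpus.
hits = [
    (TERM, TERM),
    (TERM, TERM),
    ("ⲁⲧ" + TERM, "ⲁⲧ" + TERM),    # compound word
    ("ⲣⲉϥ" + TERM, "ⲣⲉϥ" + TERM),  # compound word
    ("ⲉⲡ" + TERM, "ⲉⲡ" + TERM),    # bound group left untokenized
    ("ⲉⲛⲧⲟⲗⲏ", "ⲉⲛⲧⲟⲗⲏ"),          # unrelated lemma, filtered out below
]

def norm_frequencies(pairs, term):
    """Count normalized forms whose lemma contains `term`."""
    pattern = re.compile(".*" + re.escape(term) + ".*")
    return Counter(norm for norm, lemma in pairs if pattern.fullmatch(lemma))

freq = norm_frequencies(hits, TERM)
for norm, count in freq.most_common():
    print(norm, count)
```

Sorting by count puts the most frequent forms first, which makes untokenized bound groups easier to spot among the less frequent entries.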

Reading by individual chapter

You can also read these documents and see the linguistic analysis visualizations at data.copticscriptorium.org/urn:cts:copticLit:nt. The first documents you will see (Gospel of Mark, 1 Corinthians) are manually annotated. Scroll down for “New Testament,” which is the full, machine-annotated Sahidica New Testament. Click on “Chapter” to read each chapter as normalized Coptic (with English translation as a pop-up when you hover your cursor). Click on “Analytic” for the normalized Coptic, part of speech analysis, and English translation for each chapter. Please keep in mind the English translation provided is a free, open-access New Testament translation from the World English Bible; it is not a direct translation from the Coptic.
Note: we know that our server is slow to generate the documents for this corpus. A document may take several minutes to load; please be patient. For faster access, use ANNIS, where visualizations for reading the chapters are available by clicking on the corpus and then on the icon for visualizations.

Accessing document visualizations of the Sahidica corpus via ANNIS

We hope this corpus is useful to researchers.

