http://blog.copticscriptorium.org/2016/05/16/full-machine-annotated-new-testament-corpus-updated/
From the website:
We’ve updated and re-released our fully machine-annotated New
Testament corpus. sahidica.nt V2.1.0 contains the Sahidica NT text from
Warren Wells’s Sahidica online NT, with the following features:
- Annotated with our latest NLP tools (part-of-speech tagger 1.9,
tokenizer 4.1.0, language tagger, and lemmatizer); the language tagger and
lemmatizer include lexical entries from the Database and Dictionary of
Greek Loanwords in Coptic (DDGLC)
- Now contains the morph layer (annotating compound words and Coptic morphs such as ⲣⲉϥ-, ⲙⲛⲧ-, ⲁⲧ-)
- Visualizations for linguistic analysis
Please keep in mind that this fully machine-annotated corpus is more accurate than previous versions but will nonetheless contain more errors than a corpus manually corrected by a human.
Search and queries
To find specific terms in this corpus using our ANNIS database, we
recommend searching the normalized words using regular expressions.
This captures instances of the desired word that may still be embedded
in a Coptic bound group, that is, instances that our tokenizer may have
missed.
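As a rough sketch of the idea, the Python snippet below builds an AQL-style regular-expression query and checks locally which word forms such a pattern would match. It assumes that the normalized-word layer is named `norm` and that the pattern must match the whole annotation value; the helper names `aql_regex` and `matches_whole_value` are hypothetical and exist only for this illustration.

```python
import re

def aql_regex(layer: str, fragment: str) -> str:
    """Build an AQL-style query for values of `layer` that contain `fragment`.

    `layer` is an assumed annotation name; `fragment` is inserted as-is, so it
    should not contain regex metacharacters.
    """
    return f"{layer}=/.*{fragment}.*/"

def matches_whole_value(fragment: str, value: str) -> bool:
    """Mimic a whole-value regex match: the pattern must cover the full string."""
    return re.fullmatch(f".*{fragment}.*", value) is not None

query = aql_regex("norm", "ⲥⲱⲧⲙ")

# Word forms quoted in this post; the last one differs in its vowel.
forms = ["ⲥⲱⲧⲙ", "ⲉⲡⲥⲱⲧⲙ", "ⲙⲁⲣⲟⲩⲥⲱⲧⲙ", "ⲥⲟⲧⲙ"]
hits = [f for f in forms if matches_whole_value("ⲥⲱⲧⲙ", f)]

print(query)
print(hits)
```

Under the whole-value assumption, dropping the leading and trailing `.*` would skip the bound groups ⲉⲡⲥⲱⲧⲙ and ⲙⲁⲣⲟⲩⲥⲱⲧⲙ, which is why the wildcard form is the safer choice for this corpus.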
Lemma searches are now also possible. You may wish to search for the
lemma using regular expressions, as well, in order to find lemmas of
some compound words. For example, the following search will find
entries containing ⲥⲱⲧⲙ in the lemma:
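A query of this kind might be written as `lemma=/.*ⲥⲱⲧⲙ.*/`, assuming the lemma layer is named `lemma`. The Python sketch below contrasts an exact lemma match with a regular-expression match on a handful of toy tokens. The (normalized form, lemma) pairs are hypothetical: the word forms appear in this post, but the lemma values given for the compounds are assumptions, not corpus output.

```python
import re

# Hypothetical (normalized form, lemma) pairs for illustration only; the
# lemma values for the compounds are assumptions, not corpus output.
tokens = [
    ("ⲥⲱⲧⲙ", "ⲥⲱⲧⲙ"),
    ("ⲥⲟⲧⲙ", "ⲥⲱⲧⲙ"),
    ("ⲣⲉϥⲥⲱⲧⲙ", "ⲣⲉϥⲥⲱⲧⲙ"),
    ("ⲁⲧⲥⲱⲧⲙ", "ⲁⲧⲥⲱⲧⲙ"),
]

def exact_lemma(tokens, lemma):
    """Return normalized forms whose lemma equals `lemma` exactly."""
    return [norm for norm, lem in tokens if lem == lemma]

def regex_lemma(tokens, pattern):
    """Return normalized forms whose whole lemma matches the regex `pattern`."""
    rx = re.compile(pattern)
    return [norm for norm, lem in tokens if rx.fullmatch(lem)]

exact = exact_lemma(tokens, "ⲥⲱⲧⲙ")
broad = regex_lemma(tokens, ".*ⲥⲱⲧⲙ.*")

print(exact)
print(broad)
```

In this toy data the exact match returns only the forms lemmatized directly to ⲥⲱⲧⲙ, while the regular expression also picks up compounds whose lemma merely contains it.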
The results include various forms of ⲥⲱⲧⲙ (including ⲥⲟⲧⲙ) lemmatized to
the lexical entry “ⲥⲱⲧⲙ”, compound words lemmatized to ⲥⲱⲧⲙ or to a
lexical entry containing ⲥⲱⲧⲙ, and some bound groups containing the word
form ⲥⲱⲧⲙ, which our tokenizer did not catch:
Frequency table of normalized words lemmatized to ⲥⲱⲧⲙ or a lemma form containing ⲥⲱⲧⲙ (May 2016 Sahidica corpus)
As you can see, most of the hits are accurate (e.g., ⲥⲟⲧⲙ, ⲁⲧⲥⲱⲧⲙ,
ⲣⲁⲧⲥⲱⲧⲙ, ⲣⲉϥⲥⲱⲧⲙ), but some of the Coptic bound groups did not tokenize
properly (e.g., ⲉⲡⲥⲱⲧⲙ, ⲙⲁⲣⲟⲩⲥⲱⲧⲙ). We expect accuracy to increase as
we incorporate more texts into our corpora that have been
machine-annotated and then manually edited.
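A frequency table of this kind can be tallied from any list of matching normalized forms. The sketch below uses Python's `collections.Counter` on a small made-up hit list; the word forms come from the examples above, but the counts are invented for illustration and are not the corpus figures. The function name `frequency_table` is hypothetical.

```python
from collections import Counter

# Hypothetical hit list: one normalized form per match. The counts here are
# invented for illustration and do not reflect the Sahidica corpus.
hits = ["ⲥⲱⲧⲙ", "ⲥⲱⲧⲙ", "ⲥⲱⲧⲙ", "ⲥⲟⲧⲙ", "ⲥⲟⲧⲙ", "ⲁⲧⲥⲱⲧⲙ"]

def frequency_table(forms):
    """Return (form, count) pairs sorted from most to least frequent."""
    return Counter(forms).most_common()

table = frequency_table(hits)
for form, count in table:
    print(f"{form}\t{count}")
```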
Reading by individual chapter
You can also read these documents and see the linguistic analysis visualizations at
data.copticscriptorium.org/urn:cts:copticLit:nt.
The first documents you will see (Gospel of Mark, 1 Corinthians) are
manually annotated. Scroll down for “New Testament,” which is the full,
machine-annotated Sahidica New Testament. Click on “Chapter” to read
each chapter as normalized Coptic (with English translation as a pop-up
when you hover your cursor). Click on “Analytic” for the normalized
Coptic, part of speech analysis, and English translation for each
chapter. Please keep in mind the English translation provided is a
free, open-access New Testament translation from the World English
Bible; it is not a direct translation from the Coptic.
Note: we know that our server is slow to generate the
documents for this corpus. They may take several minutes to load; please
be patient. For faster access, use ANNIS. To read the chapters there, click on the corpus and then on the visualizations icon.
Accessing document visualizations of the Sahidica corpus via ANNIS
We hope this corpus is useful to researchers.