http://blog.copticscriptorium.org/2016/05/16/full-machine-annotated-new-testament-corpus-updated/
From the website:
We’ve updated and re-released our fully machine-annotated New
Testament corpus. sahidica.nt V2.1.0 contains the Sahidica NT text from
Warren Wells’ Sahidica online NT, with the following features:
- Annotated with our latest NLP tools (part-of-speech tagger 1.9 and tokenizer 4.1.0; the language tagger and lemmatizer include lexical entries from the Database and Dictionary of Greek Loanwords in Coptic (DDGLC))
- Now contains the morph layer (annotating compound words and Coptic morphs such as ⲣⲉϥ-, ⲙⲛⲧ-, ⲁⲧ-)
- Visualizations for linguistic analysis
Search and queries
To find specific terms in this corpus with our ANNIS database, we recommend searching the normalized words using regular expressions. This captures instances of the desired word that are still embedded in a Coptic bound group because our tokenizer missed them:
- Search for ⲥⲱⲧⲙ: norm=/.*ⲥⲱⲧⲙ.*/
- Search for ⲉⲛⲧⲟⲗⲏ: norm=/.*ⲉⲛⲧⲟⲗⲏ.*/
The results include various forms of ⲥⲱⲧⲙ (including ⲥⲟⲧⲙ) lemmatized to the lexical entry “ⲥⲱⲧⲙ”, compound words lemmatized to ⲥⲱⲧⲙ or to a lexical entry containing ⲥⲱⲧⲙ, and some bound groups containing the word form ⲥⲱⲧⲙ that our tokenizer did not catch.
Most of the hits are accurate (e.g., ⲥⲟⲧⲙ, ⲁⲧⲥⲱⲧⲙ, ⲣⲁⲧⲥⲱⲧⲙ, ⲣⲉϥⲥⲱⲧⲙ); some of the Coptic bound groups did not tokenize properly (e.g., ⲉⲡⲥⲱⲧⲙ, ⲙⲁⲣⲟⲩⲥⲱⲧⲙ). We expect accuracy to increase as we incorporate more texts into our corpora that have been machine annotated and then manually edited.
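To see why the recommended queries wrap the word in `.*` on both sides, the pattern can be tried locally. The following is a minimal sketch, not part of the Coptic Scriptorium tools: it assumes that ANNIS requires a regular expression to match the whole annotation value (emulated here with Python's `re.fullmatch`), the helper names `norm_query` and `matches_norm` are hypothetical, and the sample list is made up of word forms quoted in this post rather than actual query output.

```python
import re


def norm_query(word):
    """Build a query string of the form recommended above (hypothetical helper)."""
    return f"norm=/.*{word}.*/"


def matches_norm(word, norm_value):
    """Emulate the query locally.

    Assumption: the regex has to match the entire normalized value,
    so the surrounding .* is what lets longer bound groups through.
    """
    pattern = f".*{re.escape(word)}.*"
    return re.fullmatch(pattern, norm_value) is not None


# Illustrative normalized values taken from this post, not real ANNIS results.
forms = ["ⲥⲱⲧⲙ", "ⲁⲧⲥⲱⲧⲙ", "ⲣⲉϥⲥⲱⲧⲙ", "ⲉⲡⲥⲱⲧⲙ", "ⲙⲁⲣⲟⲩⲥⲱⲧⲙ", "ⲉⲛⲧⲟⲗⲏ"]

hits = [form for form in forms if matches_norm("ⲥⲱⲧⲙ", form)]

print(norm_query("ⲥⲱⲧⲙ"))
print(hits)
```

Under the same assumption, a pattern without the surrounding `.*` would only match a value that is exactly the word itself, so untokenized bound groups such as ⲉⲡⲥⲱⲧⲙ would be skipped.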
Reading by individual chapter
You can also read these documents and see the linguistic analysis visualizations at data.copticscriptorium.org/urn:cts:copticLit:nt. The first documents you will see (Gospel of Mark, 1 Corinthians) are manually annotated. Scroll down for “New Testament,” which is the full, machine-annotated Sahidica New Testament. Click on “Chapter” to read each chapter as normalized Coptic (with English translation as a pop-up when you hover your cursor). Click on “Analytic” for the normalized Coptic, part of speech analysis, and English translation for each chapter. Please keep in mind the English translation provided is a free, open-access New Testament translation from the World English Bible; it is not a direct translation from the Coptic.
Note: we know that our server is slow generating the documents for this corpus. It may take several minutes to load; please be patient. For faster access, use ANNIS. Visualizations to read the chapters are available by clicking on the corpus and the icon for visualizations.
We hope this corpus is useful to researchers.