Diorisis.duckdb
Source: NIAID Data Ecosystem. Indexed 2026-05-02.
Download link:
https://zenodo.org/record/11261145
Resource summary:
DuckDB compilation of the Diorisis Ancient Greek Corpus
Description
The Diorisis Ancient Greek Corpus was created by Barbara McGillivray and Alessandro Vatri with sponsorship and funding from the Alan Turing Institute. The original XML files are collectively available at https://www.doi.org/10.6084/m9.figshare.6187256.
An article introducing the corpus is available as: Vatri, A., & McGillivray, B. (2018). The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences, 3(1), 55-65. https://doi.org/10.1163/24523666-01000013
As the description states, the Diorisis corpus consists of "820 texts spanning between the beginnings of the AG literary tradition (Homer) and the fifth century AD, and it counts 10,206,421 words".
Rights and Permissions
The original Diorisis corpus is archived under a CC BY 4.0 international license.
The Diorisis DuckDB database of the corpus, archived here for the first time, was built and compiled by Mark G. Bilby and is archived under a CC BY-NC-ND 4.0 license. This license allows anyone to download, use, and modify the DuckDB database for analysis and queries, but not to distribute a derivative database or to use the database or its derivatives as part of a commercial product or offering. Any other rights or permissions requests, or requests for clarification, can be sent to Mark G. Bilby.
Database Structure
The database contains two tables, "document" and "word". The table structures are as follows:
TABLE document
comb_tlg_id VARCHAR PRIMARY KEY [TLG id conformed to GlauX format]
author VARCHAR [work author]
title VARCHAR [work title]
genre VARCHAR [work genre]
subgenre VARCHAR [work subgenre]
date_created VARCHAR [work date created]
sent_count INT [sentence count]
word_count INT [word count]
punct_count INT [punctuation count]
location VARCHAR [location of composition]
glaux BOOLEAN [TRUE/FALSE document also in current GlauX corpus]
TABLE word
word_key VARCHAR PRIMARY KEY [word unique id]
comb_tlg_id VARCHAR [FOREIGN KEY, TLG id conformed to GlauX format]
sent_id VARCHAR [document sentence id]
seq_id VARCHAR [document word id]
self_word_id VARCHAR [sentence word id]
self_form VARCHAR [word form]
self_lemma_id VARCHAR [word lemma id]
self_lemma VARCHAR [word lemma]
self_pos VARCHAR [word part of speech]
self_person VARCHAR [word person]
self_number VARCHAR [word number]
self_tense VARCHAR [word tense]
self_mood VARCHAR [word mood]
self_voice VARCHAR [word voice]
self_gender VARCHAR [word gender]
self_case VARCHAR [word case]
self_degree VARCHAR [word degree]
idiom VARCHAR [Greek idiom or dialect]
prosody VARCHAR [metrical structure]
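As an illustration of how the two tables relate, the following is a minimal sketch of querying the database from Python. It assumes the third-party duckdb package is installed and that the downloaded file is named Diorisis.duckdb and sits in the working directory. The helper names (open_diorisis, lemma_by_author, words_by_genre) are hypothetical, and whether self_lemma holds Unicode Greek lemmas is an assumption to verify against the data.

```python
"""Sketch: join the `word` and `document` tables of the Diorisis DuckDB file."""

# Count occurrences of one lemma per author, joining on the shared comb_tlg_id.
LEMMA_BY_AUTHOR_SQL = """
    SELECT d.author, COUNT(*) AS occurrences
    FROM word AS w
    JOIN document AS d ON d.comb_tlg_id = w.comb_tlg_id
    WHERE w.self_lemma = ?
    GROUP BY d.author
    ORDER BY occurrences DESC, d.author
    LIMIT ?
"""

# Summarise the document table: number of texts and total words per genre.
WORDS_BY_GENRE_SQL = """
    SELECT genre, COUNT(*) AS texts, SUM(word_count) AS words
    FROM document
    GROUP BY genre
    ORDER BY words DESC, genre
"""


def open_diorisis(path="Diorisis.duckdb"):
    """Open the database read-only. The default file name is an assumption."""
    import duckdb  # third-party package, imported lazily: pip install duckdb

    return duckdb.connect(path, read_only=True)


def lemma_by_author(con, lemma, limit=10):
    """Return (author, occurrences) rows for one lemma, most frequent first."""
    return con.execute(LEMMA_BY_AUTHOR_SQL, [lemma, limit]).fetchall()


def words_by_genre(con):
    """Return (genre, texts, words) rows, largest genre first."""
    return con.execute(WORDS_BY_GENRE_SQL).fetchall()
```

A possible use would be lemma_by_author(open_diorisis(), "λόγος"), which returns a list of (author, occurrences) tuples. Opening the file read-only avoids modifying the archived database by accident.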
Data Sources and Conversions
The document table values are extracted from the corresponding values in the Diorisis TEI-XML file headers, along with document-wide counts of sentences, words, and punctuation marks. The comb_tlg_id reflects a modest transformation that conforms the document ids to the convention used by the GlauX corpus, for ease of comparison and correlation. The location column is a placeholder for future location inputs for each document.
The word table values are extracted from the body of each TEI-XML file. Beta Code characters in the "form" field were transformed to Unicode (UTF-8) using the perseids-tools beta-code-py package (https://github.com/perseids-tools/beta-code-py). Token attributes were typically renamed to conform to the naming conventions used by Robert and Vanessa Gorman for syntactical treebanks of classical Greek texts.
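To make the Beta Code to Unicode step concrete, here is a simplified, standard-library-only sketch of the idea. It is not the beta-code-py implementation and was not used to build this database. It handles only lowercase letters and the basic diacritics, and the function name beta_to_unicode is hypothetical.

```python
import unicodedata

# Beta Code letter -> Greek lowercase letter (simplified subset, no capitals).
LETTERS = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "z": "ζ",
    "h": "η", "q": "θ", "i": "ι", "k": "κ", "l": "λ", "m": "μ",
    "n": "ν", "c": "ξ", "o": "ο", "p": "π", "r": "ρ", "s": "σ",
    "t": "τ", "u": "υ", "f": "φ", "x": "χ", "y": "ψ", "w": "ω",
}

# Beta Code diacritic -> Unicode combining mark.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def beta_to_unicode(beta):
    """Convert lowercase Beta Code to precomposed (NFC) Unicode Greek."""
    text = beta.lower()
    out = []
    for i, ch in enumerate(text):
        if ch == "s":
            # Sigma takes its final form unless another letter follows it.
            following = text[i + 1 : i + 2]
            out.append("σ" if following in LETTERS else "ς")
        elif ch in LETTERS:
            out.append(LETTERS[ch])
        elif ch in DIACRITICS:
            out.append(DIACRITICS[ch])
        else:
            out.append(ch)  # spaces, punctuation, unsupported symbols
    # Compose base letters and combining marks into single code points.
    return unicodedata.normalize("NFC", "".join(out))
```

For example, beta_to_unicode("xai=re w)= ko/sme") yields "χαῖρε ὦ κόσμε". Emitting combining marks and then normalizing to NFC keeps the mapping tables small, because the precomposed accented forms do not need to be listed one by one.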
Disclaimer
DATA AND ANY RELATED SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE AND DATA OR THE USE OF OR OTHER DEALINGS IN THE SOFTWARE OR DATA.
Created:
2024-05-25



