ACHILLES: Ancient and Historical Language Evaluation Set
Source: NIAID Data Ecosystem. Indexed: 2026-05-02
Download link:
https://zenodo.org/record/10655060
Description:
The dataset used in the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages. The task included four problems; problems 1-3 were offered in both constrained and unconstrained tracks on CodaLab, while problem 4 was only a part of the unconstrained track.
1. POS-tagging
2. Lemmatisation
3. Morphological feature prediction
4. Mask filling
   a. Word-level
   b. Character-level
For problems 1-3, data from Universal Dependencies v.2.12 was used for Ancient Greek, Ancient Hebrew, Classical Chinese, Coptic, Gothic, medieval Icelandic, Latin, Old Church Slavonic, Old East Slavic, Old French and Vedic Sanskrit. Old Hungarian texts, annotated to the same standard as UD corpora, were added to the dataset from the MGTSZ website. In Old Hungarian data, tokens which were POS-tagged PUNCT were altered so that the form matched the lemma to simplify complex punctuation marks used to approximate manuscript symbols; otherwise, no characters were changed.
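The punctuation normalisation described for the Old Hungarian data can be sketched as follows. This is a minimal illustration, not the organisers' script: it assumes the data is stored in the CoNLL-U format used by UD treebanks (ten tab-separated columns, with FORM, LEMMA and UPOS in columns 2 to 4), and the function name `normalise_punct` is hypothetical.

```python
def normalise_punct(conllu_line: str) -> str:
    """Copy LEMMA into FORM for tokens tagged PUNCT; return other lines unchanged.

    Assumes CoNLL-U token lines with 10 tab-separated columns:
    ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
    """
    # Blank lines separate sentences; lines starting with '#' are comments.
    if not conllu_line.strip() or conllu_line.startswith("#"):
        return conllu_line

    newline = "\n" if conllu_line.endswith("\n") else ""
    cols = conllu_line.rstrip("\n").split("\t")

    # Only touch well-formed token lines whose UPOS is PUNCT.
    if len(cols) != 10 or cols[3] != "PUNCT":
        return conllu_line

    cols[1] = cols[2]  # FORM := LEMMA
    return "\t".join(cols) + newline


if __name__ == "__main__":
    # Hypothetical token line: a manuscript-style mark lemmatised as a full stop.
    line = "3\t⸫\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_"
    print(normalise_punct(line))
```

Only the FORM column of PUNCT tokens is rewritten, which matches the statement that no other characters were changed.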
The ISO 639-3 standard does not distinguish between historical stages of Latin as it does for some other languages, such as Irish. Because it was desirable to approximate this distinction, we further split the Latin data into two datasets: Classical and Late Latin, and Medieval Latin. This split was dictated by the composition of the Perseus and PROIEL treebanks that served as sources for the Latin UD treebanks.
Historical forms of Irish were only included in mask filling challenges (problem 4), as the quantity of historical Irish text data which has been tokenised and annotated to a single standard to date is insufficient for the purpose of training models to perform morphological analysis tasks. The texts were drawn from CELT, Corpas Stairiúil na Gaeilge, and digital editions of the St. Gall glosses and the Würzburg glosses. Each Irish text taken from CELT is labelled "Old", "Middle" or "Early Modern" in accordance with the language labels provided in CELT metadata. Because CELT metadata relating to language stages and text dating relies on information provided by a variety of different editors of earlier print editions, this metadata can be inconsistent across the corpus and on occasion inaccurate. To mitigate complications arising from this, texts drawn from CELT were included in the dataset only if they had a single Irish language label and if the dates provided in CELT metadata for the text matched the expected dates for the given period in the history of the Irish language.
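The inclusion rule for CELT texts can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `TextRecord` fields do not reflect CELT's actual metadata schema, "matching the expected dates" is interpreted as the text's date range falling inside the period's range, and the period boundaries in the usage example are placeholders rather than the values used by the organisers.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    """Hypothetical metadata record for one text (not CELT's real schema)."""
    title: str
    language_labels: list[str]  # e.g. ["Old"] or ["Old", "Middle"]
    date_from: int              # earliest year given in the metadata
    date_to: int                # latest year given in the metadata


def keep_text(record: TextRecord, period_ranges: dict[str, tuple[int, int]]) -> bool:
    """Keep a text only if it has exactly one language label and its dates
    fall inside the range expected for that label."""
    if len(record.language_labels) != 1:
        return False
    label = record.language_labels[0]
    if label not in period_ranges:
        return False
    start, end = period_ranges[label]
    return start <= record.date_from and record.date_to <= end


if __name__ == "__main__":
    # Placeholder boundaries for illustration only.
    ranges = {"Old": (600, 900), "Middle": (900, 1200), "Early Modern": (1200, 1700)}
    texts = [
        TextRecord("single label, consistent dates", ["Old"], 750, 800),
        TextRecord("two labels", ["Old", "Middle"], 850, 950),
        TextRecord("label and dates disagree", ["Old"], 1100, 1150),
    ]
    for text in texts:
        print(text.title, keep_text(text, ranges))
```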
The upper temporal boundary was set at 1700 CE, and texts created later than this date were not included in the dataset. This date was chosen because most of the historical language data used in word embedding research dates from the 18th century CE or later, and our intention was to focus on the more challenging and as yet unaddressed data. The resulting datasets for each language were then shuffled at the sentence level and split into training, validation and test subsets at the ratio of 0.8 : 0.1 : 0.1.
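A sentence-level shuffle followed by a 0.8 : 0.1 : 0.1 split could look like the sketch below. The random seed, the rounding rule for sizes that do not divide evenly, and the function name `split_sentences` are assumptions; the source does not say how these details were handled.

```python
import random


def split_sentences(sentences, seed=0):
    """Shuffle sentences and split them 0.8 : 0.1 : 0.1 into train, valid, test.

    Integer arithmetic is used for the subset sizes; any remainder
    goes to the test subset (an assumption of this sketch).
    """
    shuffled = list(sentences)             # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


if __name__ == "__main__":
    data = [f"sentence {i}" for i in range(100)]
    train, valid, test = split_sentences(data)
    print(len(train), len(valid), len(test))
```

Splitting after the shuffle means sentences from the same source text can end up in different subsets, which is what a sentence-level shuffle implies.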
A detailed list of text sources for each language in the dataset, as well as other metadata and the description of data formats used for each problem, is provided on the Shared Task's GitHub. The structure of the dataset is as follows:
📂 morphology (data for problems 1-3)
├── 📂 test
│   ├── 📂 ref (reference data used in CodaLab competitions)
│   │   ├── 📂 lemmatisation
│   │   ├── 📂 morph_features
│   │   └── 📂 pos_tagging
│   └── 📂 src (source test data with labels)
├── 📂 train
└── 📂 valid
📂 fill_mask_word (data for problem 4a)
├── 📂 test
│   ├── 📂 ref (reference data used in CodaLab competitions)
│   └── 📂 src (source test data with labels in 2 different formats)
│       ├── 📂 json
│       └── 📂 tsv
├── 📂 train (train data in 2 different formats)
│   ├── 📂 json
│   └── 📂 tsv
└── 📂 valid (validation data in 2 different formats)
    ├── 📂 json
    └── 📂 tsv
📂 fill_mask_char (data for problem 4b)
├── 📂 test
│   ├── 📂 ref (reference data used in CodaLab competitions)
│   └── 📂 src (source test data with labels in 2 different formats)
│       ├── 📂 json
│       └── 📂 tsv
├── 📂 train (train data in 2 different formats)
│   ├── 📂 json
│   └── 📂 tsv
└── 📂 valid (validation data in 2 different formats)
    ├── 📂 json
    └── 📂 tsv
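After downloading and unpacking the archive, the directory layout listed above can be checked with a small script. This is a sketch under the assumption that the three top-level folders sit directly under the directory passed as `root`; the actual archive may wrap them in an extra folder, and the helper name `missing_dirs` is hypothetical.

```python
from pathlib import Path

# Directories expected from the structure described above.
EXPECTED_DIRS = [
    "morphology/test/ref/lemmatisation",
    "morphology/test/ref/morph_features",
    "morphology/test/ref/pos_tagging",
    "morphology/test/src",
    "morphology/train",
    "morphology/valid",
]
for task in ("fill_mask_word", "fill_mask_char"):
    EXPECTED_DIRS.append(f"{task}/test/ref")
    for fmt in ("json", "tsv"):
        EXPECTED_DIRS += [
            f"{task}/test/src/{fmt}",
            f"{task}/train/{fmt}",
            f"{task}/valid/{fmt}",
        ]


def missing_dirs(root):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # "." is a placeholder; point this at the unpacked dataset directory.
    for d in missing_dirs("."):
        print("missing:", d)
```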
We would like to thank Ekaterina Melnikova for suggesting the name for the dataset.
Created:
2024-05-29



