Early Irish Analogy Dataset for Word Embedding Evaluation

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/10652308

Description:

An embedding evaluation dataset for Early Irish, described in the paper "Do not Trust the Experts: How the Lack of Standard Complicates NLP for Historical Irish". Traditionally, analogy datasets are based on pairwise semantic proportion, so every question has a single correct answer. Given the high level of variation in historical languages, such a strict definition of a correct answer seems unjustified. The Early Irish Analogy Dataset therefore follows the Bigger Analogy Test Set (BATS) and provides several correct answers to each analogy question. Morphological and spelling variation data are extracted from the eDIL, a historical dictionary of medieval Irish. Unlike BATS, the dataset makes no distinction between inflection types, owing to the eDIL's structure. The raw data amounted to 2,370 spelling variation questions and 9,690 morphological variation questions; 150 examples were randomly selected for each of these two subsets so that they are comparable in size to the synonym and antonym subsets. The synonym and antonym subsets are translations of the corresponding BATS parts, obtained by reverse-searching the eDIL and proofread by four expert evaluators. The dataset includes 98 entries in the synonym subset and 109 entries in the antonym subset, each agreed upon by three or more experts.
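The difference between single-answer and multi-answer scoring can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the description does not specify a file format or a prediction method, so the sketch uses made-up placeholder tokens and 2-dimensional vectors, and it predicts with the common vector-offset heuristic (b - a + c, nearest neighbour by cosine). The function names `predict_offset` and `multi_answer_accuracy` are hypothetical and not part of the dataset or of any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def predict_offset(vectors, a, b, c):
    """Return the word nearest to b - a + c, excluding the three query words.

    Assumption: the vector-offset heuristic is used here only as an example
    predictor; the dataset description does not prescribe one.
    """
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def multi_answer_accuracy(vectors, questions):
    """Share of questions whose prediction is in the set of accepted answers."""
    if not questions:
        return 0.0
    hits = sum(
        1
        for a, b, c, accepted in questions
        if predict_offset(vectors, a, b, c) in accepted
    )
    return hits / len(questions)


# Placeholder tokens and toy vectors (not real Early Irish data).
toy_vectors = {
    "a1": [1.0, 0.0],
    "a2": [1.0, 1.0],
    "b1": [2.0, 0.0],
    "b2": [2.0, 1.0],
    "b2_variant": [2.1, 1.0],  # stands in for a spelling variant of "b2"
    "noise": [-1.0, -1.0],
}

# BATS-style question: several answers are accepted.
lenient = [("a1", "a2", "b1", {"b2", "b2_variant"})]
# Strict question: only one variant is accepted.
strict = [("a1", "a2", "b1", {"b2_variant"})]

lenient_score = multi_answer_accuracy(toy_vectors, lenient)
strict_score = multi_answer_accuracy(toy_vectors, strict)
```

In this toy setup the model predicts `b2`, so the question counts as correct when both variants are accepted and as wrong when only `b2_variant` is. That is the situation the multi-answer design is meant to avoid penalising.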

Created:

2024-05-29
