
Early Irish Analogy Dataset for Word Embedding Evaluation

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/10652308
Description:
An embedding evaluation dataset for Early Irish, described in the paper "Do not Trust the Experts: How the Lack of Standard Complicates NLP for Historical Irish". Traditionally, analogy datasets are based on pairwise semantic proportion, so every question has a single correct answer. Given the high level of variation in historical languages, such a strict definition of a correct answer seems unjustified. The Early Irish Analogy Dataset therefore follows the Bigger Analogy Test Set (BATS) and provides several correct answers to each analogy question. Morphological and spelling variation data are extracted from eDIL, a historical dictionary of medieval Irish. Unlike BATS, the dataset makes no distinction between inflection types, owing to eDIL's structure. The raw data amounted to 2,370 spelling variation questions and 9,690 morphological variation questions; 150 examples were randomly selected for each of these two subsets so that they are comparable in size to the synonym and antonym subsets. The synonym and antonym subsets are translations of the corresponding BATS parts, obtained by reverse-searching eDIL and proofread by four expert evaluators. They contain 98 entries (synonyms) and 109 entries (antonyms), each agreed upon by three or more of the experts.
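To illustrate why several accepted answers per question matter, here is a minimal sketch of BATS-style scoring in which a prediction counts as correct if it matches any of the listed answers. It is not the evaluation code from the paper: the 3CosAdd prediction rule is swapped in as a common default, the function names (`cosine`, `predict`, `accuracy`) are hypothetical, and the two-dimensional vectors and placeholder words (`a`, `a_var`, `b_var1`, ...) are invented toy values, not entries from the dataset.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def predict(emb, a, a_star, b):
    """3CosAdd: the word closest to b - a + a_star, excluding the query words."""
    target = [vb - va + vs for va, vb, vs in zip(emb[a], emb[b], emb[a_star])]
    candidates = [w for w in emb if w not in (a, a_star, b)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

def accuracy(emb, questions):
    """Share of questions whose prediction is in the set of accepted answers."""
    hits = sum(
        1 for a, a_star, b, answers in questions
        if predict(emb, a, a_star, b) in answers
    )
    return hits / len(questions) if questions else 0.0

# Toy embeddings (assumed values, for illustration only).
emb = {
    "a": [1.0, 0.0],
    "a_var": [1.0, 1.0],
    "b": [3.0, 0.0],
    "b_var1": [3.0, 1.1],
    "b_var2": [3.0, 0.5],
    "other": [-1.0, 2.0],
}

# The same question scored with several accepted answers and with only one.
multi_answer = [("a", "a_var", "b", {"b_var1", "b_var2"})]
single_answer = [("a", "a_var", "b", {"b_var2"})]

multi_score = accuracy(emb, multi_answer)
single_score = accuracy(emb, single_answer)
```

In this toy setup the model predicts `b_var1`, so the question is a hit when both variants are accepted and a miss when only `b_var2` is, even though both are plausible forms of the same word.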
Created:
2024-05-29