LatinISE test data for SemEval 2020 task 1 with additional token versions of the corpora
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/3674098
Description:
This data collection contains the Latin test data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection:
- a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`)
- 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)
- the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`)
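As a rough sketch of how predictions might be compared against the annotated scores in `truth/`, the snippet below computes accuracy for binary scores (subtask 1) and Spearman's rank correlation for graded scores (subtask 2). The tab-separated `target<TAB>score` line layout, the function names, and the placeholder targets are assumptions made for illustration, not details given in this description.

```python
from statistics import mean


def parse_scores(lines):
    """Parse 'target<TAB>score' lines into a dict (this layout is an assumption)."""
    scores = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        target, value = line.split("\t")
        scores[target] = float(value)
    return scores


def accuracy(gold, pred):
    """Share of targets whose predicted binary score equals the annotated one."""
    return mean(1.0 if gold[t] == pred[t] else 0.0 for t in gold)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(gold, pred):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    targets = sorted(gold)
    x = average_ranks([gold[t] for t in targets])
    y = average_ranks([pred[t] for t in targets])
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Placeholder targets and scores, invented purely to exercise the functions.
gold_graded = parse_scores(["lemma_a\t0.1", "lemma_b\t0.5", "lemma_c\t0.9"])
pred_graded = {"lemma_a": 0.2, "lemma_b": 0.3, "lemma_c": 0.8}
print(spearman(gold_graded, pred_graded))
```

Because the functions take any iterable of lines, an open file handle on a truth file can be passed to `parse_scores` directly once the real layout has been checked.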
The corpus data have been automatically lemmatized and part-of-speech tagged, and partially corrected by hand. For homonyms, the lemma is followed by the '#' symbol and the homonym number from the Lewis-Short dictionary of Latin whenever that number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' to the second.
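The homonym convention can be unpacked with a small helper. This is a minimal sketch; the function name `split_homonym` is hypothetical, and it assumes that '#' appears in a lemma only as the homonym separator.

```python
def split_homonym(lemma: str) -> tuple[str, int]:
    """Split a lemma such as 'dico#2' into its base form and homonym number.

    A lemma without '#' is treated as homonym 1, following the convention
    that the number is only written when it is greater than 1.
    """
    base, separator, number = lemma.partition("#")
    return (base, int(number) if separator else 1)


print(split_homonym("dico"))    # ('dico', 1)
print(split_homonym("dico#2"))  # ('dico', 2)
```

Keeping the homonym number as part of the key matters when counting or embedding targets, since 'dico' and 'dico#2' are distinct dictionary entries.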
__Corpus 1__
- based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine
- language: Latin
- time covered: from the beginning of the second century before Christ (BC) to the end of the first century BC
- size: ~1.7 million tokens
- format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled
- encoding: UTF-8
__Corpus 2__
- based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine
- language: Latin
- time covered: from the beginning of the first century after Christ (AD) to the end of the twenty-first century AD
- size: ~9.4 million tokens
- format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled
- encoding: UTF-8
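Given the format described for both corpora, a first sanity check is to count how often each target lemma occurs in each period. The sketch below assumes one sentence per line with whitespace-separated lemmas, which this description does not state explicitly; the function name and the toy sentences are likewise illustrative only.

```python
from collections import Counter


def count_targets(sentences, targets):
    """Count target lemmas and total tokens in an iterable of sentence strings.

    Assumes one sentence per line and whitespace-separated lemmas.
    """
    counts = Counter()
    total_tokens = 0
    for sentence in sentences:
        tokens = sentence.split()
        total_tokens += len(tokens)
        counts.update(token for token in tokens if token in targets)
    return counts, total_tokens


# Toy input, invented for illustration; real input would be the lines of a corpus file.
toy_corpus = ["dico res publicus", "res dico#2 dico"]
counts, total = count_targets(toy_corpus, {"dico", "res"})
print(counts["dico"], counts["res"], total)  # 2 2 6
```

Note that 'dico#2' is deliberately not counted as 'dico', since the two are different homonyms. Because the corpora differ in size (~1.7 million versus ~9.4 million tokens), dividing each count by the total makes frequencies comparable across the two periods.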
Besides the official lemma version of the corpora for SemEval-2020 Task 1, we also provide the raw token version (`corpus1/token/`, `corpus2/token/`). It contains the raw sentences in the same order as in the lemma version. Find more information on the data and SemEval-2020 Task 1 in the papers referenced below.
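Since the token version keeps the sentences in the same order as the lemma version, the two can be read in parallel to recover the raw context of a target lemma. The following is a minimal sketch under the assumption that both versions store one sentence per line; only sentence-level alignment is relied on, not a token-by-token correspondence. The function name and sample sentences are made up for illustration.

```python
def raw_contexts(lemma_lines, token_lines, target):
    """Yield raw sentences whose lemmatized counterpart contains `target`.

    Both inputs are iterables of sentence strings in the same order.
    """
    for lemma_sentence, token_sentence in zip(lemma_lines, token_lines):
        if target in lemma_sentence.split():
            yield token_sentence.strip()


# Toy sentence pair lists, invented for illustration.
lemmas = ["dico res", "publicus sum"]
tokens = ["dixit rem", "publica est"]
print(list(raw_contexts(lemmas, tokens, "dico")))  # ['dixit rem']
```

Passing two open file handles instead of lists streams both versions without loading either corpus into memory.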
The creation of the data was supported by the CRETA center and the CLARIN-D grant funded by the German Ministry for Education and Research (BMBF).
References
Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020.
McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
Created: 2020-08-21



