LatinISE test data for SemEval 2020 task 1 with additional token versions of the corpora

NIAID Data Ecosystem, indexed 2026-03-11

Download link:

https://zenodo.org/record/3674098


Resource description:

This data collection contains the Latin test data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection:

- a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`)
- 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)
- the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`)

The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second homonym, cf. Lewis-Short dictionary.

__Corpus 1__

- based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine
- language: Latin
- time covered: from the beginning of the second century before Christ (BC) to the end of the first century BC
- size: ~1.7 million tokens
- format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled
- encoding: UTF-8

__Corpus 2__

- based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine
- language: Latin
- time covered: from the beginning of the first century after Christ (AD) to the end of the twenty-first century AD
- size: ~9.4 million tokens
- format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled
- encoding: UTF-8

Find more information on the data in the papers referenced below.

Besides the official lemma version of the corpora for SemEval-2020 Task 1 we also provide the raw token version (`corpus1/token/`, `corpus2/token/`). It contains the raw sentences in the same order as in the lemma version. Find more information on the data and SemEval-2020 Task 1 in the paper referenced below.
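The homonym convention and the lemmatized format described above can be illustrated with a minimal Python sketch. It assumes that each corpus line holds one sentence with lemmas separated by whitespace, which the description does not state explicitly. The helper names `split_homonym` and `count_targets` and the sample sentences are hypothetical and are not taken from the data.

```python
from collections import Counter
from typing import Iterable, Tuple


def split_homonym(lemma: str) -> Tuple[str, int]:
    """Split 'dico#2' into ('dico', 2); a lemma without '#' is homonym 1."""
    base, sep, number = lemma.partition("#")
    if sep and number.isdigit():
        return base, int(number)
    return lemma, 1


def count_targets(sentences: Iterable[str], targets: Iterable[str]) -> Counter:
    """Count each target lemma in sentences (assumed whitespace-separated)."""
    wanted = set(targets)
    counts = Counter()
    for sentence in sentences:
        for lemma in sentence.split():
            if lemma in wanted:
                counts[lemma] += 1
    return counts


# Made-up sample sentences for illustration only, not corpus data.
sample = ["dico#2 templum deus", "dico rex dico"]

print(split_homonym("dico#2"))
print(count_targets(sample, ["dico", "dico#2"]))
```

Keeping 'dico' and 'dico#2' as distinct keys when counting matches the description, which treats the numbered forms as separate lemmas.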
The creation of the data was supported by the CRETA center and the CLARIN-D grant funded by the German Ministry for Education and Research (BMBF).

__References__

- Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020.
- McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.


Created:

2020-08-21
