GWSD: A Graded Word Sense Disambiguation Dataset

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14974454


Description:

The GWSD Dataset (Graded Word Sense Disambiguation Dataset) is a sense-annotated dataset designed for studying diachronic word usage and semantic change. It contains:

- 2584 word usages from the Oxford English Dictionary (OED), and
- 2584 automatically generated word usage examples.

The automatically generated usages are produced with Janus, a fine-tuned language model trained on the OED, which allows for temporally aligned and sense-specific word usage examples spanning the historical period 1700–2010. Each usage is paired with a sense definition, and human annotators rated how well the definition expresses the meaning of the word in that particular usage (0: Cannot decide, 1: Unrelated, 2: Distantly Related, 3: Closely Related, 4: Identical).

We used Amazon Mechanical Turk to collect annotations from crowd workers based in the United States, Canada, the United Kingdom, or Australia. The dataset is particularly useful for word sense disambiguation (WSD), historical linguistics, lexical semantic change detection (LSCD), and diachronic NLP tasks.

Dataset Content

Each entry in the dataset corresponds to a word sense usage example, structured as follows:

- Text: The full sentence containing the target word.
- Start, End: Character indices marking the position of the target word in the sentence.
- Lemma: The base (root) form of the target word.
- POS Tag: Part-of-speech tag (e.g., "nn" for nouns, "vb" for verbs, "jj" for adjectives).
- Sense Definition: The dictionary-provided meaning of the word in this context.
- Text Year: The historical year in which the usage originated or for which it was generated.
- Text Source: The source or model that produced the sentence (OED or Janus).
- OED Ground Truth: The reference sense label from the Oxford English Dictionary (scale 1–4).
- Annotators: The list of human annotators who evaluated the sense correctness.
- Annotations: Scores provided by annotators, typically on a 0–4 scale.
- Annotation Time: The time (in seconds) taken by each annotator to assess the sentence.

Citation

If you use this dataset in your research, please cite the following paper:

Cassotti, P., & Tahmasebi, N. (2025). Sense-Specific Historical Word Usage Generation. TACL.
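
As a rough illustration of how the fields described above fit together, the following Python sketch models one entry, recovers the target word from its character indices, and averages the annotator scores while skipping the 0 ("Cannot decide") label. Everything specific here is an assumption made for illustration: the attribute names, the exclusive-end convention for End, the choice to drop 0 before averaging, and the example record itself, which is invented and not taken from the dataset. Check the actual file format on Zenodo before adapting it.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Usage:
    """One dataset entry; attribute names are hypothetical, not the real column names."""
    text: str               # full sentence containing the target word
    start: int              # character index where the target word begins
    end: int                # character index where it ends (assumed exclusive)
    lemma: str
    pos: str                # e.g. "nn", "vb", "jj"
    definition: str
    year: int
    source: str             # "OED" or "Janus"
    annotations: List[int]  # graded judgments on the 0-4 scale


def target_span(usage: Usage) -> str:
    """Return the surface form of the target word, assuming an exclusive End index."""
    return usage.text[usage.start:usage.end]


def mean_judgment(usage: Usage) -> Optional[float]:
    """Average the graded judgments, ignoring 0 ("Cannot decide").

    Returns None when no annotator gave a usable score.
    """
    valid = [score for score in usage.annotations if score != 0]
    return mean(valid) if valid else None


# Invented record for illustration only; it does not come from GWSD.
example = Usage(
    text="The bank of the river was steep.",
    start=4,
    end=8,
    lemma="bank",
    pos="nn",
    definition="The sloping land alongside a river.",
    year=1850,
    source="Janus",
    annotations=[4, 3, 0],
)

print(target_span(example))
print(mean_judgment(example))
```

Whether "Cannot decide" should be excluded, or handled some other way, depends on the analysis; the source does not prescribe an aggregation method.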

Created:

2025-03-05
