alvations/pywsd-datasets
Source: Hugging Face (updated 2026-04-17; indexed 2026-04-26)
Download link: https://hf-mirror.com/datasets/alvations/pywsd-datasets
---
pretty_name: pywsd-datasets
license: mit
task_categories:
- token-classification
language:
- en
tags:
- word-sense-disambiguation
- wsd
- wordnet
- oewn
- semcor
- semeval
- senseval
configs:
- config_name: en-senseval2-aw
  data_files:
  - split: test
    path: data/en-senseval2-aw/test.parquet
- config_name: en-senseval3-aw
  data_files:
  - split: test
    path: data/en-senseval3-aw/test.parquet
- config_name: en-semeval2007-aw
  data_files:
  - split: test
    path: data/en-semeval2007-aw/test.parquet
- config_name: en-semeval2013-aw
  data_files:
  - split: test
    path: data/en-semeval2013-aw/test.parquet
- config_name: en-semeval2015-aw
  data_files:
  - split: test
    path: data/en-semeval2015-aw/test.parquet
- config_name: en-semcor
  data_files:
  - split: train
    path: data/en-semcor/train.parquet
- config_name: en-wngt
  data_files:
  - split: train
    path: data/en-wngt/train.parquet
- config_name: en-masc
  data_files:
  - split: train
    path: data/en-masc/train.parquet
- config_name: en-senseval2_ls
  data_files:
  - split: train
    path: data/en-senseval2_ls/train.parquet
  - split: test
    path: data/en-senseval2_ls/test.parquet
- config_name: en-senseval3_ls
  data_files:
  - split: train
    path: data/en-senseval3_ls/train.parquet
  - split: test
    path: data/en-senseval3_ls/test.parquet
- config_name: en-semeval2007_t17_ls
  data_files:
  - split: test
    path: data/en-semeval2007_t17_ls/test.parquet
---
# pywsd-datasets
Unified Word Sense Disambiguation benchmark datasets, normalized to **modern
`wn` lexicon sense IDs** (`oewn:2024` for English, OMW for other languages).
Companion to [pywsd](https://pypi.org/project/pywsd/) ≥ 1.3.0.
## What's shipped (v0.2)
**English, test-only Raganato all-words benchmark:**
| Config | Instances | OEWN 2024 coverage |
|-----------------------|-----------|--------------------|
| `en-senseval2-aw` | 2,282 | 99.43 % |
| `en-senseval3-aw` | 1,850 | 99.51 % |
| `en-semeval2007-aw` | 455 | 99.78 % |
| `en-semeval2013-aw` | 1,644 | 100.00 % |
| `en-semeval2015-aw` | 1,022 | 99.32 % |
**English, training and lexical-sample corpora (via UFSAC v2.1):**
| Config                    | Split        | OEWN 2024 coverage                   |
|---------------------------|--------------|--------------------------------------|
| `en-semcor`               | train        | see coverage_report                  |
| `en-wngt`                 | train        | see coverage_report                  |
| `en-masc`                 | train        | see coverage_report                  |
| `en-senseval2_ls`         | train + test | see coverage_report (lexical-sample) |
| `en-senseval3_ls`         | train + test | see coverage_report (lexical-sample) |
| `en-semeval2007_t17_ls`   | test         | see coverage_report (lexical-sample) |
Run `python -m pywsd_datasets.scripts.coverage_report` locally to get
up-to-date OEWN resolution rates after rebuilding.
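As a rough illustration of what the coverage figures measure, the sketch below computes the share of rows whose `sense_ids_wordnet` list is non-empty. It assumes rows are plain dicts carrying the fields documented in this README. `oewn_coverage` is a hypothetical helper, not part of `pywsd_datasets`, and the packaged `coverage_report` script may compute its numbers differently.

```python
from typing import Iterable, Mapping


def oewn_coverage(rows: Iterable[Mapping]) -> float:
    """Return the percentage of rows with at least one resolved sense ID."""
    total = 0
    resolved = 0
    for row in rows:
        total += 1
        # Assumption: unresolved rows carry an empty (or missing) list.
        if row.get("sense_ids_wordnet"):
            resolved += 1
    return 100.0 * resolved / total if total else 0.0


# Toy rows for illustration only (not real dataset content).
sample_rows = [
    {"instance_id": "a", "sense_ids_wordnet": ["oewn-05646832-n"]},
    {"instance_id": "b", "sense_ids_wordnet": []},
    {"instance_id": "c", "sense_ids_wordnet": ["oewn-05646832-n"]},
    {"instance_id": "d", "sense_ids_wordnet": ["oewn-05646832-n"]},
]
print(f"{oewn_coverage(sample_rows):.2f} %")  # 75.00 %
```

Because iterating a Hugging Face `Dataset` yields one dict per row, the same helper should also accept a loaded split directly, for example `oewn_coverage(ds["test"])`.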
## Install
```bash
pip install pywsd-datasets
```
## Use via HuggingFace `datasets`
```python
from datasets import load_dataset

# Raganato all-words evaluation set (exposes a "test" split only)
ds = load_dataset("alvations/pywsd-datasets", "en-senseval2-aw")

# SemCor training data (exposes a "train" split only); kept in a separate
# variable so the example row below still comes from Senseval-2
semcor = load_dataset("alvations/pywsd-datasets", "en-semcor")

# Works for either kind of config: pick whichever split is present
ds["test"][0] if "test" in ds else ds["train"][0]
# {'instance_id': 'd000.s000.t000', 'dataset': 'senseval2_aw',
#  'split': 'test', 'lang': 'en',
#  'tokens': ['The', 'art', 'of', 'change-ringing', ...],
#  'target_idx': 1, 'target_lemma': 'art', 'target_pos': 'n',
#  'source_sense_id': 'art%1:09:00::',
#  'source_sense_system': 'pwn_sensekey_3.0',
#  'sense_ids_wordnet': ['oewn-05646832-n'],
#  'wordnet_lexicon': 'oewn:2024', ...}
```
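To show how `tokens` and `target_idx` fit together, here is a minimal sketch that brackets the target word inside its sentence. It assumes `target_idx` is a 0-based index into `tokens`, which matches the example row above (`target_idx` 1 points at `'art'`). `mark_target` is a hypothetical helper, not something shipped by the package.

```python
from typing import Mapping


def mark_target(row: Mapping, left: str = "[", right: str = "]") -> str:
    """Join the tokens into a sentence with the target word bracketed."""
    tokens = list(row["tokens"])  # copy so the input row is not mutated
    idx = row["target_idx"]
    tokens[idx] = f"{left}{tokens[idx]}{right}"
    return " ".join(tokens)


# Row abbreviated from the example output above; the real token list is longer.
row = {
    "tokens": ["The", "art", "of", "change-ringing"],
    "target_idx": 1,
    "target_lemma": "art",
}
print(mark_target(row))  # The [art] of change-ringing
```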
## Use via the loader package
```python
from pywsd_datasets.loaders.raganato import iter_instances as iter_raganato
from pywsd_datasets.loaders.ufsac import iter_instances as iter_ufsac

for inst in iter_raganato("senseval2"):
    print(inst.target_lemma, inst.sense_ids_wordnet)

for inst in iter_ufsac("semcor", "/path/to/ufsac-public-2.1"):
    print(inst.target_lemma, inst.sense_ids_wordnet)
```
## Rebuild locally
```bash
# Quote the extras spec so shells such as zsh do not glob the brackets
pip install 'pywsd-datasets[dev]'

# Raganato only (always works, ~1 MB fetch from our GH release mirror)
python -m pywsd_datasets.scripts.build_all

# With UFSAC corpora: download ufsac-public-2.1 separately (see below)
python -m pywsd_datasets.scripts.build_all \
    --ufsac-root ~/.cache/pywsd-datasets/ufsac/ufsac-public-2.1

# Coverage report across every built parquet:
python -m pywsd_datasets.scripts.coverage_report
```
### UFSAC download
UFSAC v2.1 is distributed as a single Google Drive tarball
(`ufsac-public-2.1.tar.xz`, ~196 MB). Fetch with `gdown`:
```bash
pip install gdown
mkdir -p ~/.cache/pywsd-datasets/ufsac
gdown 'https://drive.google.com/uc?id=1kwBMIDBTf6heRno9bdLvF-DahSLHIZyV' \
-O ~/.cache/pywsd-datasets/ufsac/ufsac-public-2.1.tar.xz
cd ~/.cache/pywsd-datasets/ufsac && tar -xf ufsac-public-2.1.tar.xz
```
## Schema
Every row follows [`WSDInstance`](src/pywsd_datasets/schema.py):
```
instance_id, dataset, split, task, lang,
tokens[], pos_tags[], lemmas[],
target_idx, target_lemma, target_pos,
source_sense_id, source_sense_system,
sense_ids_wordnet[], wordnet_lexicon,
doc_id, sent_id
```
`sense_ids_wordnet` is list-valued to handle multi-gold instances and any
PWN-3.0 key that splits into multiple OEWN 2024 synsets.
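One consequence of the list-valued field is that a prediction can reasonably count as correct when it matches any of the gold IDs. The sketch below illustrates that idea under stated assumptions: `wsd_accuracy` is a hypothetical helper, the sense IDs are made-up placeholders, and skipping rows with no resolved gold ID is a choice made here for illustration, not the scoring rule of the original benchmarks.

```python
from typing import Mapping, Sequence


def wsd_accuracy(rows: Sequence[Mapping], predictions: Mapping[str, str]) -> float:
    """Fraction of scorable rows whose prediction is among the gold IDs."""
    # Rows with an empty gold list cannot be scored, so they are skipped.
    scorable = [r for r in rows if r["sense_ids_wordnet"]]
    if not scorable:
        return 0.0
    correct = sum(
        1
        for r in scorable
        if predictions.get(r["instance_id"]) in r["sense_ids_wordnet"]
    )
    return correct / len(scorable)


# Placeholder rows and IDs, for illustration only.
rows = [
    {"instance_id": "t0", "sense_ids_wordnet": ["oewn-00000001-n", "oewn-00000002-n"]},
    {"instance_id": "t1", "sense_ids_wordnet": ["oewn-00000003-v"]},
    {"instance_id": "t2", "sense_ids_wordnet": []},
]
predictions = {
    "t0": "oewn-00000002-n",  # matches the second gold ID
    "t1": "oewn-00000009-v",  # wrong
    "t2": "oewn-00000004-n",  # ignored: no gold ID to compare against
}
print(wsd_accuracy(rows, predictions))  # 0.5
```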
## Multilingual / XL-WSD / BabelNet — deferred
`loaders/xl_wsd.py` exists as a stub and raises `NotImplementedError`.
`mappers/babelnet_to_wn.py` is similarly unused. **Why:**
* XL-WSD uses BabelNet synset IDs as gold labels; resolving them to
modern `wn` lexicon IDs requires the BabelNet → PWN 3.0 bridge file,
which is distributed **only with a BabelNet academic license**.
* XL-WSD itself is CC-BY-NC 4.0 — we don't redistribute the data.
Reviving this path requires (a) a BabelNet license, (b) loading
`bn_to_wn.txt` via `babelnet_to_wn.load_bn_to_pwn3_map()`, (c) selecting
per-language OMW lexicons via `mappers.omw_lookup.lexicon_for(lang)`,
then (d) chaining through `pwn3_to_oewn.pwn3_sensekey_to_wn(key, lexicon=...)`.
All four pieces are in place — wiring them is blocked on the BabelNet
mapping file. See the module docstrings for details.
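The chaining in steps (b) to (d) could look roughly like the sketch below. It does not import the real modules: the three collaborators are passed in as arguments, and their shapes (a dict from BabelNet ID to a list of PWN 3.0 sense keys, and a mapper that returns a list of IDs) are assumptions, since the actual signatures are only documented in the module docstrings. The stand-ins and the BabelNet ID are placeholders.

```python
from typing import Callable, List, Mapping


def babelnet_to_wn_ids(
    bn_id: str,
    lang: str,
    bn_to_pwn3: Mapping[str, List[str]],
    lexicon_for: Callable[[str], str],
    sensekey_to_wn: Callable[..., List[str]],
) -> List[str]:
    """Resolve one BabelNet synset ID to `wn` IDs in the language's lexicon."""
    lexicon = lexicon_for(lang)               # step (c): pick the lexicon
    resolved: List[str] = []
    for key in bn_to_pwn3.get(bn_id, []):     # step (b): BabelNet -> PWN 3.0
        for wn_id in sensekey_to_wn(key, lexicon=lexicon):  # step (d)
            if wn_id not in resolved:         # keep order, drop duplicates
                resolved.append(wn_id)
    return resolved


# Stand-ins for illustration; the real helpers live in pywsd_datasets.mappers.
def fake_lexicon_for(lang: str) -> str:
    return "oewn:2024" if lang == "en" else f"omw-{lang}"


def fake_sensekey_to_wn(key: str, lexicon: str) -> List[str]:
    known = {("art%1:09:00::", "oewn:2024"): ["oewn-05646832-n"]}
    return known.get((key, lexicon), [])


fake_map = {"bn:00000000n": ["art%1:09:00::"]}  # placeholder BabelNet ID
print(babelnet_to_wn_ids("bn:00000000n", "en", fake_map,
                         fake_lexicon_for, fake_sensekey_to_wn))
# ['oewn-05646832-n']
```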
## Roadmap
* **v0.2** (this release): Raganato all-words evaluation + UFSAC training
corpora (SemCor, WNGT, MASC, Senseval lexical-sample).
* **v0.3** (planned): WiC (CC-BY-NC — loader-only), CoarseWSD-20.
* **Deferred:** XL-WSD multilingual (needs BabelNet academic license).
## Citation
If you use these datasets please cite the original sources:
* Raganato, Camacho-Collados, Navigli (2017). *Word Sense Disambiguation:
A Unified Evaluation Framework and Empirical Comparison.* EACL.
* Vial, Lecouteux, Schwab (2018). *UFSAC: Unification of Sense Annotated
Corpora and Tools.* LREC.
* Plus the specific evaluation or training set paper (Senseval-2 / 3,
SemEval-2007 T17, SemEval-2013 T12, SemEval-2015 T13, SemCor,
WNGT/Princeton Gloss Corpus, MASC).
## License
MIT for the code. Each dataset keeps its original license — see the source
papers. Raganato bundle and SemEval shared-task data are
research-unrestricted; UFSAC is MIT.
## Sense-ID mapping details
PWN 3.0 sense keys are resolved against OEWN 2024 via
[`wn.compat.sensekey`](https://github.com/goodmami/wn). The small fraction
of keys that fail to resolve (under 1 % on each Raganato set in the table
above) are typically WN 3.0 synsets that OEWN split, merged, or removed.
Those rows ship with an empty `sense_ids_wordnet` list so the coverage
report can flag them. Background:
* Kaf (2023). *Mapping Wordnets on the Fly with Permanent Sense Keys.*
arXiv:2303.01847.
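For readers unfamiliar with the `source_sense_id` format, the sketch below splits a PWN sense key into its parts, following the standard WordNet encoding `lemma%ss_type:lex_filenum:lex_id:head_word:head_id`. `parse_sense_key` and `SenseKey` are hypothetical names used only here; the actual resolution to OEWN is done by `wn.compat.sensekey`, not by this parser.

```python
from typing import NamedTuple

# WordNet ss_type codes: noun, verb, adjective, adverb, adjective satellite
SS_TYPE_TO_POS = {1: "n", 2: "v", 3: "a", 4: "r", 5: "s"}


class SenseKey(NamedTuple):
    lemma: str
    pos: str
    lex_filenum: int
    lex_id: int
    head_word: str  # empty unless the sense is an adjective satellite
    head_id: str    # empty unless the sense is an adjective satellite


def parse_sense_key(key: str) -> SenseKey:
    """Split a sense key such as 'art%1:09:00::' into its fields."""
    lemma, _, lex_sense = key.rpartition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return SenseKey(
        lemma=lemma,
        pos=SS_TYPE_TO_POS[int(ss_type)],
        lex_filenum=int(lex_filenum),
        lex_id=int(lex_id),
        head_word=head_word,
        head_id=head_id,
    )


parsed = parse_sense_key("art%1:09:00::")
print(parsed.lemma, parsed.pos)  # art n
```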
## Known issues
* The upstream Raganato zip at `http://lcl.uniroma1.it/wsdeval/` serves a
mismatched TLS certificate; the loader therefore prefers the mirror on this
repo's GitHub release assets and falls back to the original URL over plain
HTTP.
* UFSAC v2.1 is distributed as a Google Drive tarball; the loader assumes
you have it unpacked locally. A future release may mirror it.
Provider: alvations