clarin-knext/wsd_polish_datasets
Hugging Face, updated 2024-02-11, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/clarin-knext/wsd_polish_datasets
Resource description:
---
annotations_creators:
- expert-generated
language:
- pl
language_creators:
- expert-generated
- found
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: wsd-polish-datasets
size_categories:
- 1M<n<10M
source_datasets:
- original
tags: []
task_categories:
- token-classification
task_ids:
- word-sense-disambiguation
---
# Word Sense Disambiguation Corpora for Polish
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:**
- **Repository:**
- **Paper:** https://link.springer.com/chapter/10.1007/978-3-031-08754-7_70
- **Point of Contact:** arkadiusz.janz@pwr.edu.pl
### Dataset Summary
`WSD Polish Datasets` is a comprehensive benchmark for the word sense disambiguation (WSD) classification task in Polish.
It consists of 7 distinct datasets, manually annotated with senses from the plWordNet 4.5 sense inventory. The following datasets
were annotated and included in our benchmark:
- KPWr
- KPWr-100
- Sherlock (SPEC)
- Skladnica
- WikiGlex (a subset of GLEX corpus)
- EmoGlex (a subset of GLEX corpus)
- Walenty
For more details, please check the following publication:
```
@InProceedings{10.1007/978-3-031-08754-7_70,
author="Janz, Arkadiusz
and Dziob, Agnieszka
and Oleksy, Marcin
and Baran, Joanna",
editor="Groen, Derek
and de Mulatier, Cl{\'e}llia
and Paszynski, Maciej
and Krzhizhanovskaya, Valeria V.
and Dongarra, Jack J.
and Sloot, Peter M. A.",
title="A Unified Sense Inventory for Word Sense Disambiguation in Polish",
booktitle="Computational Science -- ICCS 2022",
year="2022",
publisher="Springer International Publishing",
address="Cham",
pages="682--689",
isbn="978-3-031-08754-7"
}
```
**A new publication on Polish WSD corpora will be available soon**
### Supported Tasks and Leaderboards
The datasets support the word sense disambiguation task. We do not provide a leaderboard, but we do provide an example script for evaluating WSD models.
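The evaluation script itself is not reproduced in this card. As a rough illustration only, the sketch below scores predictions by plain accuracy over annotated tokens. The function name `wsd_accuracy`, the `(sentence_id, token_index)` keys, and the choice of accuracy as the metric are all assumptions made for this example, not the authors' script.

```python
def wsd_accuracy(gold, predicted):
    """Share of gold-annotated tokens whose predicted sense matches.

    Both arguments map a hypothetical (sentence_id, token_index) key
    to a sense label such as "pani.2.n". Tokens missing from
    `predicted` count as errors.
    """
    if not gold:
        return 0.0
    correct = sum(1 for key, sense in gold.items() if predicted.get(key) == sense)
    return correct / len(gold)


# Toy data: the sense labels other than "wpierw.1.r" and "pani.2.n" are made up.
gold = {
    ("s1", 0): "wpierw.1.r",
    ("s1", 1): "pani.2.n",
    ("s2", 0): "pani.2.n",
    ("s2", 3): "pani.1.n",
}
predicted = {
    ("s1", 0): "wpierw.1.r",
    ("s1", 1): "pani.2.n",
    ("s2", 0): "pani.2.n",
    ("s2", 3): "pani.2.n",
}
score = wsd_accuracy(gold, predicted)
```

Counting unpredicted tokens as errors keeps the score comparable between systems that abstain on different tokens.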
### Languages
Polish (`pl`)
## Dataset Structure
### Data Instances
Data are stored in JSONL format, with each text sample split into sentences.
```
{
  "text": "Wpierw pani Hudson została zerwana z łóżka, po czym odegrała się na mnie, a ja - na tobie.",
  "tokens": [
    {
      "index": 0,
      "position": [ 0, 6 ],
      "orth": "Wpierw",
      "lemma": "wpierw",
      "pos": "adv",
      "ctag": "adv"
    },
    {
      "index": 1,
      "position": [ 7, 11 ],
      "orth": "pani",
      "lemma": "pani",
      "pos": "noun",
      "ctag": "subst:nom:f:sg"
    },
    {
      "index": 2,
      "position": [ 12, 18 ],
      "orth": "Hudson",
      "lemma": "Hudson",
      "pos": "noun",
      "ctag": "subst:nom:f:sg"
    },
    {
      "index": 3,
      "position": [ 19, 26 ],
      "orth": "została",
      "lemma": "zostać",
      "pos": "verb",
      "ctag": "praet:perf:f:sg"
    },
    {
      "index": 4,
      "position": [ 27, 34 ],
      "orth": "zerwana",
      "lemma": "zerwać",
      "pos": "verb",
      "ctag": "ppas:perf:nom:f:aff:sg"
    },
    <...>
  ],
  "phrases": [
    {
      "indices": [ 10, 11 ],
      "head": 10,
      "lemma": "odegrać się"
    }
  ],
  "wsd": [
    {
      "index": 0,
      "pl_sense": "wpierw.1.r",
      "plWN_syn_id": "01a4a067-aac5-11ed-aae5-0242ac130002",
      "plWN_lex_id": "f2757c30-aac4-11ed-aae5-0242ac130002",
      "plWN_syn_legacy_id": "477654",
      "plWN_lex_legacy_id": "718454",
      "PWN_syn_id": "00102736-r",
      "bn_syn_id": "bn:00115376r",
      "mapping_relation": "synonymy"
    },
    {
      "index": 1,
      "pl_sense": "pani.2.n",
      "plWN_syn_id": "f35fb1ed-aac4-11ed-aae5-0242ac130002",
      "plWN_lex_id": "d5145565-aac4-11ed-aae5-0242ac130002",
      "plWN_syn_legacy_id": "129",
      "plWN_lex_legacy_id": "20695",
      "PWN_syn_id": "10787470-n",
      "bn_syn_id": "bn:00001530n",
      "mapping_relation": "synonymy"
    },
    <...>
  ]
}
```
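A minimal sketch of reading one such record in Python and pairing each token with its sense label. It assumes that the `index` of a `wsd` entry refers to the `index` of a token, which matches the example above but is not stated explicitly in this card. The record is a shortened, hand-built copy of the example, and `senses_by_token` is a hypothetical helper name.

```python
import json

# One JSONL line, shortened from the example above (three tokens, two sense labels).
line = json.dumps(
    {
        "text": "Wpierw pani Hudson",
        "tokens": [
            {"index": 0, "position": [0, 6], "orth": "Wpierw", "lemma": "wpierw",
             "pos": "adv", "ctag": "adv"},
            {"index": 1, "position": [7, 11], "orth": "pani", "lemma": "pani",
             "pos": "noun", "ctag": "subst:nom:f:sg"},
            {"index": 2, "position": [12, 18], "orth": "Hudson", "lemma": "Hudson",
             "pos": "noun", "ctag": "subst:nom:f:sg"},
        ],
        "phrases": [],
        "wsd": [
            {"index": 0, "pl_sense": "wpierw.1.r"},
            {"index": 1, "pl_sense": "pani.2.n"},
        ],
    },
    ensure_ascii=False,
)


def senses_by_token(record):
    """Return (orth, pl_sense or None) for every token, joined on `index`."""
    sense_of = {ann["index"]: ann["pl_sense"] for ann in record["wsd"]}
    return [(tok["orth"], sense_of.get(tok["index"])) for tok in record["tokens"]]


record = json.loads(line)
pairs = senses_by_token(record)
```

Tokens without an entry in `wsd` come back with `None`, so unannotated tokens stay visible instead of being dropped.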
### Data Fields
Description of JSON keys:
- `text`: text of the sentence
- `tokens`: list of tokens produced by tokenization
  - `index`: position of the token in the sentence
  - `position`: character span of the token (start included, end excluded)
  - `orth`: word form
  - `lemma`: lemmatised word
  - `pos`: part of speech
  - `ctag`: morphosyntactic tag
- `phrases`: list of multi-word expressions
- `wsd`: annotation labels for the WSD task
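The half-open `position` span can be checked directly with Python string slicing, and a `phrases` entry can be resolved back to its tokens. The sketch below assumes that spans count characters (not bytes) and that phrase `indices` refer to token `index` values. The record is a hand-built fragment, and `token_text` and `phrase_orths` are hypothetical helper names.

```python
# Hand-built fragment with one multi-word expression ("odegrać się").
record = {
    "text": "odegrała się",
    "tokens": [
        {"index": 0, "position": [0, 8], "orth": "odegrała", "lemma": "odegrać"},
        {"index": 1, "position": [9, 12], "orth": "się", "lemma": "się"},
    ],
    "phrases": [{"indices": [0, 1], "head": 0, "lemma": "odegrać się"}],
}


def token_text(record, token):
    """Slice the sentence with the token span: start included, end excluded."""
    start, end = token["position"]
    return record["text"][start:end]


def phrase_orths(record, phrase):
    """Return the surface forms of the tokens that make up a phrase."""
    by_index = {tok["index"]: tok for tok in record["tokens"]}
    return [by_index[i]["orth"] for i in phrase["indices"]]


# Every span should reproduce the token's surface form.
spans_ok = all(token_text(record, tok) == tok["orth"] for tok in record["tokens"])
mwe = phrase_orths(record, record["phrases"][0])
```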
### Data Splits
We do not specify an exact data split for training and evaluation. However, we suggest using GLEX and Składnica for training and the other datasets for testing.
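The suggested split could be expressed as a simple partition over corpus names. This is only a sketch: it assumes the records are already grouped in a dictionary keyed by the corpus names listed in the Dataset Summary, which says nothing about how the files are actually named or organised. `TRAIN_CORPORA` and `split_corpora` are hypothetical names.

```python
# GLEX is represented by its two subsets, WikiGlex and EmoGlex.
TRAIN_CORPORA = {"WikiGlex", "EmoGlex", "Skladnica"}


def split_corpora(corpora):
    """Partition a {corpus_name: records} mapping into (train, test)."""
    train = {name: recs for name, recs in corpora.items() if name in TRAIN_CORPORA}
    test = {name: recs for name, recs in corpora.items() if name not in TRAIN_CORPORA}
    return train, test


# Empty record lists stand in for the real data.
corpora = {
    name: []
    for name in ["KPWr", "KPWr-100", "Sherlock", "Skladnica", "WikiGlex", "EmoGlex", "Walenty"]
}
train, test = split_corpora(corpora)
```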
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection, Normalization and Post-processing
Source corpora were first pre-processed with morphosyntactic tagging and multi-word expression (MWE) recognition tools.
To tokenize and tag the datasets we used [MorphoDiTa](https://clarin-pl.eu/dspace/handle/11321/425) adapted to Polish. To recognize multi-word expressions
we applied the pattern-based matching tool [Corpus2-MWE](https://clarin-pl.eu/dspace/handle/11321/533); only MWEs from plWordNet were included. After manual annotation,
the sense indices of plWordNet 4.5 were mapped automatically to Princeton WordNet 3.0 and BabelNet 4.0 indices using plWordNet's interlingual mapping.
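The result of this mapping is visible in the `wsd` entries of each record, which carry `PWN_syn_id`, `bn_syn_id`, and `mapping_relation` alongside the plWordNet sense. A minimal sketch of collecting these links follows; `interlingual_links` is a hypothetical helper, and the use of `.get` reflects an assumption that some senses may lack a mapping.

```python
def interlingual_links(record):
    """Map each plWordNet sense to (PWN synset, BabelNet synset, relation)."""
    return {
        ann["pl_sense"]: (
            ann.get("PWN_syn_id"),
            ann.get("bn_syn_id"),
            ann.get("mapping_relation"),
        )
        for ann in record["wsd"]
    }


# The first two entries copy values from the example record in Data Instances;
# the third is a made-up sense without a mapping.
record = {
    "wsd": [
        {"index": 0, "pl_sense": "wpierw.1.r", "PWN_syn_id": "00102736-r",
         "bn_syn_id": "bn:00115376r", "mapping_relation": "synonymy"},
        {"index": 1, "pl_sense": "pani.2.n", "PWN_syn_id": "10787470-n",
         "bn_syn_id": "bn:00001530n", "mapping_relation": "synonymy"},
        {"index": 5, "pl_sense": "przyklad.9.n"},
    ]
}
links = interlingual_links(record)
```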
### Annotations
#### Annotation process
* 2+1 annotation process with inter-annotator agreement score over 0.6 PSA
* annotated with [plWordNet 4.5](http://plwordnet.pwr.wroc.pl/wordnet/)
* software: [WordNet-Loom](https://clarin-pl.eu/dspace/handle/11321/275) and [Inforex](https://clarin-pl.eu/dspace/handle/11321/13)
* both single-word and multi-word expressions annotated
* full-text sense annotation (excluding KPWr)
#### Who are the annotators?
- professional linguists from the CLARIN-PL project
### Personal and Sensitive Information
The datasets do not contain any personal or sensitive information.
## Considerations for Using the Data
### Discussion of Biases
Some datasets are biased towards the most frequent senses. Other biases have not been examined and need further analysis.
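One way to gauge the most-frequent-sense bias is to measure what share of annotated tokens carry the most frequent sense of their lemma within a corpus. The sketch below is an illustration under assumptions: it joins `wsd` entries to tokens on `index`, the function name `mfs_share` is hypothetical, and the sense label `pani.1.n` in the toy data is made up.

```python
from collections import Counter, defaultdict


def mfs_share(records):
    """Share of annotated tokens labelled with their lemma's most frequent sense."""
    counts = defaultdict(Counter)
    for rec in records:
        lemma_of = {tok["index"]: tok["lemma"] for tok in rec["tokens"]}
        for ann in rec["wsd"]:
            counts[lemma_of[ann["index"]]][ann["pl_sense"]] += 1
    total = sum(sum(c.values()) for c in counts.values())
    if total == 0:
        return 0.0
    top = sum(c.most_common(1)[0][1] for c in counts.values())
    return top / total


def toy_record(lemma, sense):
    """Build a one-token record for the demonstration."""
    return {
        "tokens": [{"index": 0, "lemma": lemma}],
        "wsd": [{"index": 0, "pl_sense": sense}],
    }


records = (
    [toy_record("pani", "pani.2.n")] * 3
    + [toy_record("pani", "pani.1.n")]
    + [toy_record("wpierw", "wpierw.1.r")]
)
share = mfs_share(records)
```

A value close to 1.0 means a most-frequent-sense baseline would already score very high on that corpus, so model scores should be read against it.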
### Other Known Limitations
* sense inventories are usually incomplete, so some word senses might be missing from plWordNet
* single-word and multi-word terms expressing novel senses (missing from plWordNet) were not marked
## Additional Information
### Dataset Curators
Arkadiusz Janz (arkadiusz.janz@pwr.edu.pl)
### Licensing Information
* KPWR-100 [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
* KPWR [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
* Walenty [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
* Sherlock [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
* Skladnica [GNU GPL 3](http://www.gnu.org/licenses/gpl-3.0.en.html)
* GLEX [plWordNet License](http://plwordnet.pwr.wroc.pl/wordnet/licence)
### Citation Information
Main source (all corpora as a unified benchmark, published here on Hugging Face):
````
@InProceedings{10.1007/978-3-031-08754-7_70,
author="Janz, Arkadiusz
and Dziob, Agnieszka
and Oleksy, Marcin
and Baran, Joanna",
editor="Groen, Derek
and de Mulatier, Cl{\'e}llia
and Paszynski, Maciej
and Krzhizhanovskaya, Valeria V.
and Dongarra, Jack J.
and Sloot, Peter M. A.",
title="A Unified Sense Inventory for Word Sense Disambiguation in Polish",
booktitle="Computational Science -- ICCS 2022",
year="2022",
publisher="Springer International Publishing",
address="Cham",
pages="682--689",
isbn="978-3-031-08754-7"
}
````
Related work
------------
KPWr-100, Składnica, SPEC
````
@article{janzresults,
title={Results of the PolEval 2020 Shared Task 3: Word Sense Disambiguation},
author={Janz, Arkadiusz and Chlebus, Joanna and Dziob, Agnieszka and Piasecki, Maciej},
journal={Proceedings of the PolEval 2020 Workshop},
pages={65--77},
year={2020}
}
````
GLEX (EmoGLEX)
````
@article{janz2017plwordnet,
title={{plWordNet} as a basis for large emotive lexicons of Polish},
author={Janz, Arkadiusz and Kocon, Jan and Piasecki, Maciej and Zasko-Zielinska, Monika},
journal={Proceedings of Human Language Technologies as a Challenge for Computer Science and Linguistics Poznan: Fundacja Uniwersytetu im. Adama Mickiewicza w Poznaniu},
pages={189--193},
year={2017}
}
````
KPWr
````
@conference{broda2012,
address = {Istanbul, Turkey},
author = {Bartosz Broda and Micha{\l} Marci{\'n}czuk and Marek Maziarz and Adam Radziszewski and Adam Wardy{\'n}ski},
booktitle = {Proceedings of LREC'12},
owner = {Marlena},
publisher = {ELRA},
timestamp = {2014.06.20},
title = {KPWr: Towards a Free Corpus of Polish},
year = {2012}
}
````
Składnica
````
@inproceedings{hajnicz-2014-lexico,
title = "Lexico-Semantic Annotation of Sk{\l}adnica Treebank by means of {PLWN} Lexical Units",
author = "Hajnicz, El{\.z}bieta",
booktitle = "Proceedings of the Seventh Global {W}ordnet Conference",
month = jan,
year = "2014",
address = "Tartu, Estonia",
publisher = "University of Tartu Press",
url = "https://aclanthology.org/W14-0104",
pages = "23--31",
}
````
Walenty
````
@inproceedings{haj:and:bar:lrec16,
author = {Hajnicz, El{\.z}bieta and Andrzejczuk, Anna and Bartosiak, Tomasz},
crossref = {lrec:16},
pages = {2625--2632},
pdf = {http://www.lrec-conf.org/proceedings/lrec2016/pdf/382_Paper.pdf},
title = {Semantic Layer of the Valence Dictionary of {P}olish \emph{{W}alenty}}
}
````
Mapping plWordNet onto Princeton WordNet
````
@inproceedings{rudnicka-etal-2021-non,
title = "A (Non)-Perfect Match: Mapping pl{W}ord{N}et onto {P}rinceton{W}ord{N}et",
author = "Rudnicka, Ewa and
Witkowski, Wojciech and
Piasecki, Maciej",
booktitle = "Proceedings of the 11th Global Wordnet Conference",
month = jan,
year = "2021",
address = "University of South Africa (UNISA)",
publisher = "Global Wordnet Association",
url = "https://aclanthology.org/2021.gwc-1.16",
pages = "137--146"
}
````
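Provider:
clarin-knext
Summary of original information
Dataset overview
Name: WSD Polish Datasets
Language: Polish (pl)
License: CC-BY-4.0
Multilinguality: monolingual
Size: 1M<n<10M
Source data: original
Task category: token classification
Task ID: word sense disambiguation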
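Dataset content
Dataset summary:
WSD Polish Datasets is a comprehensive benchmark for the word sense disambiguation (WSD) classification task in Polish. It contains 7 distinct datasets, manually annotated with senses from plWordNet 4.5. The following datasets are included:
- KPWr
- KPWr-100
- Sherlock (SPEC)
- Skladnica
- WikiGlex (a subset of the GLEX corpus)
- EmoGlex (a subset of the GLEX corpus)
- Walenty
Supported tasks and leaderboards: word sense disambiguation. No leaderboard is provided, but an example evaluation script for WSD models is available.
Data structure:
- Data instances: data are stored in JSONL format, with each text sample split into sentences.
- Data fields:
  - text: text of the sentence
  - tokens: list of tokens produced by tokenization
    - index: position of the token in the sentence
    - position: character span of the token
    - orth: word form
    - lemma: lemmatised word
    - pos: part of speech
    - ctag: morphosyntactic tag
  - phrases: list of multi-word expressions
  - wsd: annotation labels for the WSD task
Data splits: no exact split for training and evaluation is specified, but using GLEX and Składnica for training and the other datasets for testing is suggested.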
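Dataset creation
Source data: the source corpora were pre-processed with morphosyntactic tagging and multi-word expression recognition tools. MorphoDiTa was used for tokenization and tagging, and Corpus2-MWE for recognizing multi-word expressions.
Annotation process:
- 2+1 annotation process with an inter-annotator agreement score over 0.6 PSA
- annotated with plWordNet 4.5
- software: WordNet-Loom and Inforex
- both single-word and multi-word expressions annotated
- full-text sense annotation (excluding KPWr)
Annotators: professional linguists from the CLARIN-PL project.
Considerations for using the data
Discussion of biases: some datasets are biased towards the most frequent senses. Other biases have not been examined and need further analysis.
Other known limitations:
- sense inventories are usually incomplete, so some word senses might be missing from plWordNet
- single-word and multi-word terms expressing novel senses (missing from plWordNet) were not marked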



