clarin-knext/wsd_polish_datasets

Name: clarin-knext/wsd_polish_datasets
Creator: clarin-knext
Published: 2024-02-11 16:34:17
License: No description available

Hugging Face | Updated 2024-02-11 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/clarin-knext/wsd_polish_datasets

Resource description:

---
annotations_creators:
- expert-generated
language:
- pl
language_creators:
- expert-generated
- found
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: wsd-polish-datasets
size_categories:
- 1M<n<10M
source_datasets:
- original
tags: []
task_categories:
- token-classification
task_ids:
- word-sense-disambiguation
---

# Word Sense Disambiguation Corpora for Polish

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:** https://link.springer.com/chapter/10.1007/978-3-031-08754-7_70
- **Point of Contact:** arkadiusz.janz@pwr.edu.pl

### Dataset Summary

`WSD Polish Datasets` is a comprehensive benchmark for the word sense disambiguation (WSD) classification task in Polish. It consists of 7 distinct datasets, manually annotated with senses from the plWordNet 4.5 sense inventory.
The following datasets were annotated and included in our benchmark:
- KPWr
- KPWr-100
- Sherlock (SPEC)
- Skladnica
- WikiGlex (a subset of GLEX corpus)
- EmoGlex (a subset of GLEX corpus)
- Walenty

For more details, please check the following publication:

```
@InProceedings{10.1007/978-3-031-08754-7_70,
  author="Janz, Arkadiusz and Dziob, Agnieszka and Oleksy, Marcin and Baran, Joanna",
  editor="Groen, Derek and de Mulatier, Cl{\'e}llia and Paszynski, Maciej and Krzhizhanovskaya, Valeria V. and Dongarra, Jack J. and Sloot, Peter M. A.",
  title="A Unified Sense Inventory for Word Sense Disambiguation in Polish",
  booktitle="Computational Science -- ICCS 2022",
  year="2022",
  publisher="Springer International Publishing",
  address="Cham",
  pages="682--689",
  isbn="978-3-031-08754-7"
}
```

**A new publication on Polish WSD corpora will be available soon**

### Supported Tasks and Leaderboards

Word sense disambiguation task. We do not provide a leaderboard. However, we provide an example evaluation script for evaluating WSD models.

### Languages

Polish language, PL

## Dataset Structure

### Data Instances

Data are structured in JSONL format; each text sample is split into sentences, one record per sentence.
```
{
  "text": "Wpierw pani Hudson została zerwana z łóżka, po czym odegrała się na mnie, a ja - na tobie.",
  "tokens": [
    {
      "index": 0,
      "position": [0, 6],
      "orth": "Wpierw",
      "lemma": "wpierw",
      "pos": "adv",
      "ctag": "adv"
    },
    {
      "index": 1,
      "position": [7, 11],
      "orth": "pani",
      "lemma": "pani",
      "pos": "noun",
      "ctag": "subst:nom:f:sg"
    },
    {
      "index": 2,
      "position": [12, 18],
      "orth": "Hudson",
      "lemma": "Hudson",
      "pos": "noun",
      "ctag": "subst:nom:f:sg"
    },
    {
      "index": 3,
      "position": [19, 26],
      "orth": "została",
      "lemma": "zostać",
      "pos": "verb",
      "ctag": "praet:perf:f:sg"
    },
    {
      "index": 4,
      "position": [27, 34],
      "orth": "zerwana",
      "lemma": "zerwać",
      "pos": "verb",
      "ctag": "ppas:perf:nom:f:aff:sg"
    },
    <...>
  ],
  "phrases": [
    {
      "indices": [10, 11],
      "head": 10,
      "lemma": "odegrać się"
    }
  ],
  "wsd": [
    {
      "index": 0,
      "pl_sense": "wpierw.1.r",
      "plWN_syn_id": "01a4a067-aac5-11ed-aae5-0242ac130002",
      "plWN_lex_id": "f2757c30-aac4-11ed-aae5-0242ac130002",
      "plWN_syn_legacy_id": "477654",
      "plWN_lex_legacy_id": "718454",
      "PWN_syn_id": "00102736-r",
      "bn_syn_id": "bn:00115376r",
      "mapping_relation": "synonymy"
    },
    {
      "index": 1,
      "pl_sense": "pani.2.n",
      "plWN_syn_id": "f35fb1ed-aac4-11ed-aae5-0242ac130002",
      "plWN_lex_id": "d5145565-aac4-11ed-aae5-0242ac130002",
      "plWN_syn_legacy_id": "129",
      "plWN_lex_legacy_id": "20695",
      "PWN_syn_id": "10787470-n",
      "bn_syn_id": "bn:00001530n",
      "mapping_relation": "synonymy"
    },
    <...>
  ]
}
```

### Data Fields

Description of JSON keys:
- `text`: text of the sentence
- `tokens`: list of tokens produced by the tokenization process
  - `index`: order index of the token in the sentence
  - `position`: character span of the token, start included and end excluded
  - `orth`: word form as it appears in the text
  - `lemma`: lemmatised word
  - `pos`: part of speech
  - `ctag`: morphosyntactic tag
- `phrases`: list of multi-word expressions
- `wsd`: annotation labels for the WSD task

### Data Splits

We do not specify an exact data split for training and evaluation. However, we suggest using GLEX and Składnica for training and the other datasets for testing.
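The record layout described under Data Fields can be read with the standard `json` module. The following is a minimal sketch, not part of the dataset's own tooling: it assumes that each `wsd` entry's `index` refers to a token's `index` (as the sample record suggests) and that `position` is a half-open character span. The helper name `senses_by_token` is hypothetical, and the record is abridged from the sample above to its first two tokens.

```python
import json


def senses_by_token(record):
    """Pair each annotated token with its plWordNet sense via the shared index."""
    tokens = {tok["index"]: tok for tok in record["tokens"]}
    pairs = []
    for ann in record["wsd"]:
        tok = tokens[ann["index"]]  # assumption: wsd index == token index
        start, end = tok["position"]  # start included, end excluded
        # The character span should reproduce the surface form.
        assert record["text"][start:end] == tok["orth"]
        pairs.append((tok["orth"], tok["lemma"], ann["pl_sense"]))
    return pairs


# One JSONL line, abridged from the sample record (two tokens only).
line = json.dumps(
    {
        "text": "Wpierw pani Hudson została zerwana z łóżka, po czym "
        "odegrała się na mnie, a ja - na tobie.",
        "tokens": [
            {"index": 0, "position": [0, 6], "orth": "Wpierw",
             "lemma": "wpierw", "pos": "adv", "ctag": "adv"},
            {"index": 1, "position": [7, 11], "orth": "pani",
             "lemma": "pani", "pos": "noun", "ctag": "subst:nom:f:sg"},
        ],
        "phrases": [],
        "wsd": [
            {"index": 0, "pl_sense": "wpierw.1.r"},
            {"index": 1, "pl_sense": "pani.2.n"},
        ],
    },
    ensure_ascii=False,
)

record = json.loads(line)
pairs = senses_by_token(record)
print(pairs)
```

When reading a whole file, the same function can be applied to `json.loads(line)` for each non-empty line.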
## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection, Normalization and Post-processing

Source corpora were initially pre-processed using morphosyntactic tagging and multi-word expression recognition tools. To tokenize and tag the datasets we used [MorphoDiTa](https://clarin-pl.eu/dspace/handle/11321/425) adapted to Polish. To recognize multi-word expressions we applied the pattern-based matching tool [Corpus2-MWE](https://clarin-pl.eu/dspace/handle/11321/533); only MWEs from plWordNet were included. After manual annotation, sense indices of plWordNet 4.5 were mapped automatically to Princeton WordNet 3.0 and BabelNet 4.0 indices using plWordNet's interlingual mapping.

### Annotations

#### Annotation process

* 2+1 annotation process with inter-annotator agreement score over 0.6 PSA
* annotated with [plWordNet 4.5](http://plwordnet.pwr.wroc.pl/wordnet/)
* software: [WordNet-Loom](https://clarin-pl.eu/dspace/handle/11321/275) and [Inforex](https://clarin-pl.eu/dspace/handle/11321/13)
* both single-word and multi-word expressions annotated
* full-text sense annotation (excluding KPWr)

#### Who are the annotators?

- professional linguists from the CLARIN-PL project

### Personal and Sensitive Information

The datasets do not contain any personal or sensitive information.

## Considerations for Using the Data

### Discussion of Biases

Some datasets are biased towards the most frequent senses. No information about other biases is available; this needs further analysis.
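One way to gauge the most-frequent-sense bias mentioned above is to measure, per lemma, what share of its annotations goes to its single most common sense. The sketch below is illustrative only: the function name `most_frequent_sense_share` and the helper `toy_record` are hypothetical, the toy records are invented for the example (the label `pani.1.n` is not taken from the data), and `wsd` indices are assumed to point at token indices.

```python
from collections import Counter, defaultdict


def most_frequent_sense_share(records):
    """Map each lemma to (most frequent sense, share of that lemma's annotations)."""
    counts = defaultdict(Counter)
    for record in records:
        lemmas = {tok["index"]: tok["lemma"] for tok in record["tokens"]}
        for ann in record["wsd"]:
            counts[lemmas[ann["index"]]][ann["pl_sense"]] += 1
    shares = {}
    for lemma, senses in counts.items():
        sense, freq = senses.most_common(1)[0]
        shares[lemma] = (sense, freq / sum(senses.values()))
    return shares


def toy_record(lemma, sense):
    """Build a one-token record in the documented layout (toy data only)."""
    return {
        "tokens": [{"index": 0, "lemma": lemma}],
        "wsd": [{"index": 0, "pl_sense": sense}],
    }


# Invented toy sample: three annotations of one sense, one of another.
records = [toy_record("pani", "pani.2.n") for _ in range(3)]
records.append(toy_record("pani", "pani.1.n"))

shares = most_frequent_sense_share(records)
print(shares)
```

A share close to 1.0 for most lemmas would indicate that a most-frequent-sense baseline is hard to beat on that corpus.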
### Other Known Limitations

* sense inventories are usually incomplete, therefore some word senses might be missing in plWordNet
* single-word and multi-word terms expressing novel senses (missing in plWordNet) were not marked

## Additional Information

### Dataset Curators

Arkadiusz Janz (arkadiusz.janz@pwr.edu.pl)

### Licensing Information

- KPWR-100: [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
- KPWR: [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
- Walenty: [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
- Sherlock: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- Skladnica: [GNU GPL 3](http://www.gnu.org/licenses/gpl-3.0.en.html)
- GLEX: [plWordNet License](http://plwordnet.pwr.wroc.pl/wordnet/licence)

### Citation Information

Main source (all corpora as a unified benchmark) and published here on HuggingFace:

````
@InProceedings{10.1007/978-3-031-08754-7_70,
  author="Janz, Arkadiusz and Dziob, Agnieszka and Oleksy, Marcin and Baran, Joanna",
  editor="Groen, Derek and de Mulatier, Cl{\'e}llia and Paszynski, Maciej and Krzhizhanovskaya, Valeria V. and Dongarra, Jack J. and Sloot, Peter M. A.",
  title="A Unified Sense Inventory for Word Sense Disambiguation in Polish",
  booktitle="Computational Science -- ICCS 2022",
  year="2022",
  publisher="Springer International Publishing",
  address="Cham",
  pages="682--689",
  isbn="978-3-031-08754-7"
}
````

Related work
------------

KPWr-100, Składnica, SPEC

````
@article{janzresults,
  title={Results of the PolEval 2020 Shared Task 3: Word Sense Disambiguation},
  author={Janz, Arkadiusz and Chlebus, Joanna and Dziob, Agnieszka and Piasecki, Maciej},
  journal={Proceedings of the PolEval 2020 Workshop},
  pages={65--77},
  year={2020}
}
````

GLEX (EmoGLEX)

````
@article{janz2017plwordnet,
  title={{plWordNet} as a basis for large emotive lexicons of Polish},
  author={Janz, Arkadiusz and Kocon, Jan and Piasecki, Maciej and Zasko-Zielinska, Monika},
  journal={Proceedings of Human Language Technologies as a Challenge for Computer Science and Linguistics Poznan: Fundacja Uniwersytetu im. Adama Mickiewicza w Poznaniu},
  pages={189--193},
  year={2017}
}
````

KPWr

````
@conference{broda2012,
  address = {Istanbul, Turkey},
  author = {Bartosz Broda and Micha{\l} Marci{\'n}czuk and Marek Maziarz and Adam Radziszewski and Adam Wardy{\'n}ski},
  booktitle = {Proceedings of LREC'12},
  owner = {Marlena},
  publisher = {ELRA},
  timestamp = {2014.06.20},
  title = {KPWr: Towards a Free Corpus of Polish},
  year = {2012}
}
````

Składnica

````
@inproceedings{hajnicz-2014-lexico,
  title = "Lexico-Semantic Annotation of Sk{\l}adnica Treebank by means of {PLWN} Lexical Units",
  author = "Hajnicz, El{\.z}bieta",
  booktitle = "Proceedings of the Seventh Global {W}ordnet Conference",
  month = jan,
  year = "2014",
  address = "Tartu, Estonia",
  publisher = "University of Tartu Press",
  url = "https://aclanthology.org/W14-0104",
  pages = "23--31",
}
````

Walenty

````
@inproceedings{haj:and:bar:lrec16,
  author = {Hajnicz, El{\.z}bieta and Andrzejczuk, Anna and Bartosiak, Tomasz},
  crossref = {lrec:16},
  pages = {2625--2632},
  pdf = {http://www.lrec-conf.org/proceedings/lrec2016/pdf/382_Paper.pdf},
  title = {Semantic Layer of the Valence Dictionary of {P}olish \emph{{W}alenty}}
}
````

Mapping plWordNet onto Princeton WordNet

````
@inproceedings{rudnicka-etal-2021-non,
  title = "A (Non)-Perfect Match: Mapping pl{W}ord{N}et onto {P}rinceton{W}ord{N}et",
  author = "Rudnicka, Ewa and Witkowski, Wojciech and Piasecki, Maciej",
  booktitle = "Proceedings of the 11th Global Wordnet Conference",
  month = jan,
  year = "2021",
  address = "University of South Africa (UNISA)",
  publisher = "Global Wordnet Association",
  url = "https://aclanthology.org/2021.gwc-1.16",
  pages = "137--146"
}
````

Provider:

clarin-knext

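Summary of Original Information

Dataset Overview

Name: WSD Polish Datasets

Language: Polish (pl)

License: CC-BY-4.0

Multilinguality: monolingual

Size: 1M<n<10M

Source data: original

Task category: token classification

Task ID: word sense disambiguation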

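Dataset Content

Dataset summary: WSD Polish Datasets is a comprehensive benchmark for the word sense disambiguation (WSD) classification task in Polish. It consists of 7 distinct datasets, manually annotated with senses from plWordNet 4.5. The following datasets are included:

KPWr
KPWr-100
Sherlock (SPEC)
Skladnica
WikiGlex (a subset of the GLEX corpus)
EmoGlex (a subset of the GLEX corpus)
Walenty

Supported tasks and leaderboards: Word sense disambiguation. No leaderboard is provided, but an example evaluation script for evaluating WSD models is available.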
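The evaluation script itself is not reproduced on this page. As a rough illustration of what a WSD evaluation typically computes, the sketch below scores predictions by accuracy over gold-annotated tokens. It is not the authors' script: the function name `wsd_accuracy` and the prediction format (a dict keyed by record position and token index) are assumptions made for this example, and the wrong prediction `pani.1.n` is an invented label.

```python
def wsd_accuracy(gold_records, predicted):
    """Fraction of gold-annotated tokens whose predicted sense matches the gold sense.

    `predicted` maps (record position, token index) to a sense label;
    this interface is hypothetical.
    """
    correct = 0
    total = 0
    for rec_id, record in enumerate(gold_records):
        for ann in record["wsd"]:
            total += 1
            if predicted.get((rec_id, ann["index"])) == ann["pl_sense"]:
                correct += 1
    return correct / total if total else 0.0


# Gold annotations in the documented layout (only the fields used here).
gold = [
    {
        "wsd": [
            {"index": 0, "pl_sense": "wpierw.1.r"},
            {"index": 1, "pl_sense": "pani.2.n"},
        ]
    }
]

# Hypothetical model output: one correct and one incorrect sense.
predicted = {(0, 0): "wpierw.1.r", (0, 1): "pani.1.n"}

accuracy = wsd_accuracy(gold, predicted)
print(accuracy)
```

Tokens without a prediction count as errors here; a real evaluation may instead report precision and recall separately.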

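Data structure:

Data instances: Data are organized in JSONL format; each text sample is split into sentences.
Data fields:
- text: text of the sentence
- tokens: list of tokens produced by the tokenization process
  - index: order index of the token in the sentence
  - position: character span indices of the token
  - orth: word form
  - lemma: lemmatised word
  - pos: part of speech
  - ctag: morphosyntactic tag
- phrases: list of multi-word expressions
- wsd: annotation labels for the WSD task

Data splits: No exact split for training and evaluation is specified, but using GLEX and Składnica for training and the other datasets for testing is suggested.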
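The suggested split can be expressed as a simple partition by corpus name. The sketch below assumes the records have already been loaded into a dict keyed by corpus name; the keys, the `split_by_corpus` helper, and the placeholder records are hypothetical, since the page does not say how the files are named. GLEX is taken to mean its two listed subsets, WikiGlex and EmoGlex.

```python
# Corpora suggested for training: GLEX (WikiGlex + EmoGlex) and Skladnica.
TRAIN_CORPORA = {"WikiGlex", "EmoGlex", "Skladnica"}


def split_by_corpus(corpora):
    """Partition {corpus name: [records]} into (train, test) record lists."""
    train, test = [], []
    for name, records in corpora.items():
        target = train if name in TRAIN_CORPORA else test
        target.extend(records)
    return train, test


# Placeholder records, one per corpus, standing in for loaded JSONL data.
corpora = {
    name: [{"text": f"sample from {name}"}]
    for name in [
        "KPWr", "KPWr-100", "Sherlock", "Skladnica",
        "WikiGlex", "EmoGlex", "Walenty",
    ]
}

train, test = split_by_corpus(corpora)
print(len(train), len(test))
```

Since no official split is defined, results obtained this way are only comparable with work that uses the same partition.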

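Dataset Creation

Source data: The source corpora were pre-processed with morphosyntactic tagging and multi-word expression recognition tools. MorphoDiTa was used for tokenization and tagging, and Corpus2-MWE for recognizing multi-word expressions.

Annotation process:

2+1 annotation process with an inter-annotator agreement score above 0.6 PSA
Annotated with plWordNet 4.5
Software: WordNet-Loom and Inforex
Both single-word and multi-word expressions annotated
Full-text sense annotation (excluding KPWr)

Annotators: professional linguists from the CLARIN-PL project.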

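Considerations for Using the Data

Discussion of biases: Some datasets are biased towards the most frequent senses. Information about other biases is lacking and needs further analysis.

Other known limitations:

Sense inventories are usually incomplete, so some word senses may be missing from plWordNet
Single-word and multi-word terms expressing novel senses (missing from plWordNet) were not marked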
