
clarin-knext/wsd_polish_datasets

Hugging Face · updated 2024-02-11 · indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/clarin-knext/wsd_polish_datasets
Resource description:
---
annotations_creators:
- expert-generated
language:
- pl
language_creators:
- expert-generated
- found
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: wsd-polish-datasets
size_categories:
- 1M<n<10M
source_datasets:
- original
tags: []
task_categories:
- token-classification
task_ids:
- word-sense-disambiguation
---

# Word Sense Disambiguation Corpora for Polish

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:** https://link.springer.com/chapter/10.1007/978-3-031-08754-7_70
- **Point of Contact:** arkadiusz.janz@pwr.edu.pl

### Dataset Summary

`WSD Polish Datasets` is a comprehensive benchmark for the word sense disambiguation (WSD) classification task in Polish. It consists of 7 distinct datasets, manually annotated with senses from the plWordNet-4.5 sense inventory.
The following datasets were annotated and included in our benchmark:

- KPWr
- KPWr-100
- Sherlock (SPEC)
- Skladnica
- WikiGlex (a subset of GLEX corpus)
- EmoGlex (a subset of GLEX corpus)
- Walenty

For more details, please check the following publication:

```
@InProceedings{10.1007/978-3-031-08754-7_70,
  author="Janz, Arkadiusz and Dziob, Agnieszka and Oleksy, Marcin and Baran, Joanna",
  editor="Groen, Derek and de Mulatier, Cl{\'e}llia and Paszynski, Maciej and Krzhizhanovskaya, Valeria V. and Dongarra, Jack J. and Sloot, Peter M. A.",
  title="A Unified Sense Inventory for Word Sense Disambiguation in Polish",
  booktitle="Computational Science -- ICCS 2022",
  year="2022",
  publisher="Springer International Publishing",
  address="Cham",
  pages="682--689",
  isbn="978-3-031-08754-7"
}
```

**A new publication on Polish WSD corpora will be available soon**

### Supported Tasks and Leaderboards

Word sense disambiguation task. We do not provide a leaderboard. However, we provide an example evaluation script for evaluating WSD models.

### Languages

Polish language, PL

## Dataset Structure

### Data Instances

Data are structured in JSONL format; each text sample is divided into sentences.
```
{
  "text": "Wpierw pani Hudson została zerwana z łóżka, po czym odegrała się na mnie, a ja - na tobie.",
  "tokens": [
    {
      "index": 0,
      "position": [0, 6],
      "orth": "Wpierw",
      "lemma": "wpierw",
      "pos": "adv",
      "ctag": "adv"
    },
    {
      "index": 1,
      "position": [7, 11],
      "orth": "pani",
      "lemma": "pani",
      "pos": "noun",
      "ctag": "subst:nom:f:sg"
    },
    {
      "index": 2,
      "position": [12, 18],
      "orth": "Hudson",
      "lemma": "Hudson",
      "pos": "noun",
      "ctag": "subst:nom:f:sg"
    },
    {
      "index": 3,
      "position": [19, 26],
      "orth": "została",
      "lemma": "zostać",
      "pos": "verb",
      "ctag": "praet:perf:f:sg"
    },
    {
      "index": 4,
      "position": [27, 34],
      "orth": "zerwana",
      "lemma": "zerwać",
      "pos": "verb",
      "ctag": "ppas:perf:nom:f:aff:sg"
    },
    <...>
  ],
  "phrases": [
    {
      "indices": [10, 11],
      "head": 10,
      "lemma": "odegrać się"
    }
  ],
  "wsd": [
    {
      "index": 0,
      "pl_sense": "wpierw.1.r",
      "plWN_syn_id": "01a4a067-aac5-11ed-aae5-0242ac130002",
      "plWN_lex_id": "f2757c30-aac4-11ed-aae5-0242ac130002",
      "plWN_syn_legacy_id": "477654",
      "plWN_lex_legacy_id": "718454",
      "PWN_syn_id": "00102736-r",
      "bn_syn_id": "bn:00115376r",
      "mapping_relation": "synonymy"
    },
    {
      "index": 1,
      "pl_sense": "pani.2.n",
      "plWN_syn_id": "f35fb1ed-aac4-11ed-aae5-0242ac130002",
      "plWN_lex_id": "d5145565-aac4-11ed-aae5-0242ac130002",
      "plWN_syn_legacy_id": "129",
      "plWN_lex_legacy_id": "20695",
      "PWN_syn_id": "10787470-n",
      "bn_syn_id": "bn:00001530n",
      "mapping_relation": "synonymy"
    },
    <...>
  ]
}
```

### Data Fields

Description of JSON keys:

- `text`: text of the sentence
- `tokens`: list of tokens produced by the tokenization process
  - `index`: order index of the token in the sentence
  - `position`: character span of the token in `text` (start included, end excluded)
  - `orth`: word form as it appears in the text
  - `lemma`: lemmatised word
  - `pos`: part of speech
  - `ctag`: morphosyntactic tag
- `phrases`: list of multi-word expressions
- `wsd`: annotation labels for the WSD task

### Data Splits

We do not specify an exact data split for training and evaluation. However, we suggest using GLEX and Składnica for training and the other datasets for testing.
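The record layout described under Data Fields can be read with the standard library alone. The following is a minimal sketch, not the dataset's own tooling: it assumes that `index` in a `wsd` entry refers to the `index` of a token (which matches the example record, but is not confirmed here for multi-word phrases), and the `read_jsonl` path argument is a hypothetical file name. The inline sample is a shortened record modelled on the example above.

```python
import json

# Shortened record modelled on the documented example (two tokens only).
SAMPLE_LINE = json.dumps({
    "text": "Wpierw pani",
    "tokens": [
        {"index": 0, "position": [0, 6], "orth": "Wpierw",
         "lemma": "wpierw", "pos": "adv", "ctag": "adv"},
        {"index": 1, "position": [7, 11], "orth": "pani",
         "lemma": "pani", "pos": "noun", "ctag": "subst:nom:f:sg"},
    ],
    "phrases": [],
    "wsd": [
        {"index": 0, "pl_sense": "wpierw.1.r"},
        {"index": 1, "pl_sense": "pani.2.n"},
    ],
}, ensure_ascii=False)


def parse_record(line):
    """Parse one JSONL line and pair each token with its sense label (or None)."""
    record = json.loads(line)
    # Assumption: wsd "index" points at a token "index".
    senses = {item["index"]: item["pl_sense"] for item in record.get("wsd", [])}
    rows = []
    for token in record["tokens"]:
        start, end = token["position"]
        # Spans are start-inclusive, end-exclusive, so slicing should recover orth.
        span_ok = record["text"][start:end] == token["orth"]
        rows.append((token["orth"], token["lemma"], senses.get(token["index"]), span_ok))
    return rows


def read_jsonl(path):
    """Yield parsed rows for every non-empty line of a JSONL file (path is hypothetical)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_record(line)


rows = parse_record(SAMPLE_LINE)
for orth, lemma, sense, span_ok in rows:
    print(orth, lemma, sense, span_ok)
```

Tokens without a matching `wsd` entry get `None` as their sense, which keeps unannotated tokens visible instead of silently dropping them.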
## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection, Normalization and Post-processing

Source corpora were initially pre-processed using morphosyntactic tagging and multi-word expression recognition tools. To tokenize and tag the datasets we used [MorphoDiTa](https://clarin-pl.eu/dspace/handle/11321/425) adapted to Polish. To recognize multi-word expressions we applied the pattern-based matching tool [Corpus2-MWE](https://clarin-pl.eu/dspace/handle/11321/533); only MWEs from plWordNet were included. After manual annotation, sense indices of plWordNet 4.5 were mapped automatically to Princeton WordNet 3.0 and BabelNet 4.0 indices using plWordNet's interlingual mapping.

### Annotations

#### Annotation process

* 2+1 annotation process with inter-annotator agreement score over 0.6 PSA
* annotated with [plWordNet 4.5](http://plwordnet.pwr.wroc.pl/wordnet/)
* software: [WordNet-Loom](https://clarin-pl.eu/dspace/handle/11321/275) and [Inforex](https://clarin-pl.eu/dspace/handle/11321/13)
* both single-word and multi-word expressions annotated
* full-text sense annotation (excluding KPWr)

#### Who are the annotators?

- professional linguists from the CLARIN-PL project

### Personal and Sensitive Information

The datasets do not contain any personal or sensitive information.

## Considerations for Using the Data

### Discussion of Biases

Some datasets are biased towards the most frequent senses. No information about other biases is available; this needs further analysis.
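One way to inspect the skew towards frequent senses is to measure, per lemma, how large a share of annotations goes to its most frequent sense. The sketch below is an illustration under stated assumptions, not an analysis from the dataset authors: the function name `mfs_share` is hypothetical, and the toy annotations are invented (only `pani.2.n` appears in the example record; the other sense labels are made up for the demonstration).

```python
from collections import Counter, defaultdict


def mfs_share(pairs):
    """Map each lemma to (most frequent sense, share of that sense among its annotations)."""
    by_lemma = defaultdict(Counter)
    for lemma, sense in pairs:
        by_lemma[lemma][sense] += 1
    result = {}
    for lemma, counts in by_lemma.items():
        sense, freq = counts.most_common(1)[0]
        result[lemma] = (sense, freq / sum(counts.values()))
    return result


# Invented toy annotations as (lemma, sense label) pairs.
toy = [
    ("pani", "pani.2.n"),
    ("pani", "pani.2.n"),
    ("pani", "pani.1.n"),
    ("zamek", "zamek.1.n"),
]
shares = mfs_share(toy)
for lemma, (sense, share) in shares.items():
    print(lemma, sense, round(share, 2))
```

A share close to 1.0 for most lemmas would indicate that a most-frequent-sense baseline is hard to beat on that dataset.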
### Other Known Limitations

* sense inventories are usually incomplete therefore some word senses might be missing in plWordNet
* single-word and multi-word terms expressing novel senses (missing in plWordNet) were not marked

## Additional Information

### Dataset Curators

Arkadiusz Janz (arkadiusz.janz@pwr.edu.pl)

### Licensing Information

- KPWR-100 [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
- KPWR [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
- Walenty [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
- Sherlock [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- Skladnica [GNU GPL 3](http://www.gnu.org/licenses/gpl-3.0.en.html)
- GLEX [plWordNet License](http://plwordnet.pwr.wroc.pl/wordnet/licence)

### Citation Information

Main source (all corpora as a unified benchmark) and published here on HuggingFace:

````
@InProceedings{10.1007/978-3-031-08754-7_70,
  author="Janz, Arkadiusz and Dziob, Agnieszka and Oleksy, Marcin and Baran, Joanna",
  editor="Groen, Derek and de Mulatier, Cl{\'e}llia and Paszynski, Maciej and Krzhizhanovskaya, Valeria V. and Dongarra, Jack J. and Sloot, Peter M. A.",
  title="A Unified Sense Inventory for Word Sense Disambiguation in Polish",
  booktitle="Computational Science -- ICCS 2022",
  year="2022",
  publisher="Springer International Publishing",
  address="Cham",
  pages="682--689",
  isbn="978-3-031-08754-7"
}
````

Related work
------------

KPWr-100, Składnica, SPEC

````
@article{janzresults,
  title={Results of the PolEval 2020 Shared Task 3: Word Sense Disambiguation},
  author={Janz, Arkadiusz and Chlebus, Joanna and Dziob, Agnieszka and Piasecki, Maciej},
  journal={Proceedings of the PolEval 2020 Workshop},
  pages={65--77},
  year={2020}
}
````

GLEX (EmoGLEX)

````
@article{janz2017plwordnet,
  title={{plWordNet} as a basis for large emotive lexicons of Polish},
  author={Janz, Arkadiusz and Kocon, Jan and Piasecki, Maciej and Zasko-Zielinska, Monika},
  journal={Proceedings of Human Language Technologies as a Challenge for Computer Science and Linguistics Poznan: Fundacja Uniwersytetu im. Adama Mickiewicza w Poznaniu},
  pages={189--193},
  year={2017}
}
````

KPWr

````
@conference{broda2012,
  address = {Istanbul, Turkey},
  author = {Bartosz Broda and Micha{\l} Marci{\'n}czuk and Marek Maziarz and Adam Radziszewski and Adam Wardy{\'n}ski},
  booktitle = {Proceedings of LREC'12},
  owner = {Marlena},
  publisher = {ELRA},
  timestamp = {2014.06.20},
  title = {KPWr: Towards a Free Corpus of Polish},
  year = {2012}
}
````

Składnica

````
@inproceedings{hajnicz-2014-lexico,
  title = "Lexico-Semantic Annotation of Sk{\l}adnica Treebank by means of {PLWN} Lexical Units",
  author = "Hajnicz, El{\.z}bieta",
  booktitle = "Proceedings of the Seventh Global {W}ordnet Conference",
  month = jan,
  year = "2014",
  address = "Tartu, Estonia",
  publisher = "University of Tartu Press",
  url = "https://aclanthology.org/W14-0104",
  pages = "23--31",
}
````

Walenty

````
@inproceedings{haj:and:bar:lrec16,
  author = {Hajnicz, El{\.z}bieta and Andrzejczuk, Anna and Bartosiak, Tomasz},
  crossref = {lrec:16},
  pages = {2625--2632},
  pdf = {http://www.lrec-conf.org/proceedings/lrec2016/pdf/382_Paper.pdf},
  title = {Semantic Layer of the Valence Dictionary of {P}olish \emph{{W}alenty}}
}
````

Mapping plWordNet onto Princeton WordNet

````
@inproceedings{rudnicka-etal-2021-non,
  title = "A (Non)-Perfect Match: Mapping pl{W}ord{N}et onto {P}rinceton{W}ord{N}et",
  author = "Rudnicka, Ewa and Witkowski, Wojciech and Piasecki, Maciej",
  booktitle = "Proceedings of the 11th Global Wordnet Conference",
  month = jan,
  year = "2021",
  address = "University of South Africa (UNISA)",
  publisher = "Global Wordnet Association",
  url = "https://aclanthology.org/2021.gwc-1.16",
  pages = "137--146"
}
````
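Provided by:
clarin-knext
Summary of original information

Dataset overview

Name: WSD Polish Datasets

Language: Polish (pl)

License: CC-BY-4.0

Multilinguality: monolingual

Size: 1M<n<10M

Source data: original

Task category: token classification

Task ID: word sense disambiguation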

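Dataset contents

Dataset summary: WSD Polish Datasets is a comprehensive benchmark for the word sense disambiguation (WSD) classification task in Polish. It consists of 7 distinct datasets, manually annotated with senses from plWordNet-4.5. The following datasets are included:

  • KPWr
  • KPWr-100
  • Sherlock (SPEC)
  • Skladnica
  • WikiGlex (a subset of the GLEX corpus)
  • EmoGlex (a subset of the GLEX corpus)
  • Walenty

Supported tasks and leaderboards: word sense disambiguation. No leaderboard is provided, but an example evaluation script for evaluating WSD models is available.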

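Data structure:

  • Data instances: data are organized in JSONL format; each text sample is divided into sentences.
  • Data fields:
    • text: text of the sentence
    • tokens: list of tokens produced by the tokenization process
      • index: order index of the token in the sentence
      • position: character span indices of the token
      • orth: word form
      • lemma: lemmatised word
      • pos: part of speech
      • ctag: morphosyntactic tag
    • phrases: list of multi-word expressions
    • wsd: annotation labels for the WSD task

Data splits: no exact split for training and evaluation is specified, but using GLEX and Składnica for training and the other datasets for testing is suggested.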

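Dataset creation

Source data: source corpora were pre-processed with morphosyntactic tagging and multi-word expression recognition tools. MorphoDiTa was used for tokenization and tagging, and Corpus2-MWE for recognizing multi-word expressions.

Annotation process:

  • 2+1 annotation process with an inter-annotator agreement score over 0.6 PSA
  • Annotated with plWordNet 4.5
  • Software: WordNet-Loom and Inforex
  • Both single-word and multi-word expressions annotated
  • Full-text sense annotation (excluding KPWr)

Annotators: professional linguists from the CLARIN-PL project.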

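Considerations for using the data

Discussion of biases: some datasets are biased towards the most frequent senses. Information about other biases is lacking and needs further analysis.

Other known limitations:

  • Sense inventories are usually incomplete, so some word senses may be missing from plWordNet
  • Single-word and multi-word terms expressing novel senses (missing from plWordNet) were not marked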