masakhane/masakhapos

Name: masakhane/masakhapos
Creator: masakhane
Published: 2024-02-05 11:09:55
License: no description available

Source: Hugging Face. Updated: 2024-02-05. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/masakhane/masakhapos

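Overview:

MasakhaPOS is the largest publicly available high-quality dataset for part-of-speech (POS) tagging in 20 African languages: Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Mossi, Chichewa, Nigerian Pidgin, Shona, Swahili, Setswana, Twi, Wolof, Xhosa, Yoruba, and Zulu. Training, validation, and test sets are available for all 20 languages. The text comes from the news domain and the annotations were produced by experts.

Provider: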

masakhane

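Summary of original information

Dataset overview

Dataset description

Dataset summary

MasakhaPOS is the largest publicly available high-quality dataset for part-of-speech (POS) tagging of African languages, covering 20 languages. Train, validation, and test splits are available for all 20 languages.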

Supported tasks and leaderboards

Part-of-speech tagging: performance on this task is measured by accuracy (higher is better).
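The card does not say how accuracy is computed. A minimal sketch, assuming the usual token-level convention for POS tagging (correct tags divided by total tokens), could look like this. The function name `token_accuracy` and the sample tag sequences are illustrative and not part of the dataset.

```python
from typing import Sequence


def token_accuracy(gold: Sequence[Sequence[str]], pred: Sequence[Sequence[str]]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")
    correct = 0
    total = 0
    for gold_tags, pred_tags in zip(gold, pred):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("each sentence needs exactly one prediction per token")
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
        total += len(gold_tags)
    # An empty evaluation set is reported as 0.0 rather than dividing by zero.
    return correct / total if total else 0.0


# Made-up tag sequences: two sentences, five tokens, one tagging error.
gold = [["PROPN", "VERB", "NOUN"], ["PRON", "VERB"]]
pred = [["PROPN", "VERB", "ADJ"], ["PRON", "VERB"]]
print(token_accuracy(gold, pred))  # 4 of 5 tokens match
```

Sentence-level accuracy (all tags in a sentence correct) is a stricter alternative; the sketch above does not compute it.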

Languages

The dataset covers 20 languages:

Bambara (bam)
Ghomala (bbj)
Ewe (ewe)
Fon (fon)
Hausa (hau)
Igbo (ibo)
Kinyarwanda (kin)
Luganda (lug)
Dholuo (luo)
Mossi (mos)
Chichewa (nya)
Nigerian Pidgin (pcm)
chiShona (sna)
Kiswahili (swa)
Setswana (tsn)
Twi (twi)
Wolof (wol)
isiXhosa (xho)
Yorùbá (yor)
isiZulu (zul)

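Dataset structure

Data instances

The data consists of sentences separated by blank lines, with each token and its tag separated by a tab. For example, a Yorùbá data point looks like this:

{'id': 0, 'upos': [0, 10, 10, 16, 0, 14, 0, 16, 0], 'tokens': ['Ọ̀gbẹ́ni', 'Nuhu', 'Adam', 'kúrò', 'nípò', 'bí', 'ẹní', 'yọ', 'jìgá']}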
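The tab-separated, blank-line-delimited layout described above can be read with a few lines of code. This is a sketch under the assumption that every non-empty line has exactly two columns (token, then tag); the function name `read_tagged_sentences` and the sample text are illustrative and not taken from the dataset files.

```python
from typing import Iterable, List, Tuple

# One sentence is a pair of parallel lists: (tokens, tags).
Sentence = Tuple[List[str], List[str]]


def read_tagged_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group token<TAB>tag lines into sentences, splitting on blank lines."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"expected 'token<TAB>tag', got: {line!r}")
        tokens.append(parts[0])
        tags.append(parts[1])
    # Flush the last sentence when the input does not end with a blank line.
    if tokens:
        sentences.append((tokens, tags))
    return sentences


# Illustrative input only; the sentence boundary here is artificial.
sample = "Ọ̀gbẹ́ni\tNOUN\nNuhu\tPROPN\n\nAdam\tPROPN\n"
parsed = read_tagged_sentences(sample.splitlines())
print(parsed)
```

For a file on disk, passing the open file handle as `lines` works the same way, since the function only iterates over lines.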

Data fields

id: the ID of the sample
tokens: the tokens of the example text
upos: the POS tag of each token

The POS tags correspond to the following list:

"NOUN", "PUNCT", "ADP", "NUM", "SYM", "SCONJ", "ADJ", "PART", "DET", "CCONJ", "PROPN", "PRON", "X", "ADV", "INTJ", "VERB", "AUX"

Definitions of the tags can be found on the Universal Dependencies (UD) website.
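Integer tag IDs can be turned back into tag names by indexing into the label list. The sketch below assumes the IDs are zero-based positions in the list exactly as printed above; the card does not state this explicitly, so the order should be checked against the feature metadata of the loaded dataset before relying on it. The names `UPOS_LABELS` and `decode_tags` are illustrative.

```python
from typing import List, Sequence

# Label order copied from the list above. ASSUMPTION: a tag ID is the
# zero-based index into this list; verify against the dataset's own metadata.
UPOS_LABELS = [
    "NOUN", "PUNCT", "ADP", "NUM", "SYM", "SCONJ", "ADJ", "PART", "DET",
    "CCONJ", "PROPN", "PRON", "X", "ADV", "INTJ", "VERB", "AUX",
]


def decode_tags(tag_ids: Sequence[int], labels: Sequence[str] = UPOS_LABELS) -> List[str]:
    """Map integer tag IDs to tag names, rejecting IDs outside the label list."""
    decoded = []
    for tag_id in tag_ids:
        if not 0 <= tag_id < len(labels):
            raise ValueError(f"tag id {tag_id} is outside 0..{len(labels) - 1}")
        decoded.append(labels[tag_id])
    return decoded


print(decode_tags([0, 10, 10]))  # ['NOUN', 'PROPN', 'PROPN']
```

Passing a different `labels` sequence lets the same function work with whatever label order the loaded dataset actually reports.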

Data splits

Every language has three splits:

train: training set
dev: validation set
test: test set
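One way to load a single language is through the Hugging Face `datasets` library. The sketch below assumes that the repository `masakhane/masakhapos` exposes one configuration per language, named by its three-letter code (with `pcm` for Nigerian Pidgin and `swa` for Kiswahili); the card does not confirm these configuration names, and the hosted split names may differ from the `train`/`dev`/`test` labels used here. The helper `load_masakhapos` is hypothetical.

```python
# ASSUMPTION: each code below is also the configuration name on the Hub.
LANGUAGE_CODES = (
    "bam", "bbj", "ewe", "fon", "hau", "ibo", "kin", "lug", "luo", "mos",
    "nya", "pcm", "sna", "swa", "tsn", "twi", "wol", "xho", "yor", "zul",
)


def load_masakhapos(lang: str):
    """Load all splits for one language; requires network access when called."""
    if lang not in LANGUAGE_CODES:
        raise ValueError(f"unknown language code {lang!r}; expected one of {LANGUAGE_CODES}")
    # Imported lazily so the code list above is usable without the library.
    from datasets import load_dataset

    # Second positional argument is the configuration name (assumed: language code).
    return load_dataset("masakhane/masakhapos", lang)
```

Calling `load_masakhapos("yor")` would then return the available splits for Yorùbá, provided the assumptions above hold. Depending on the installed version of `datasets`, extra arguments may be needed to load this repository.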

Split sizes for each language are as follows:

Language	Train	Validation	Test
Bambara	775	154	619
Ghomala	750	149	599
Ewe	728	145	582
Fon	810	161	646
Hausa	753	150	601
Igbo	803	160	642
Kinyarwanda	757	151	604
Luganda	733	146	586
Luo	758	151	606
Mossi	757	151	604
Chichewa	728	145	582
Nigerian-Pidgin	752	150	600
chiShona	747	149	596
Kiswahili	693	138	553
Setswana	754	150	602
Akan/Twi	785	157	628
Wolof	782	156	625
isiXhosa	752	150	601
Yoruba	893	178	713
isiZulu	753	150	601

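Dataset creation

Curation rationale

The dataset was created to introduce new resources for 20 languages that are under-served in natural language processing.

Source data

The source data comes from the news domain; further details can be found here.

Annotations

Details of the annotation process can be found here. The annotators were recruited from Masakhane.

Personal and sensitive information

The data is sourced from newspapers and contains only mentions of public figures or individuals.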

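Considerations for using the data

Social impact of the dataset

[More information needed]

Discussion of biases

[More information needed]

Other known limitations

Users should note that the dataset contains only news text, which may limit how well systems developed on it apply to other domains.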

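Additional information

Dataset curators

[More information needed]

Licensing information

The data is licensed under CC 4.0 Non-Commercial.

Citation information

BibTeX citation for the dataset: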

@inproceedings{dione-etal-2023-masakhapos,
    title = "{M}asakha{POS}: Part-of-Speech Tagging for Typologically Diverse {A}frican languages",
    author = "Dione, Cheikh M. Bamba  and
      Adelani, David Ifeoluwa  and
      Nabende, Peter  and
      Alabi, Jesujoba  and
      Sindane, Thapelo  and
      Buzaaba, Happy  and
      Muhammad, Shamsuddeen Hassan  and
      Emezue, Chris Chinenye  and
      Ogayo, Perez  and
      Aremu, Anuoluwapo  and
      Gitau, Catherine  and
      Mbaye, Derguene  and
      Mukiibi, Jonathan  and
      Sibanda, Blessing  and
      Dossou, Bonaventure F. P.  and
      Bukula, Andiswa  and
      Mabuya, Rooweither  and
      Tapo, Allahsera Auguste  and
      Munkoh-Buabeng, Edwin  and
      Memdjokam Koagne, Victoire  and
      Ouoba Kabore, Fatoumata  and
      Taylor, Amelia  and
      Kalipe, Godson  and
      Macucwa, Tebogo  and
      Marivate, Vukosi  and
      Gwadabe, Tajuddeen  and
      Elvis, Mboning Tchiaze  and
      Onyenwe, Ikechukwu  and
      Atindogbe, Gratien  and
      Adelani, Tolulope  and
      Akinade, Idris  and
      Samuel, Olanrewaju  and
      Nahimana, Marien  and
      Musabeyezu, Th{\'e}og{\`e}ne  and
      Niyomutabazi, Emile  and
      Chimhenga, Ester  and
      Gotosa, Kudzai  and
      Mizha, Patrick  and
      Agbolo, Apelete  and
      Traore, Seydou  and
      Uchechukwu, Chinedu  and
      Yusuf, Aliyu  and
      Abdullahi, Muhammad  and
      Klakow, Dietrich",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.609",
    doi = "10.18653/v1/2023.acl-long.609",
    pages = "10883--10900",
    abstract = "In this paper, we present AfricaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the universal dependencies (UD) guidelines. We conducted extensive POS baseline experiments using both conditional random field and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in the UD. Evaluating on the AfricaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with parameter-fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems to be more effective for POS tagging in unseen languages.",
}

Contributions

Thanks to @dadelani for adding this dataset.
