
masakhane/masakhaner-x

Source: Hugging Face. Updated: 2024-11-21. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/masakhane/masakhaner-x
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- am
- bbj
- bm
- ee
- ha
- ig
- lg
- luo
- mos
- ny
- pcm
- rw
- sn
- sw
- tn
- tw
- wo
- xh
- yo
- zu
license:
- unknown
multilinguality:
- multilingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: MasakhaNER-X
dataset_info:
- config_name: am
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 250}, {name: test, num_examples: 500}]
- config_name: bbj
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 483}, {name: test, num_examples: 966}]
- config_name: bm
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 638}, {name: test, num_examples: 1000}]
- config_name: ee
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 501}, {name: test, num_examples: 1000}]
- config_name: ha
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 272}, {name: test, num_examples: 545}]
- config_name: ig
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 1000}, {name: test, num_examples: 1000}]
- config_name: lg
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 906}, {name: test, num_examples: 1000}]
- config_name: luo
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 644}, {name: validation, num_examples: 92}, {name: test, num_examples: 185}]
- config_name: mos
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 648}, {name: test, num_examples: 1000}]
- config_name: ny
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 893}, {name: test, num_examples: 1000}]
- config_name: pcm
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 1000}, {name: test, num_examples: 1000}]
- config_name: rw
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 1000}, {name: test, num_examples: 1000}]
- config_name: sn
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 887}, {name: test, num_examples: 1000}]
- config_name: sw
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 1000}, {name: test, num_examples: 1000}]
- config_name: tn
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 499}, {name: test, num_examples: 996}]
- config_name: tw
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 605}, {name: test, num_examples: 1000}]
- config_name: wo
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 923}, {name: test, num_examples: 1000}]
- config_name: xh
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 817}, {name: test, num_examples: 1000}]
- config_name: yo
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 1000}, {name: test, num_examples: 1000}]
- config_name: zu
  features: [{name: id, dtype: string}, {name: text, dtype: string}, {name: spans, sequence: string}, {name: target, dtype: string}]
  splits: [{name: train, num_examples: 1441}, {name: validation, num_examples: 836}, {name: test, num_examples: 1000}]
config_names:
- am
- bbj
- bm
- ee
- ha
- ig
- lg
- luo
- mos
- ny
- pcm
- rw
- sn
- sw
- tn
- tw
- wo
- xh
- yo
- zu
---

# Dataset Card for MasakhaNER-X

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [homepage](https://github.com/masakhane-io/masakhane-ner/tree/main/xtreme-up/MasakhaNER-X)
- **Repository:** [github](https://github.com/masakhane-io/masakhane-ner/tree/main/xtreme-up/MasakhaNER-X)
- **Paper:** [paper](https://aclanthology.org/2022.emnlp-main.298)
- **Point of Contact:** [Masakhane](https://www.masakhane.io/) or didelani@lsv.uni-saarland.de

### Dataset Summary

MasakhaNER-X is an aggregation of the MasakhaNER 1.0 and MasakhaNER 2.0 datasets for 20 African languages. The dataset is not in CoNLL format: the input is the original raw text, and the output is byte-level span annotations.

Example:

`{"example_id": "test-00015916", "language": "pcm", "text": "By Bashir Ibrahim Hassan", "spans": [{"start_byte": 3, "limit_byte": 24, "label": "PER"}], "target": "PER: Bashir Ibrahim Hassan"}`

MasakhaNER-X is a named entity dataset consisting of PER, ORG, LOC, and DATE entities annotated by Masakhane for twenty African languages:
- Amharic
- Ghomala
- Bambara
- Ewe
- Hausa
- Igbo
- Kinyarwanda
- Luganda
- Luo
- Mossi
- Chichewa
- Nigerian-Pidgin
- chiShona
- Swahili
- Setswana
- Twi
- Wolof
- Xhosa
- Yoruba
- Zulu

The train/validation/test sets are available for all twenty languages.
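
A minimal sketch of how the byte-level spans described above can be turned back into entity strings. It assumes `start_byte` is inclusive and `limit_byte` is exclusive over the UTF-8 encoding of `text`, which is consistent with the example (bytes 3 to 24 cover "Bashir Ibrahim Hassan"). The helper name `extract_entities` is hypothetical, and the JSON-string handling is an assumption based on `spans` being declared as a sequence of strings:

```python
import json


def extract_entities(text, spans):
    """Return (label, surface) pairs by slicing the UTF-8 bytes of `text`.

    Assumption: each span is either a dict or a JSON-encoded string with
    the keys `start_byte`, `limit_byte`, and `label`.
    """
    raw = text.encode("utf-8")
    entities = []
    for span in spans:
        if isinstance(span, str):
            span = json.loads(span)  # spans may arrive as JSON strings
        surface = raw[span["start_byte"]:span["limit_byte"]].decode("utf-8")
        entities.append((span["label"], surface))
    return entities


example = {
    "text": "By Bashir Ibrahim Hassan",
    "spans": [{"start_byte": 3, "limit_byte": 24, "label": "PER"}],
}
print(extract_entities(example["text"], example["spans"]))
# [('PER', 'Bashir Ibrahim Hassan')]
```

Slicing the encoded bytes rather than the Python string matters for text with multi-byte characters, where byte offsets and character offsets differ.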

For more details, see https://aclanthology.org/2022.emnlp-main.298

### Supported Tasks and Leaderboards

- `named-entity-recognition`: Performance on this task is measured with [Span F1](https://github.com/google-research/multilingual-t5/blob/9dcd60fc43c31a8651461f9a21894a134ba22166/multilingual_t5/evaluation/metrics.py#L123) (higher is better). A predicted named entity counts as correct only if it exactly matches the corresponding entity in the data.

### Languages

There are twenty languages available:
- Amharic (am)
- Ghomala (bbj)
- Bambara (bm)
- Ewe (ee)
- Hausa (ha)
- Igbo (ig)
- Kinyarwanda (rw)
- Luganda (lg)
- Luo (luo)
- Mossi (mos)
- Chichewa (ny)
- Nigerian-Pidgin (pcm)
- chiShona (sn)
- Swahili (sw)
- Setswana (tn)
- Twi (tw)
- Wolof (wo)
- Xhosa (xh)
- Yoruba (yo)
- Zulu (zu)

## Dataset Structure

### Data Instances

The examples look like this for Nigerian-Pidgin:

```
from datasets import load_dataset

# Pass the language code as the configuration name
data = load_dataset('masakhane/masakhaner-x', 'pcm')

# A data point is a raw sentence with byte-level span annotations
# and a flattened target string.
{'id': '0',
 'text': "Most of de people who dey opposed to Prez Akufo-Addo en decision say within 3 weeks of lockdown, total number of cases for Ghana rise from around 100 catch 1024.",
 'spans': [{"start_byte": 42, "limit_byte": 52, "label": "PER"},
           {"start_byte": 76, "limit_byte": 83, "label": "DATE"},
           {"start_byte": 123, "limit_byte": 128, "label": "LOC"}],
 'target': "PER: Akufo-Addo $$ DATE: 3 weeks $$ LOC: Ghana"
}
```

### Data Fields

- `id`: id of the sample
- `text`: sentence containing entities
- `spans`: the byte offsets (`start_byte`, `limit_byte`) and label of each named entity in the sentence
- `target`: the named entities and their values, with entities separated by `$$`

The NER tags correspond to this list:

```
"PER", "ORG", "LOC", "DATE"
```

### Data Splits

For all languages, there are three splits: `train`, `validation`, and `test`.

The splits have the following sizes:

| Language        | train | validation | test |
|-----------------|------:|-----------:|-----:|
| Amharic         |  1441 |        250 |  500 |
| Ghomala         |  1441 |        483 |  966 |
| Bambara         |  1441 |        638 | 1000 |
| Ewe             |  1441 |        501 | 1000 |
| Hausa           |  1441 |       1000 | 1000 |
| Igbo            |  1441 |        319 |  638 |
| Kinyarwanda     |  1441 |       1000 | 1000 |
| Luganda         |  1441 |        906 | 1000 |
| Luo             |   644 |         92 |  185 |
| Mossi           |  1441 |        648 | 1000 |
| Chichewa        |  1441 |        893 | 1000 |
| Nigerian-Pidgin |  1441 |       1000 | 1000 |
| Shona           |  1441 |        887 | 1000 |
| Swahili         |  1441 |       1000 | 1000 |
| Setswana        |  1441 |        499 |  996 |
| Twi             |  1441 |        605 | 1000 |
| Wolof           |  1441 |        923 | 1000 |
| Xhosa           |  1441 |        817 | 1000 |
| Yoruba          |  1441 |       1000 | 1000 |
| Zulu            |  1441 |        836 | 1000 |

Note that the `dataset_info` metadata of this card lists different validation/test sizes for Hausa (272/545) and Igbo (1000/1000).

### Licensing Information

The licensing status of the data is CC 4.0 Non-Commercial.

### Citation Information

Cite the dataset with the following [BibTeX](http://www.bibtex.org/) entry:

```
@inproceedings{adelani-etal-2022-masakhaner,
    title = "{M}asakha{NER} 2.0: {A}frica-centric Transfer Learning for Named Entity Recognition",
    author = "Adelani, David  and
      Neubig, Graham  and
      Ruder, Sebastian  and
      Rijhwani, Shruti  and
      Beukman, Michael  and
      Palen-Michel, Chester  and
      Lignos, Constantine  and
      Alabi, Jesujoba  and
      Muhammad, Shamsuddeen  and
      Nabende, Peter  and
      Dione, Cheikh M. Bamba  and
      Bukula, Andiswa  and
      Mabuya, Rooweither  and
      Dossou, Bonaventure F. P.  and
      Sibanda, Blessing  and
      Buzaaba, Happy  and
      Mukiibi, Jonathan  and
      Kalipe, Godson  and
      Mbaye, Derguene  and
      Taylor, Amelia  and
      Kabore, Fatoumata  and
      Emezue, Chris Chinenye  and
      Aremu, Anuoluwapo  and
      Ogayo, Perez  and
      Gitau, Catherine  and
      Munkoh-Buabeng, Edwin  and
      Memdjokam Koagne, Victoire  and
      Tapo, Allahsera Auguste  and
      Macucwa, Tebogo  and
      Marivate, Vukosi  and
      Elvis, Mboning Tchiaze  and
      Gwadabe, Tajuddeen  and
      Adewumi, Tosin  and
      Ahia, Orevaoghene  and
      Nakatumba-Nabende, Joyce  and
      Mokono, Neo Lerato  and
      Ezeani, Ignatius  and
      Chukwuneke, Chiamaka  and
      Oluwaseun Adeyemi, Mofetoluwa  and
      Hacheme, Gilles Quentin  and
      Abdulmumin, Idris  and
      Ogundepo, Odunayo  and
      Yousuf, Oreen  and
      Moteu, Tatiana  and
      Klakow, Dietrich",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.298",
    pages = "4488--4508",
    abstract = "African languages are spoken by over a billion people, but they are under-represented in NLP research and development. Multiple challenges exist, including the limited availability of annotated training and evaluation datasets as well as the lack of understanding of which settings, languages, and recently proposed methods like cross-lingual transfer will be effective. In this paper, we aim to move towards solutions for these challenges, focusing on the task of named entity recognition (NER). We present the creation of the largest to-date human-annotated NER dataset for 20 African languages. We study the behaviour of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, empirically demonstrating that the choice of source transfer language significantly affects performance. While much previous work defaults to using English as the source language, our results show that choosing the best transfer language improves zero-shot F1 scores by an average of 14{\%} over 20 languages as compared to using English.",
}
```
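
The `$$`-separated `target` format and the exact-match criterion described under Supported Tasks and Leaderboards can be illustrated with a small scorer. This is a simplified sketch under stated assumptions, not the linked multilingual-t5 metric implementation: the function names `parse_target` and `span_f1` are hypothetical, and entity values are assumed not to contain `$$` themselves.

```python
from collections import Counter


def parse_target(target):
    """Split 'PER: A $$ LOC: B' into [('PER', 'A'), ('LOC', 'B')]."""
    entities = []
    for chunk in target.split("$$"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Split on the first colon only, so values may contain colons.
        label, sep, value = chunk.partition(":")
        if not sep:
            continue  # skip malformed chunks without a label
        entities.append((label.strip(), value.strip()))
    return entities


def span_f1(gold_targets, pred_targets):
    """Micro-averaged F1 where an entity counts only on exact (label, value) match."""
    true_pos = n_gold = n_pred = 0
    for gold, pred in zip(gold_targets, pred_targets):
        gold_counts = Counter(parse_target(gold))
        pred_counts = Counter(parse_target(pred))
        true_pos += sum((gold_counts & pred_counts).values())
        n_gold += sum(gold_counts.values())
        n_pred += sum(pred_counts.values())
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["PER: Akufo-Addo $$ DATE: 3 weeks $$ LOC: Ghana"]
pred = ["PER: Akufo-Addo $$ LOC: Accra"]
print(round(span_f1(gold, pred), 2))
# 0.4
```

Here one of two predicted entities matches one of three gold entities, giving precision 1/2, recall 1/3, and F1 of 0.4.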

Provided by:
masakhane