masakhane/masakhaner-x
Source: Hugging Face. Updated 2024-11-21; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/masakhane/masakhaner-x
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- am
- bbj
- bm
- ee
- ha
- ig
- lg
- luo
- mos
- ny
- pcm
- rw
- sn
- sw
- tn
- tw
- wo
- xh
- yo
- zu
license:
- unknown
multilinguality:
- multilingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: MasakhaNER-X
dataset_info:
- config_name: am
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 250
  - name: test
    num_examples: 500
- config_name: bbj
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 483
  - name: test
    num_examples: 966
- config_name: bm
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 638
  - name: test
    num_examples: 1000
- config_name: ee
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 501
  - name: test
    num_examples: 1000
- config_name: ha
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 272
  - name: test
    num_examples: 545
- config_name: ig
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 1000
  - name: test
    num_examples: 1000
- config_name: lg
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 906
  - name: test
    num_examples: 1000
- config_name: luo
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 644
  - name: validation
    num_examples: 92
  - name: test
    num_examples: 185
- config_name: mos
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 648
  - name: test
    num_examples: 1000
- config_name: ny
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 893
  - name: test
    num_examples: 1000
- config_name: pcm
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 1000
  - name: test
    num_examples: 1000
- config_name: rw
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 1000
  - name: test
    num_examples: 1000
- config_name: sn
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 887
  - name: test
    num_examples: 1000
- config_name: sw
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 1000
  - name: test
    num_examples: 1000
- config_name: tn
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 499
  - name: test
    num_examples: 996
- config_name: tw
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 605
  - name: test
    num_examples: 1000
- config_name: wo
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 923
  - name: test
    num_examples: 1000
- config_name: xh
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 817
  - name: test
    num_examples: 1000
- config_name: yo
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 1000
  - name: test
    num_examples: 1000
- config_name: zu
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: spans
    sequence: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_examples: 1441
  - name: validation
    num_examples: 836
  - name: test
    num_examples: 1000
config_names:
- am
- bbj
- bm
- ee
- ha
- ig
- lg
- luo
- mos
- ny
- pcm
- rw
- sn
- sw
- tn
- tw
- wo
- xh
- yo
- zu
---
# Dataset Card for MasakhaNER-X
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [homepage](https://github.com/masakhane-io/masakhane-ner/tree/main/xtreme-up/MasakhaNER-X)
- **Repository:** [github](https://github.com/masakhane-io/masakhane-ner/tree/main/xtreme-up/MasakhaNER-X)
- **Paper:** [paper](https://aclanthology.org/2022.emnlp-main.298)
- **Point of Contact:** [Masakhane](https://www.masakhane.io/) or didelani@lsv.uni-saarland.de
### Dataset Summary
MasakhaNER-X is an aggregation of MasakhaNER 1.0 and MasakhaNER 2.0 datasets for 20 African languages. The dataset is not in CoNLL format. The input is the original raw text while the output is byte-level span annotations.
Example:
{"example_id": "test-00015916", "language": "pcm", "text": "By Bashir Ibrahim Hassan", "spans": [{"start_byte": 3, "limit_byte": 24, "label": "PER"}], "target": "PER: Bashir Ibrahim Hassan"}
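Because the annotations are byte offsets rather than character offsets, slicing the Python string directly can give the wrong result for text containing non-ASCII characters. The sketch below shows one way to recover the entity string and the equivalent character offsets. It assumes the offsets refer to the UTF-8 encoding of `text` and that `limit_byte` is exclusive, which is consistent with the example above but is not stated in the card; the helper names `span_text` and `byte_to_char_offsets` are illustrative and not part of the dataset.

```python
from typing import Tuple


def span_text(text: str, start_byte: int, limit_byte: int, encoding: str = "utf-8") -> str:
    """Return the entity string for the half-open byte range [start_byte, limit_byte)."""
    # Assumption: offsets index into the UTF-8 encoding of the text.
    return text.encode(encoding)[start_byte:limit_byte].decode(encoding)


def byte_to_char_offsets(
    text: str, start_byte: int, limit_byte: int, encoding: str = "utf-8"
) -> Tuple[int, int]:
    """Convert a byte range into the equivalent character (code point) range."""
    raw = text.encode(encoding)
    start_char = len(raw[:start_byte].decode(encoding))
    limit_char = len(raw[:limit_byte].decode(encoding))
    return start_char, limit_char


example = {
    "text": "By Bashir Ibrahim Hassan",
    "spans": [{"start_byte": 3, "limit_byte": 24, "label": "PER"}],
}
span = example["spans"][0]
entity = span_text(example["text"], span["start_byte"], span["limit_byte"])
print(f'{span["label"]}: {entity}')
```

For pure-ASCII text such as this example, byte and character offsets coincide; they diverge as soon as the sentence contains multi-byte characters, which is common in several of the covered languages.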
MasakhaNER-X is a named entity dataset consisting of PER, ORG, LOC, and DATE entities annotated by Masakhane for twenty African languages:
- Amharic
- Ghomala
- Bambara
- Ewe
- Hausa
- Igbo
- Kinyarwanda
- Luganda
- Luo
- Mossi
- Chichewa
- Nigerian-Pidgin
- chiShona
- Swahili
- Setswana
- Twi
- Wolof
- Xhosa
- Yoruba
- Zulu
Train, validation, and test sets are available for all twenty languages.
For more details see https://aclanthology.org/2022.emnlp-main.298
### Supported Tasks and Leaderboards
- `named-entity-recognition`: The performance in this task is measured with [Span F1](https://github.com/google-research/multilingual-t5/blob/9dcd60fc43c31a8651461f9a21894a134ba22166/multilingual_t5/evaluation/metrics.py#L123) (higher is better). A named entity is correct only if it is an exact match of the corresponding entity in the data.
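As a rough illustration of exact-match scoring, the sketch below computes a micro-averaged span F1 over `target`-style strings. It is not the linked implementation: the function names, the parsing of `$$`-separated `LABEL: text` chunks, and the 0 to 1 score range are assumptions made for this example.

```python
from collections import Counter
from typing import List, Tuple


def parse_target(target: str, delimiter: str = "$$") -> List[Tuple[str, str]]:
    """Split a string such as 'PER: A $$ LOC: B' into (label, text) pairs."""
    pairs = []
    for chunk in target.split(delimiter):
        chunk = chunk.strip()
        if not chunk or ":" not in chunk:
            continue  # skip empty or malformed chunks
        label, value = chunk.split(":", 1)
        pairs.append((label.strip(), value.strip()))
    return pairs


def span_f1(gold_targets: List[str], predicted_targets: List[str]) -> float:
    """Micro-averaged F1 where an entity counts only on an exact (label, text) match."""
    true_pos = n_gold = n_pred = 0
    for gold, pred in zip(gold_targets, predicted_targets):
        gold_counts = Counter(parse_target(gold))
        pred_counts = Counter(parse_target(pred))
        # Multiset intersection: each gold entity can be matched at most once.
        true_pos += sum((gold_counts & pred_counts).values())
        n_gold += sum(gold_counts.values())
        n_pred += sum(pred_counts.values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


gold = ["PER: Akufo-Addo $$ DATE: 3 weeks $$ LOC: Ghana"]
pred = ["PER: Akufo-Addo $$ LOC: Ghana $$ ORG: Prez"]
score = span_f1(gold, pred)
print(round(score, 3))
```

Here two of the three predicted entities match the gold annotation exactly, so precision and recall are both 2/3.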
### Languages
Twenty languages are available:
- Amharic (am)
- Ghomala (bbj)
- Bambara (bm)
- Ewe (ee)
- Hausa (ha)
- Igbo (ig)
- Kinyarwanda (rw)
- Luganda (lg)
- Luo (luo)
- Mossi (mos)
- Chichewa (ny)
- Nigerian-Pidgin (pcm)
- chiShona (sn)
- Swahili (sw)
- Setswana (tn)
- Twi (tw)
- Wolof (wo)
- Xhosa (xh)
- Yoruba (yo)
- Zulu (zu)
## Dataset Structure
### Data Instances
The examples look like this for Nigerian-Pidgin:
```
from datasets import load_dataset

# Pass the language code as the configuration name, e.g. 'pcm' for Nigerian-Pidgin.
data = load_dataset('masakhane/masakhaner-x', 'pcm')

# A data point consists of the raw sentence, its byte-level entity spans,
# and a target string listing the entities:
{'id': '0',
 'text': "Most of de people who dey opposed to Prez Akufo-Addo en decision say within 3 weeks of lockdown, total number of cases for Ghana rise from around 100 catch 1024.",
 'spans': [{"start_byte": 42, "limit_byte": 52, "label": "PER"}, {"start_byte": 76, "limit_byte": 83, "label": "DATE"}, {"start_byte": 123, "limit_byte": 128, "label": "LOC"}],
 'target': "PER: Akufo-Addo $$ DATE: 3 weeks $$ LOC: Ghana"
}
```
### Data Fields
- `id`: identifier of the sample
- `text`: the sentence containing the entities
- `spans`: one entry per named entity in the sentence, giving its start byte, limit byte, and label
- `target`: the named entities and their text, formatted as `LABEL: text` and separated by `$$`
The NER tags correspond to this list:
```
"PER", "ORG", "LOC", "DATE"
```
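The `target` field can be derived from `text` and `spans`. The following sketch rebuilds it for the Nigerian-Pidgin instance shown earlier. It assumes UTF-8 byte offsets with an exclusive `limit_byte`, and it accepts spans either as dictionaries (as displayed in the example) or as JSON-encoded strings (as the `sequence: string` feature type suggests they may be stored); `build_target` is a hypothetical helper, not part of the dataset tooling.

```python
import json
from typing import Dict, List, Union

Span = Dict[str, Union[int, str]]


def build_target(text: str, spans: List[Union[Span, str]], delimiter: str = " $$ ") -> str:
    """Rebuild the 'LABEL: text $$ LABEL: text' target string from byte-level spans."""
    raw = text.encode("utf-8")  # assumption: offsets index into the UTF-8 bytes
    parts = []
    for span in spans:
        if isinstance(span, str):
            span = json.loads(span)  # spans may be stored as JSON strings
        value = raw[span["start_byte"]:span["limit_byte"]].decode("utf-8")
        parts.append(f'{span["label"]}: {value}')
    return delimiter.join(parts)


text = (
    "Most of de people who dey opposed to Prez Akufo-Addo en decision say "
    "within 3 weeks of lockdown, total number of cases for Ghana rise from "
    "around 100 catch 1024."
)
spans = [
    {"start_byte": 42, "limit_byte": 52, "label": "PER"},
    {"start_byte": 76, "limit_byte": 83, "label": "DATE"},
    {"start_byte": 123, "limit_byte": 128, "label": "LOC"},
]
target = build_target(text, spans)
print(target)
```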
### Data Splits
Every language has three splits: `train`, `validation`, and `test`.
The splits have the following sizes:
| Language        | train | validation | test |
|-----------------|------:|-----------:|-----:|
| Amharic         |  1441 |        250 |  500 |
| Ghomala         |  1441 |        483 |  966 |
| Bambara         |  1441 |        638 | 1000 |
| Ewe             |  1441 |        501 | 1000 |
| Hausa           |  1441 |       1000 | 1000 |
| Igbo            |  1441 |        319 |  638 |
| Kinyarwanda     |  1441 |       1000 | 1000 |
| Luganda         |  1441 |        906 | 1000 |
| Luo             |   644 |         92 |  185 |
| Mossi           |  1441 |        648 | 1000 |
| Chichewa        |  1441 |        893 | 1000 |
| Nigerian-Pidgin |  1441 |       1000 | 1000 |
| Shona           |  1441 |        887 | 1000 |
| Swahili         |  1441 |       1000 | 1000 |
| Setswana        |  1441 |        499 |  996 |
| Twi             |  1441 |        605 | 1000 |
| Wolof           |  1441 |        923 | 1000 |
| Xhosa           |  1441 |        817 | 1000 |
| Yoruba          |  1441 |       1000 | 1000 |
| Zulu            |  1441 |        836 | 1000 |
### Licensing Information
The data is licensed under CC 4.0 Non-Commercial.
### Citation Information
The [BibTeX](http://www.bibtex.org/)-formatted reference for the dataset:
```
@inproceedings{adelani-etal-2022-masakhaner,
title = "{M}asakha{NER} 2.0: {A}frica-centric Transfer Learning for Named Entity Recognition",
author = "Adelani, David and
Neubig, Graham and
Ruder, Sebastian and
Rijhwani, Shruti and
Beukman, Michael and
Palen-Michel, Chester and
Lignos, Constantine and
Alabi, Jesujoba and
Muhammad, Shamsuddeen and
Nabende, Peter and
Dione, Cheikh M. Bamba and
Bukula, Andiswa and
Mabuya, Rooweither and
Dossou, Bonaventure F. P. and
Sibanda, Blessing and
Buzaaba, Happy and
Mukiibi, Jonathan and
Kalipe, Godson and
Mbaye, Derguene and
Taylor, Amelia and
Kabore, Fatoumata and
Emezue, Chris Chinenye and
Aremu, Anuoluwapo and
Ogayo, Perez and
Gitau, Catherine and
Munkoh-Buabeng, Edwin and
Memdjokam Koagne, Victoire and
Tapo, Allahsera Auguste and
Macucwa, Tebogo and
Marivate, Vukosi and
Elvis, Mboning Tchiaze and
Gwadabe, Tajuddeen and
Adewumi, Tosin and
Ahia, Orevaoghene and
Nakatumba-Nabende, Joyce and
Mokono, Neo Lerato and
Ezeani, Ignatius and
Chukwuneke, Chiamaka and
Oluwaseun Adeyemi, Mofetoluwa and
Hacheme, Gilles Quentin and
Abdulmumin, Idris and
Ogundepo, Odunayo and
Yousuf, Oreen and
Moteu, Tatiana and
Klakow, Dietrich",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.298",
pages = "4488--4508",
abstract = "African languages are spoken by over a billion people, but they are under-represented in NLP research and development. Multiple challenges exist, including the limited availability of annotated training and evaluation datasets as well as the lack of understanding of which settings, languages, and recently proposed methods like cross-lingual transfer will be effective. In this paper, we aim to move towards solutions for these challenges, focusing on the task of named entity recognition (NER). We present the creation of the largest to-date human-annotated NER dataset for 20 African languages. We study the behaviour of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, empirically demonstrating that the choice of source transfer language significantly affects performance. While much previous work defaults to using English as the source language, our results show that choosing the best transfer language improves zero-shot F1 scores by an average of 14{\%} over 20 languages as compared to using English.",
}
```
Provider:
masakhane



