wmt22_african
ModelScope community, updated 2025-12-05, indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/allenai/wmt22_african
Resource description:
# Dataset Card for allenai/wmt22_african
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** https://www.statmt.org/wmt22/large-scale-multilingual-translation-task.html
- **Repository:** [Needs More Information]
- **Paper:** [Needs More Information]
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]
### Dataset Summary
This dataset was created from the [metadata](https://github.com/facebookresearch/LASER/tree/main/data/wmt22_african) for mined bitext released by Meta AI. It contains bitext for 248 language pairs covering the African languages that are part of the [2022 WMT Shared Task on Large Scale Machine Translation Evaluation for African Languages](https://www.statmt.org/wmt22/large-scale-multilingual-translation-task.html).
#### How to use the data
There are two ways to access the data:
* Via the Hugging Face Python datasets library
```
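# Requires the Hugging Face `datasets` package.
from datasets import load_dataset

# Downloads the dataset from the Hugging Face Hub on first use.
dataset = load_dataset("allenai/wmt22_african")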
```
* Clone the git repo
```
git lfs install
git clone https://huggingface.co/datasets/allenai/wmt22_african
```
### Supported Tasks and Leaderboards
This dataset is one of the resources allowed under the Constrained Track of the [2022 WMT Shared Task on Large Scale Machine Translation Evaluation for African Languages](https://www.statmt.org/wmt22/large-scale-multilingual-translation-task.html).
### Languages
#### Focus languages
| Language | Code |
| -------- | ---- |
| Afrikaans | afr |
| Amharic | amh |
| Chichewa | nya |
| Nigerian Fulfulde | fuv |
| Hausa | hau |
| Igbo | ibo |
| Kamba | kam |
| Kinyarwanda | kin |
| Lingala | lin |
| Luganda | lug |
| Luo | luo |
| Northern Sotho | nso |
| Oromo | orm |
| Shona | sna |
| Somali | som |
| Swahili | swh |
| Swati | ssw |
| Tswana | tsn |
| Umbundu | umb |
| Wolof | wol |
| Xhosa | xho |
| Xitsonga | tso |
| Yoruba | yor |
| Zulu | zul |
Colonial linguae francae: English - eng, French - fra
## Dataset Structure
The dataset contains one gzipped, tab-delimited text file per translation direction. Each line of a file holds one pair of parallel sentences.
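For readers working from a cloned copy of the repository, the following is a minimal sketch of reading one such file with the Python standard library. The card does not document the column order or the file names, so the function simply yields the raw tab-separated columns of each line; `read_bitext` and the path passed to it are hypothetical names used only for illustration.

```python
import gzip
from typing import Iterator, List


def read_bitext(path: str) -> Iterator[List[str]]:
    """Yield the tab-separated columns of each non-empty line in a gzipped text file.

    The meaning and order of the columns is an assumption the caller must
    verify against the actual files; the dataset card does not specify it.
    """
    # Mode "rt" decompresses on the fly and decodes bytes to str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            yield line.split("\t")
```

Because the function is a generator, a large file can be streamed line by line without loading it fully into memory.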
### Data Instances
The dataset contains 248 language pairs.
Sentence counts for each pair can be found [here](https://huggingface.co/datasets/allenai/wmt22_african/blob/main/sentence_counts.txt).
### Data Fields
Every instance for a language pair contains the following fields: `translation` (a dictionary holding the sentence pair, keyed by language code), `laser_score`, `source_sentence_lid`, and `target_sentence_lid`, where `lid` is the language-identification probability.
Example:
```
{
'translation':
{
'afr': 'In Mei 2007, in ooreenstemming met die spesifikasies van die Java Gemeenskapproses, het Sun Java tegnologie geherlisensieer onder die GNU General Public License.',
'eng': 'As of May 2007, in compliance with the specifications of the Java Community Process, Sun relicensed most of its Java technologies under the GNU General Public License.'
},
'laser_score': 1.0717015266418457,
'source_sentence_lid': 0.9996600151062012,
'target_sentence_lid': 0.9972000122070312
}
```
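Since every instance carries a mining score and two language-identification probabilities, one possible use of these fields is to drop low-confidence pairs before training. The sketch below assumes records shaped like the example above; the function name `passes_filter` and both default thresholds are hypothetical, and the card itself recommends no particular cut-off.

```python
from typing import Any, Dict


def passes_filter(
    record: Dict[str, Any],
    min_laser_score: float = 1.06,  # hypothetical threshold, not from the card
    min_lid: float = 0.9,           # hypothetical threshold, not from the card
) -> bool:
    """Return True if the record meets all three confidence thresholds."""
    return (
        record["laser_score"] >= min_laser_score
        and record["source_sentence_lid"] >= min_lid
        and record["target_sentence_lid"] >= min_lid
    )


# The example instance from the card, with the sentences shortened.
example = {
    "translation": {"afr": "In Mei 2007, ...", "eng": "As of May 2007, ..."},
    "laser_score": 1.0717015266418457,
    "source_sentence_lid": 0.9996600151062012,
    "target_sentence_lid": 0.9972000122070312,
}

kept = passes_filter(example)
```

Suitable thresholds depend on the language pair and on how much noise the downstream system tolerates, so they are best tuned on held-out data.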
### Data Splits
The data is not split into train, dev, and test.
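Users who need held-out data therefore have to create their own partition. A minimal sketch of one way to do this deterministically is shown below: each sentence pair is hashed and assigned to a split by the hash value, so the assignment stays stable across runs and machines. The function name `assign_split` and the 1% dev and 1% test proportions are assumptions made for illustration, not part of the dataset.

```python
import hashlib


def assign_split(source: str, target: str, dev_pct: int = 1, test_pct: int = 1) -> str:
    """Map a sentence pair to "train", "dev", or "test" using a stable hash.

    dev_pct and test_pct are percentages (0-100) and are hypothetical defaults.
    """
    key = (source + "\t" + target).encode("utf-8")
    # Take the first 8 bytes of the SHA-256 digest as an integer in [0, 2**64).
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"
```

Hashing the text rather than sampling at random means that an exact duplicate pair always lands in the same split, which avoids one common source of train and test overlap.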
## Dataset Creation
### Curation Rationale
Parallel sentences from monolingual data in Common Crawl and ParaCrawl were identified via [Language-Agnostic Sentence Representation (LASER)](https://github.com/facebookresearch/LASER) encoders.
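The card does not say how candidate pairs were scored. One criterion commonly used with multilingual sentence embeddings such as LASER is the ratio margin, which divides the cosine similarity of a candidate pair by the average similarity of each sentence to its k nearest neighbours in the other language. The sketch below illustrates that general technique on plain Python lists; it is an assumption that this dataset's `laser_score` was computed this way, and the function names and the default `k` are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mean_top_k(query: Vector, candidates: Sequence[Vector], k: int) -> float:
    """Average cosine similarity between query and its k most similar candidates."""
    top = sorted((cosine(query, c) for c in candidates), reverse=True)[:k]
    return sum(top) / len(top)


def ratio_margin(
    src: Vector,
    tgt: Vector,
    src_pool: Sequence[Vector],
    tgt_pool: Sequence[Vector],
    k: int = 4,
) -> float:
    """Ratio-margin score of a candidate pair (src, tgt).

    src is compared with its neighbours among target-language embeddings,
    and tgt with its neighbours among source-language embeddings.
    """
    denominator = (mean_top_k(src, tgt_pool, k) + mean_top_k(tgt, src_pool, k)) / 2
    return cosine(src, tgt) / denominator
```

Under this criterion a score above 1 means the pair is more similar to each other than either sentence is, on average, to its nearest neighbours.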
### Source Data
#### Initial Data Collection and Normalization
Monolingual data was obtained from Common Crawl and ParaCrawl.
#### Who are the source language producers?
Contributors to web text in Common Crawl and ParaCrawl.
### Annotations
#### Annotation process
The data was not human annotated. The metadata used to create the dataset can be found here: https://github.com/facebookresearch/LASER/tree/main/data/wmt22_african
#### Who are the annotators?
The data was not human annotated. Parallel text in the Common Crawl and ParaCrawl monolingual data was identified automatically via [LASER](https://github.com/facebookresearch/LASER) encoders.
### Personal and Sensitive Information
[Needs More Information]
## Considerations for Using the Data
### Social Impact of Dataset
This dataset provides data for training machine learning systems for many languages that have low resources available for NLP.
### Discussion of Biases
Biases in the data have not been studied.
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
The dataset is released under the terms of [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). By using this, you are also bound by the Internet Archive [Terms of Use](https://archive.org/about/terms.php) in respect of the content contained in the dataset.
### Citation Information
NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022.
### Contributions
We thank the AllenNLP team at AI2 for hosting and releasing this data, including [Akshita Bhagia](https://akshitab.github.io/) (for the engineering work to create the Hugging Face dataset) and [Jesse Dodge](https://jessedodge.github.io/) (for organizing the connection).
Provider:
maas
Created:
2025-05-27



