aya_dataset
Source: ModelScope community. Updated 2026-01-02; indexed 2024-12-21.
Download link: https://modelscope.cn/datasets/CohereForAI/aya_dataset

# Dataset Summary
The `Aya Dataset` is a multilingual instruction fine-tuning dataset curated by an open-science community via the [Aya Annotation Platform](https://aya.for.ai/) from Cohere Labs. The dataset contains a total of 204k human-annotated prompt-completion pairs, along with demographics data for the annotators.<br>
This dataset can be used to train, finetune, and evaluate multilingual LLMs.
- **Curated by:** Contributors of [Aya Open Science Initiative](https://aya.for.ai/).
- **Language(s):** 65 languages (71 including dialects & scripts).
- **License:** [Apache 2.0](https://opensource.org/license/apache-2-0)
- **Aya Datasets Family:**
| Name | Explanation |
|------|--------------|
| [aya_dataset](https://huggingface.co/datasets/CohereLabs/aya_dataset) | Human-annotated multilingual instruction finetuning dataset, comprising over 204K instances across 65 languages. |
| [aya_collection](https://huggingface.co/datasets/CohereLabs/aya_collection) | Created by applying instruction-style templates from fluent speakers to 44 datasets, including translations of 19 instruction-style datasets into 101 languages, providing 513M instances for various tasks.|
| [aya_collection_language_split](https://huggingface.co/datasets/CohereLabs/aya_collection_language_split) | Aya Collection structured based on language level subsets. |
| [aya_evaluation_suite](https://huggingface.co/datasets/CohereLabs/aya_evaluation_suite) | A diverse evaluation set for multilingual open-ended generation, featuring 250 culturally grounded prompts in 7 languages, 200 translated prompts in 24 languages, and human-edited versions selected for cross-cultural relevance from English Dolly in 6 languages.|
| [aya_redteaming](https://huggingface.co/datasets/CohereLabs/aya_redteaming)| A red-teaming dataset consisting of harmful prompts in 8 languages across 9 different categories of harm with explicit labels for "global" and "local" harm.|
# Dataset
The `Aya Dataset` comprises two types of data:
1. **Human Annotations:** Original annotations (brand new prompts and completions written by annotators) and re-annotations (human edits of automatically generated prompts and completions).
2. **Demographics Data:** Anonymized information for each annotator.
## Load with Datasets
To load both the prompt-completion pairs and the demographics data with `datasets`, install the library with `pip install datasets --upgrade`, then use the following code:
```python
from datasets import load_dataset
# Load the annotations dataset
aya_dataset = load_dataset("CohereLabs/aya_dataset")
# Load the demographics dataset
aya_demographics = load_dataset("CohereLabs/aya_dataset", "demographics")
```
## Data Fields
### Human Annotations (Default)
The data fields are the same among all splits:
- `inputs`: Prompt or input to the language model.
- `targets`: Completion or output of the language model.
- `language`: The language of the `inputs` and `targets`.
- `language_code`: The ISO code for the language of the `inputs` and `targets`.
- `annotation_type`: The value denoting whether `inputs` and `targets` are 'original-annotations' or 're-annotations'.
- `user_id`: Unique identifier of the annotator who submitted the prompt-completion pair.
### Demographics Data
The data fields are the same among all splits:
- `user_id`: Unique identifier of the annotator who submitted the prompt-completion pair.
- `age_range`: Age range of the annotator, given as a pair of bounds (e.g. `[ 25, 35 ]`). Values range from 0 to 121.
- `gender`: Gender of the annotator. The values are 'male', 'female', 'prefer not to say', 'non-binary' and 'others'.
- `languages`: List of languages spoken by the annotator.
- `dialects`: Dialects reported by the annotator.
Some empty values may be represented as 'null'.
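
The fields above lend themselves to simple row-level filtering. The following is a minimal sketch, not part of the official card: `select_rows` and the rows in `sample_rows` are hypothetical, invented only for illustration, and the `annotation_type` values are assumed to follow the spelling used in the data instance example ('original-annotations', 're-annotations').

```python
from typing import Iterable, Iterator, Optional


def select_rows(
    rows: Iterable[dict],
    language_code: str,
    annotation_type: Optional[str] = None,
) -> Iterator[dict]:
    """Yield rows in the given language, optionally restricted to one annotation type."""
    for row in rows:
        if row["language_code"] != language_code:
            continue
        if annotation_type is not None and row["annotation_type"] != annotation_type:
            continue
        yield row


# Hypothetical rows shaped like the Human Annotations fields (not real data).
sample_rows = [
    {"inputs": "prompt A", "targets": "completion A", "language": "English",
     "language_code": "eng", "annotation_type": "original-annotations", "user_id": "user-a"},
    {"inputs": "prompt B", "targets": "completion B", "language": "English",
     "language_code": "eng", "annotation_type": "re-annotations", "user_id": "user-b"},
    {"inputs": "prompt C", "targets": "completion C", "language": "Turkish",
     "language_code": "tur", "annotation_type": "original-annotations", "user_id": "user-a"},
]

english_rows = list(select_rows(sample_rows, "eng"))
english_originals = list(select_rows(sample_rows, "eng", "original-annotations"))
print(len(english_rows), len(english_originals))
```

The same per-row condition could be passed as a predicate to `Dataset.filter` from the `datasets` library when working with the loaded dataset rather than plain dictionaries.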
## Data Splits
### Human Annotations (Default)
The following are the splits of the data:
| Split | No. of instances | Language Coverage |
|-------|------------------|-------------------|
| train | 202,364 | All |
| test | 1,750 | 7 ('Standard Arabic', 'Yoruba', 'Turkish', 'English', 'Simplified Chinese', 'Portuguese', 'Telugu')|
### Demographics Data
The following are the splits of the data:
| Split | No. of Instances |
|-------|------------------|
| train | 1,456 |
## Data Instances
### Human Annotations (Default)
An example of `train` looks as follows:
```json
{
"inputs": "What cultural events or festivals add vibrancy to Colombo's calendar...",
"targets": "Colombo's cultural calendar is adorned with diverse events and festivals that celebrate the city's rich tapestry of traditions...",
"language": "English",
"language_code": "eng",
"annotation_type": "original-annotations",
"user_id": "f0ff69570af705b75c5a0851883e..."
}
```
### Demographics Data
An example of `train` looks as follows:
```json
{
"user_id": "f0ff69570af705b75c5a0851883e...",
"age_range": [ 25, 35 ],
"gender": "female",
"languages": [ "English", "Hausa" ],
"dialects": [ "Hausa" ]
}
```
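
Because both record types carry a `user_id`, annotations can in principle be enriched with annotator demographics. The sketch below assumes that `user_id` values match across the two configurations, as the examples above suggest; the helper `attach_demographics` and the short IDs are hypothetical and chosen only for illustration.

```python
def attach_demographics(annotations, demographics):
    """Return annotation rows extended with the annotator's gender and age range.

    Rows whose user_id has no demographics record get None for both fields.
    """
    by_user = {record["user_id"]: record for record in demographics}
    joined = []
    for row in annotations:
        profile = by_user.get(row["user_id"])  # None when the annotator is unknown
        joined.append({
            **row,
            "gender": profile["gender"] if profile else None,
            "age_range": profile["age_range"] if profile else None,
        })
    return joined


# Hypothetical records shaped like the two examples above (not real data).
annotations = [
    {"inputs": "prompt A", "targets": "completion A", "language": "English",
     "language_code": "eng", "annotation_type": "original-annotations", "user_id": "user-a"},
    {"inputs": "prompt B", "targets": "completion B", "language": "Hausa",
     "language_code": "hau", "annotation_type": "re-annotations", "user_id": "user-z"},
]
demographics = [
    {"user_id": "user-a", "age_range": [25, 35], "gender": "female",
     "languages": ["English", "Hausa"], "dialects": ["Hausa"]},
]

joined = attach_demographics(annotations, demographics)
print(joined[0]["gender"], joined[1]["gender"])
```

Building the lookup dictionary once keeps the join linear in the number of rows, which matters more for the full training split than for this toy input.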
## Statistics
### Annotation Types
The following is the breakdown of original annotations and re-annotations in the final dataset.
| Type of Annotation | Instances |
|--------------------|-----------|
| Original Annotations | 138,844 |
| Re-Annotations | 65,270 |
| Total | 204,114|
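
As a quick consistency check, the split sizes and the annotation-type counts reported in this card should add up to the same total. The sketch below uses only the numbers from the tables above; the variable names are illustrative.

```python
# Counts copied from the Data Splits and Annotation Types tables.
split_sizes = {"train": 202_364, "test": 1_750}
annotation_counts = {"original_annotations": 138_844, "re_annotations": 65_270}

total_from_splits = sum(split_sizes.values())
total_from_types = sum(annotation_counts.values())

# Both totals should equal the 204,114 instances reported for the dataset.
print(total_from_splits, total_from_types)  # 204114 204114

# Share of original annotations in the full dataset.
original_share = annotation_counts["original_annotations"] / total_from_types
print(f"{original_share:.2%}")
```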
### Languages
The dataset covers 65 languages, or 71 when dialects and scripts are counted separately; the resource breakdown (28 high-resource, 12 mid-resource, and 31 low-resource) sums to 71. The following table details the languages, dialects, and scripts included in the dataset.
<details>
<summary> Languages Info </summary>
| ISO Code | Language | Resources |
|----------|----------|-----------|
| `amh` | Amharic | Low |
| `arb`, `ary`, `ars`, `acq`, `arz` & `apc` | Arabic (Standard, Moroccan, Najdi, Ta'izzi-Adeni, Egyptian & South Levantine) | High |
| `ben` | Bengali | Mid |
| `ceb` | Cebuano | Mid |
| `dan` | Danish | Mid |
| `deu` | German | High |
| `ell` | Greek | Mid |
| `eng` | English | High |
| `eus` | Basque | High |
| `fil` | Filipino | Mid |
| `fin` | Finnish | Mid |
| `fra` | French | High |
| `gle` | Irish | Low |
| `guj` | Gujarati | Low |
| `hat` | Haitian Creole | Low |
| `hau` | Hausa | Low |
| `hin` | Hindi | High |
| `hun` | Hungarian | High |
| `ibo` | Igbo | Low |
| `ind` | Indonesian | Mid |
| `ita` | Italian | High |
| `jav` | Javanese | Low |
| `jpn` | Japanese | High |
| `kan` | Kannada | Low |
| `kir` | Kyrgyz | Low |
| `kor` | Korean | Mid |
| `kur` | Kurdish | Low |
| `lit` | Lithuanian | Mid |
| `mal` | Malayalam | Low |
| `mar` | Marathi | Low |
| `mlg` | Malagasy | Low |
| `msa` | Malay | Mid |
| `mya` | Burmese | Low |
| `nep` | Nepali | Low |
| `nld` | Dutch | High |
| `nso` | Northern Sotho | Low |
| `nya` | Chichewa | Low |
| `pan` | Punjabi | Low |
| `pes` | Persian | High |
| `pol` | Polish | High |
| `por` | Portuguese | High |
| `pus` | Pashto | Low |
| `rus` | Russian | High |
| `sin` | Sinhala | Low |
| `sna` | Shona | Low |
| `snd` | Sindhi | Low |
| `som` | Somali | Low |
| `spa` | Spanish | High |
| `sqi` | Albanian | Low |
| `srp` | Serbian | High |
| `sun` | Sundanese | Low |
| `swa` | Swahili | Low |
| `swe` | Swedish | High |
| `tam` | Tamil | Mid |
| `tel` | Telugu | Low |
| `tha` | Thai | Mid |
| `tur` | Turkish | High |
| `ukr` | Ukrainian | Mid |
| `urd` | Urdu | Mid |
| `vie` | Vietnamese | High |
| `wol` | Wolof | Low |
| `xho` | Xhosa | Low |
| `yor` | Yorùbá | Low |
| `zho` | Chinese (Traditional & Simplified) | High |
| `zul` | Zulu | Low |
</details>
<br>
# Motivations & Intentions
- **Curation Rationale:** The curation effort employed an open-science approach to create a diverse instruction-style dataset through annotators across the globe that ensures comprehensive representation across all languages. The success of the curation effort, led by volunteers across diverse backgrounds, was significantly influenced by their hope to meaningfully bring NLP advancements to their languages.
# Known Limitations
- **Language and dialect coverage:** The dataset covers a limited fraction of the world's linguistic diversity, with 93% of languages not represented. It also faces challenges in distinguishing between languages and dialects, lacks coverage for many regional dialects, and excludes programming languages.
- **Uneven distribution of contributions:** Annotation activity is unevenly distributed across contributors, with a 'long tail' of annotators making only one or two contributions. This can lead to dataset imbalances across languages and a lack of diversity within certain language annotations.
- **Cultural and Personal Bias:** In the dataset, certain languages have limited representation due to a few dominant annotators, potentially leading to a narrow viewpoint and skewed distribution of content, particularly towards certain domains like news.
- **Gendered Pronouns:** Many of the languages in the Aya Dataset either contain only explicitly gendered pronouns (e.g., Arabic) or lack gender-neutral third-person pronouns for gender-neutral reference (e.g., Estonian).
- **Formality Distinctions:** The dataset encompasses languages with diverse formality distinctions, involving honorifics and situational choices in pronoun use, reflecting varying levels of standardization influenced by regional, cultural, and identity factors.
- **Toxic or Offensive Speech:** The Aya Annotation Platform lacked specific flags for toxic speech, relying on human verification and peer review to mitigate offensive content, but there's no guarantee that all potentially offensive data points were removed during the annotation process.
- **Accounting for mislabeled data:** The Aya Annotation Platform lacks re-labeling capabilities, leading to potential mislabeled data in the Aya Dataset, including instances of incorrect language assignments and non-compliance with instruction-style formatting.
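
The uneven distribution of contributions described above can be inspected by counting rows per `user_id`. The following is a minimal sketch under stated assumptions: `contribution_profile` is a hypothetical helper, the threshold of two contributions mirrors the "one or two contributions" wording, and the rows are invented for illustration.

```python
from collections import Counter


def contribution_profile(rows, small=2):
    """Summarize how annotation rows are distributed across annotators.

    `rows` is a list of dicts with a `user_id` key; `small` is the largest
    contribution count that still places an annotator in the long tail.
    """
    per_user = Counter(row["user_id"] for row in rows)
    top_user, top_count = per_user.most_common(1)[0]
    return {
        "annotators": len(per_user),
        "small_contributors": sum(1 for n in per_user.values() if n <= small),
        "top_user": top_user,
        "top_share": top_count / len(rows),
    }


# Hypothetical rows: one dominant annotator and three occasional ones.
rows = [{"user_id": uid} for uid in ["a"] * 6 + ["b", "c", "c", "d"]]
profile = contribution_profile(rows)
print(profile)
```

Running such a summary per language, for example after grouping rows by `language_code`, would show where a few dominant annotators account for most of a language's data.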
# Additional Information
## Provenance
- **Methods Used:** Crowd-sourced through volunteer annotations, followed by a quality assessment phase in which samples from the dataset were checked.
- **Methodology Details:**
  - *Source:* Original annotations and edits of open-source NLP datasets
  - *Platform:* [Aya Annotation Platform](https://aya.for.ai/)
  - *Dates of Collection:* May 2023 - Dec 2023
## Dataset Version and Maintenance
- **Maintenance Status:** Actively Maintained
- **Version Details:**
  - *Current version:* 1.0
  - *Last Update:* 02/2024
  - *First Release:* 02/2024
- **Maintenance Plan:** Updates will be periodically made available based on volunteer contributions.
## Authorship
- **Publishing Organization:** [Cohere Labs](https://cohere.com/research)
- **Industry Type:** Not-for-profit - Tech
- **Contact Details:** https://aya.for.ai/
## Licensing Information
This dataset can be used for any purpose, whether academic or commercial, under the terms of the [Apache 2.0](https://opensource.org/license/apache-2-0) License.
## Citation Information
```bibtex
@misc{singh2024aya,
title={Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning},
author={Shivalika Singh and Freddie Vargus and Daniel Dsouza and Börje F. Karlsson and Abinaya Mahendiran and Wei-Yin Ko and Herumb Shandilya and Jay Patel and Deividas Mataciunas and Laura OMahony and Mike Zhang and Ramith Hettiarachchi and Joseph Wilson and Marina Machado and Luisa Souza Moura and Dominik Krzemiński and Hakimeh Fadaei and Irem Ergün and Ifeoma Okoh and Aisha Alaagib and Oshan Mudannayake and Zaid Alyafeai and Vu Minh Chien and Sebastian Ruder and Surya Guthikonda and Emad A. Alghamdi and Sebastian Gehrmann and Niklas Muennighoff and Max Bartolo and Julia Kreutzer and Ahmet Üstün and Marzieh Fadaee and Sara Hooker},
year={2024},
eprint={2402.06619},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

Provider: maas
Created: 2024-12-15



