covost2
Source: ModelScope community · Updated 2026-05-18 · Indexed 2025-03-01
Download link: https://modelscope.cn/datasets/pengzhendong/covost2
# Dataset Card for covost2
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/facebookresearch/covost
- **Repository:** https://github.com/facebookresearch/covost
- **Paper:** https://arxiv.org/abs/2007.10310
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** Changhan Wang (changhan@fb.com), Juan Miguel Pino (juancarabina@fb.com), Jiatao Gu (jgu@fb.com)
### Dataset Summary
CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English
and from English into 15 languages. The dataset is created using Mozilla's open-source Common Voice database of
crowdsourced voice recordings. The corpus contains 2,900 hours of speech.
### Supported Tasks and Leaderboards
`speech-translation`: The dataset can be used for speech-to-text translation (ST). The model is presented with an audio file in one language and asked to translate its spoken content into written text in another language. The most common evaluation metric is the BLEU score. Examples can be found in the fairseq documentation: https://github.com/pytorch/fairseq/blob/master/examples/speech_to_text/docs/covost_example.md
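To make the BLEU metric mentioned above concrete, here is a minimal sketch of a corpus-level BLEU computation (clipped n-gram precision up to 4-grams plus a brevity penalty). It is a simplification and not part of the dataset or of fairseq: it assumes a single reference per hypothesis, plain whitespace tokenization, and no smoothing. The function names `ngrams` and `corpus_bleu` are hypothetical, and published scores are normally computed with a standard tool such as sacreBLEU, whose tokenization differs.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (one reference per hypothesis)."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each match to the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


reference = ["Wenn Wasser knapp ist, verschwenden Sie es nicht."]
exact = corpus_bleu(reference, reference)
partial = corpus_bleu(["Wenn Wasser knapp ist, verschwenden Sie es."], reference)
print(exact)    # 100.0 for an exact match
print(partial)  # lower, because the hypothesis is shorter and misses n-grams
```

The reference sentence is the `translation` value from the data instance shown in this card; the shortened hypothesis is made up for illustration.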
### Languages
The dataset contains the audio, transcriptions, and translations in the following languages: French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian, Chinese, Welsh, Catalan, Slovenian, Estonian, Indonesian, Arabic, Tamil, Portuguese, Latvian, and Japanese.
## Dataset Structure
### Data Instances
A typical data point comprises the path to the audio file (`file`), its transcription in the source language (`sentence`), and the translation into the target language (`translation`).
```
{'client_id': 'd277a1f3904ae00b09b73122b87674e7c2c78e08120721f37b5577013ead08d1ea0c053ca5b5c2fb948df2c81f27179aef2c741057a17249205d251a8fe0e658',
'file': '/home/suraj/projects/fairseq_s2t/covst/dataset/en/clips/common_voice_en_18540003.mp3',
'audio': {'path': '/home/suraj/projects/fairseq_s2t/covst/dataset/en/clips/common_voice_en_18540003.mp3',
'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
'sampling_rate': 48000},
'id': 'common_voice_en_18540003',
'sentence': 'When water is scarce, avoid wasting it.',
'translation': 'Wenn Wasser knapp ist, verschwenden Sie es nicht.'}
```
### Data Fields
- file: A path to the downloaded audio file in .mp3 format.
- audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. When the audio column is accessed, e.g. `dataset[0]["audio"]`, the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files can take a significant amount of time, so query the sample index before the `"audio"` column: `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
- sentence: The transcription of the audio file in the source language.
- translation: The translation of the transcription into the target language.
- id: unique id of the data sample.
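The access-order advice for the `audio` field can be illustrated with a toy model. The sketch below does not use the real `datasets` library; `LazyAudioDataset` is a hypothetical stand-in that decodes audio on access in the way this card describes, and it simply counts how many files each access pattern would decode. Actual behaviour may differ between library versions.

```python
class LazyAudioDataset:
    """Toy stand-in (hypothetical) for a dataset that decodes audio on access."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # number of files "decoded" so far

    def _decode(self, path):
        # Placeholder for real MP3 decoding and resampling.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 48000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only the requested example is decoded.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: every example is decoded before indexing.
            return [self._decode(path) for path in self.paths]
        raise KeyError(key)


paths = [f"clip_{i}.mp3" for i in range(1000)]

row_first = LazyAudioDataset(paths)
_ = row_first[0]["audio"]
print(row_first.decode_calls)     # 1

column_first = LazyAudioDataset(paths)
_ = column_first["audio"][0]
print(column_first.decode_calls)  # 1000
```

Both expressions return the same first example, but in this model the column-first form pays the decoding cost for the whole split.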
### Data Splits
| config | train | validation | test |
|----------|--------|------------|-------|
| en_de | 289430 | 15531 | 15531 |
| en_tr | 289430 | 15531 | 15531 |
| en_fa | 289430 | 15531 | 15531 |
| en_sv-SE | 289430 | 15531 | 15531 |
| en_mn | 289430 | 15531 | 15531 |
| en_zh-CN | 289430 | 15531 | 15531 |
| en_cy | 289430 | 15531 | 15531 |
| en_ca | 289430 | 15531 | 15531 |
| en_sl | 289430 | 15531 | 15531 |
| en_et | 289430 | 15531 | 15531 |
| en_id | 289430 | 15531 | 15531 |
| en_ar | 289430 | 15531 | 15531 |
| en_ta | 289430 | 15531 | 15531 |
| en_lv | 289430 | 15531 | 15531 |
| en_ja | 289430 | 15531 | 15531 |
| fr_en | 207374 | 14760 | 14760 |
| de_en | 127834 | 13511 | 13511 |
| es_en | 79015 | 13221 | 13221 |
| ca_en | 95854 | 12730 | 12730 |
| it_en | 31698 | 8940 | 8951 |
| ru_en | 12112 | 6110 | 6300 |
| zh-CN_en | 7085 | 4843 | 4898 |
| pt_en | 9158 | 3318 | 4023 |
| fa_en | 53949 | 3445 | 3445 |
| et_en | 1782 | 1576 | 1571 |
| mn_en | 2067 | 1761 | 1759 |
| nl_en | 7108 | 1699 | 1699 |
| tr_en | 3966 | 1624 | 1629 |
| ar_en | 2283 | 1758 | 1695 |
| sv-SE_en | 2160 | 1349 | 1595 |
| lv_en | 2337 | 1125 | 1629 |
| sl_en | 1843 | 509 | 360 |
| ta_en | 1358 | 384 | 786 |
| ja_en | 1119 | 635 | 684 |
| id_en | 1243 | 792 | 844 |
| cy_en | 1241 | 690 | 690 |
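The config names in the table appear to follow a `<source>_<target>` pattern, with hyphens used only inside language codes such as `zh-CN` and `sv-SE`. Under that assumption, which is read off the table rather than taken from any official API, the following sketch splits each name into its language pair and checks the counts against the 21 and 15 translation directions stated in the summary. The helper name `parse_config` is hypothetical.

```python
# Config names copied from the Data Splits table.
CONFIGS = [
    "en_de", "en_tr", "en_fa", "en_sv-SE", "en_mn", "en_zh-CN", "en_cy", "en_ca",
    "en_sl", "en_et", "en_id", "en_ar", "en_ta", "en_lv", "en_ja",
    "fr_en", "de_en", "es_en", "ca_en", "it_en", "ru_en", "zh-CN_en", "pt_en",
    "fa_en", "et_en", "mn_en", "nl_en", "tr_en", "ar_en", "sv-SE_en", "lv_en",
    "sl_en", "ta_en", "ja_en", "id_en", "cy_en",
]


def parse_config(name):
    """Split '<source>_<target>' on the underscore into a language pair."""
    source, target = name.split("_", 1)
    return source, target


pairs = [parse_config(name) for name in CONFIGS]
from_english = sorted(target for source, target in pairs if source == "en")
into_english = sorted(source for source, target in pairs if target == "en")

print(len(into_english), len(from_english))  # 21 15
print(parse_config("en_sv-SE"))              # ('en', 'sv-SE')
```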
## Dataset Creation
### Curation Rationale
[Needs More Information]
### Source Data
#### Initial Data Collection and Normalization
[Needs More Information]
#### Who are the source language producers?
[Needs More Information]
### Annotations
#### Annotation process
[Needs More Information]
#### Who are the annotators?
[Needs More Information]
### Personal and Sensitive Information
The dataset consists of recordings from people who have donated their voices online. You agree not to attempt to determine the identity of speakers in this dataset.
## Considerations for Using the Data
### Social Impact of Dataset
[Needs More Information]
### Discussion of Biases
[Needs More Information]
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
[CC BY-NC 4.0](https://github.com/facebookresearch/covost/blob/main/LICENSE)
### Citation Information
```
@misc{wang2020covost,
title={CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus},
author={Changhan Wang and Anne Wu and Juan Pino},
year={2020},
eprint={2007.10310},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Contributions
Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.
Provider: maas
Created: 2025-02-25



