multilingual_librispeech
Source: ModelScope community (updated 2026-05-12, indexed 2025-05-24)
Download link: https://modelscope.cn/datasets/facebook/multilingual_librispeech
# Dataset Card for MultiLingual LibriSpeech
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
  - [How to use](#how-to-use)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [MultiLingual LibriSpeech ASR corpus](http://www.openslr.org/94)
- **Repository:** [Needs More Information]
- **Paper:** [MLS: A Large-Scale Multilingual Dataset for Speech Research](https://arxiv.org/abs/2012.03411)
- **Leaderboard:** [🤗 Autoevaluate Leaderboard](https://huggingface.co/spaces/autoevaluate/leaderboards?dataset=facebook%2Fmultilingual_librispeech&only_verified=0&task=automatic-speech-recognition&config=-unspecified-&split=-unspecified-&metric=wer)
### Dataset Summary
This is a streamable version of the Multilingual LibriSpeech (MLS) dataset.
The data archives were restructured from the original ones on [OpenSLR](http://www.openslr.org/94) to make them easier to stream.
The MLS dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish. It includes about 44.5K hours of English and a total of about 6K hours for the other languages.
### Supported Tasks and Leaderboards
- `automatic-speech-recognition`, `speaker-identification`: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe the audio file to written text. The most common evaluation metric is the word error rate (WER). The task has an active leaderboard which can be found at https://paperswithcode.com/dataset/multilingual-librispeech and ranks models based on their WER.
- `text-to-speech`, `text-to-audio`: The dataset can also be used to train a model for Text-To-Speech (TTS).
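Since WER is the metric named above, a minimal sketch of how it is commonly computed may help: the word-level edit distance between a reference transcript and a model hypothesis, divided by the number of reference words. The function name `word_error_rate` is an illustrative choice, not part of any library, and the sketch assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("b" -> "x") and one deletion ("d") over four reference words.
example_wer = word_error_rate("a b c d", "a x c")
print(example_wer)
```

Published WER figures usually depend on the text normalization applied before scoring, so values from this sketch are not directly comparable to leaderboard numbers.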
### Languages
The dataset is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
### How to use
The `datasets` library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared in one call to your local drive by using the `load_dataset` function.
For example, to download the German config, specify the corresponding language config name (e.g., "german" for German):
```python
from datasets import load_dataset
mls = load_dataset("facebook/multilingual_librispeech", "german", split="train")
```
With the `datasets` library, you can also stream the dataset on the fly by adding a `streaming=True` argument to the `load_dataset` call. In streaming mode, samples are loaded one at a time rather than the entire dataset being downloaded to disk.
```python
from datasets import load_dataset
mls = load_dataset("facebook/multilingual_librispeech", "german", split="train", streaming=True)
print(next(iter(mls)))
```
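Because a streamed dataset is consumed as an iterator, a common first step is to peek at a handful of samples without reading further. The sketch below uses only the standard library; `take` and `fake_stream` are hypothetical names, and `fake_stream` is a stand-in generator so the example runs offline. In real use you would pass the streamed `mls` object instead.

```python
from itertools import islice


def take(stream, n):
    """Return the first n items of any iterable, consuming nothing beyond them."""
    return list(islice(stream, n))


def fake_stream():
    # Hypothetical stand-in for a streamed dataset: yields dict samples forever.
    i = 0
    while True:
        yield {"id": f"sample_{i}", "text": f"transcript {i}"}
        i += 1


first_three = take(fake_stream(), 3)
for sample in first_three:
    print(sample["id"], sample["text"])
```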
*Bonus*: create a [PyTorch dataloader](https://huggingface.co/docs/datasets/use_with_pytorch) directly with your own datasets (local/streamed).
Local:
```python
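from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler

# Download and prepare the German training split on local disk.
mls = load_dataset("facebook/multilingual_librispeech", "german", split="train")
# Draw random sample indices and group them into batches of 32.
batch_sampler = BatchSampler(RandomSampler(mls), batch_size=32, drop_last=False)
dataloader = DataLoader(mls, batch_sampler=batch_sampler)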
```
Streaming:
```python
from datasets import load_dataset
from torch.utils.data import DataLoader
mls = load_dataset("facebook/multilingual_librispeech", "german", split="train", streaming=True)
dataloader = DataLoader(mls, batch_size=32)
```
To find out more about loading and preparing audio datasets, head over to [hf.co/blog/audio-datasets](https://huggingface.co/blog/audio-datasets).
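Utterances in a batch generally differ in length, so batching audio usually needs a padding step. The following is a minimal sketch under stated assumptions: each sample is a dict with an `audio` entry holding an `array` of floats and a `text` string, and `pad_batch` is a hypothetical helper written in plain Python. A function of this shape could be adapted, after converting the lists to tensors, for use as the `collate_fn` argument of a PyTorch `DataLoader`.

```python
def pad_batch(samples, pad_value=0.0):
    """Right-pad every audio array to the length of the longest one in the batch."""
    if not samples:
        raise ValueError("cannot pad an empty batch")
    arrays = [list(sample["audio"]["array"]) for sample in samples]
    lengths = [len(array) for array in arrays]
    max_len = max(lengths)
    padded = [array + [pad_value] * (max_len - len(array)) for array in arrays]
    return {
        "audio": padded,      # all rows now have max_len values
        "lengths": lengths,   # original lengths, so padding can be masked later
        "text": [sample["text"] for sample in samples],
    }


# Two toy samples with different lengths (values are illustrative only).
toy_batch = pad_batch([
    {"audio": {"array": [0.1, 0.2, 0.3]}, "text": "first"},
    {"audio": {"array": [0.5]}, "text": "second"},
])
print(toy_batch["lengths"])
```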
### Example scripts
To train your own CTC or Seq2Seq Automatic Speech Recognition models on MultiLingual LibriSpeech with `transformers`, see the [speech-recognition example scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition).
## Dataset Structure
### Data Instances
A typical data point comprises the path to the audio file, called `file`, and its transcription, called `text`. Some additional information about the speaker and the passage that contains the transcription is also provided.
```
{'file': '10900_6473_000030.flac',
 'audio': {'path': '10900_6473_000030.flac',
           'array': array([-1.52587891e-04, 6.10351562e-05, 0.00000000e+00, ...,
                           4.27246094e-04, 5.49316406e-04, 4.57763672e-04]),
           'sampling_rate': 16000},
 'text': 'więc czego chcecie odemnie spytałem wysłuchawszy tego zadziwiającego opowiadania broń nas stary człowieku broń zakrzyknęli równocześnie obaj posłowie\n',
 'speaker_id': 10900,
 'chapter_id': 6473,
 'id': '10900_6473_000030'}
```
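In the instance above, the `id` appears to combine the speaker id, the chapter id, and a segment counter, separated by underscores. The sketch below splits an id on that assumption; the layout is inferred from this single example rather than from a documented specification, and `UtteranceId` and `parse_utterance_id` are hypothetical names.

```python
from typing import NamedTuple


class UtteranceId(NamedTuple):
    speaker_id: int
    chapter_id: int
    segment: int


def parse_utterance_id(utt_id: str) -> UtteranceId:
    """Split '<speaker>_<chapter>_<segment>' (layout assumed from the example)."""
    if utt_id.endswith(".flac"):
        utt_id = utt_id[: -len(".flac")]  # accept the `file` value as well
    parts = utt_id.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected id layout: {utt_id!r}")
    speaker, chapter, segment = (int(part) for part in parts)
    return UtteranceId(speaker, chapter, segment)


parsed = parse_utterance_id("10900_6473_000030")
print(parsed.speaker_id, parsed.chapter_id, parsed.segment)
```

For the instance shown, the parsed speaker and chapter values match the `speaker_id` and `chapter_id` fields.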
### Data Fields
- file: The name of the audio file, in .flac format.
- audio: A dictionary containing the audio filename, the decoded audio array, and the sampling rate. Note that when the audio column is accessed, as in `dataset[0]["audio"]`, the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
- text: the transcription of the audio file.
- id: unique id of the data sample.
- speaker_id: unique id of the speaker. The same speaker id can be found for multiple data samples.
- chapter_id: id of the audiobook chapter which includes the transcription.
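The access-order advice for the audio field can be illustrated with a toy model. The class below is not the `datasets` implementation; `LazyAudioTable` is a hypothetical stand-in that counts how many files would be decoded when a row is indexed first versus when the whole column is requested first.

```python
class LazyAudioTable:
    """Toy stand-in for a dataset whose audio is decoded only on access."""

    def __init__(self, filenames):
        self._filenames = list(filenames)
        self.decode_calls = 0

    def _decode(self, name):
        # Pretend to decode one file; a real decoder would read and resample audio.
        self.decode_calls += 1
        return {"path": name, "array": [0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only the requested sample is decoded.
            return {"audio": self._decode(self._filenames[key])}
        if key == "audio":
            # Column access: every sample is decoded before indexing.
            return [self._decode(name) for name in self._filenames]
        raise KeyError(key)


table = LazyAudioTable(f"file_{i}.flac" for i in range(1000))

_ = table[0]["audio"]
row_first_calls = table.decode_calls

table.decode_calls = 0
_ = table["audio"][0]
column_first_calls = table.decode_calls

print(row_first_calls, column_first_calls)
```

In this toy model, row-first access decodes one file while column-first access decodes all 1000, which is the cost the recommendation is meant to avoid.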
### Data Splits
| Number of samples | Train | Train.9h | Train.1h | Dev | Test |
| ----- | ------ | ----- | ---- | ---- | ---- |
| german | 469942 | 2194 | 241 | 3469 | 3394 |
| dutch | 374287 | 2153 | 234 | 3095 | 3075 |
| french | 258213 | 2167 | 241 | 2416 | 2426 |
| spanish | 220701 | 2110 | 233 | 2408 | 2385 |
| italian | 59623 | 2173 | 240 | 1248 | 1262 |
| portuguese | 37533 | 2116 | 236 | 826 | 871 |
| polish | 25043 | 2173 | 238 | 512 | 520 |
## Dataset Creation
### Curation Rationale
[Needs More Information]
### Source Data
#### Initial Data Collection and Normalization
[Needs More Information]
#### Who are the source language producers?
[Needs More Information]
### Annotations
#### Annotation process
[Needs More Information]
#### Who are the annotators?
[Needs More Information]
### Personal and Sensitive Information
The dataset consists of recordings from people who have donated their voices online. You agree not to attempt to determine the identity of speakers in this dataset.
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
Public Domain, Creative Commons Attribution 4.0 International Public License ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode))
### Citation Information
```
@article{Pratap2020MLSAL,
title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
journal={ArXiv},
year={2020},
volume={abs/2012.03411}
}
```
### Data Statistics
| Duration (h) | Train | Dev | Test |
|--------------|-----------|-------|-------|
| English | 44,659.74 | 15.75 | 15.55 |
| German | 1,966.51 | 14.28 | 14.29 |
| Dutch | 1,554.24 | 12.76 | 12.76 |
| French | 1,076.58 | 10.07 | 10.07 |
| Spanish | 917.68 | 9.99 | 10 |
| Italian | 247.38 | 5.18 | 5.27 |
| Portuguese | 160.96 | 3.64 | 3.74 |
| Polish | 103.65 | 2.08 | 2.14 |

| # Speakers | Train | | Dev | | Test | |
|------------|-------|------|-----|----|------|----|
| Gender | M | F | M | F | M | F |
| English | 2742 | 2748 | 21 | 21 | 21 | 21 |
| German | 81 | 95 | 15 | 15 | 15 | 15 |
| Dutch | 9 | 31 | 3 | 3 | 3 | 3 |
| French | 62 | 80 | 9 | 9 | 9 | 9 |
| Spanish | 36 | 50 | 10 | 10 | 10 | 10 |
| Italian | 22 | 43 | 5 | 5 | 5 | 5 |
| Portuguese | 26 | 16 | 5 | 5 | 5 | 5 |
| Polish | 6 | 5 | 2 | 2 | 2 | 2 |

| # Hours / Gender | Dev | | Test | |
|------------------|------|------|------|------|
| Gender | M | F | M | F |
| English | 7.76 | 7.99 | 7.62 | 7.93 |
| German | 7.06 | 7.22 | 7 | 7.29 |
| Dutch | 6.44 | 6.32 | 6.72 | 6.04 |
| French | 5.13 | 4.94 | 5.04 | 5.02 |
| Spanish | 4.91 | 5.08 | 4.78 | 5.23 |
| Italian | 2.5 | 2.68 | 2.38 | 2.9 |
| Portuguese | 1.84 | 1.81 | 1.83 | 1.9 |
| Polish | 1.12 | 0.95 | 1.09 | 1.05 |
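
As a quick consistency check on the tables in this card, the train durations can be combined with the train sample counts from the Data Splits table. The sketch below copies those figures by hand and derives two quantities: the total non-English train hours and the mean train utterance length per language. English is left out of the second calculation because the splits table lists no sample count for it. The variable and function names are illustrative only.

```python
# Train-split figures copied from the duration and sample-count tables in this card.
TRAIN_HOURS = {
    "german": 1966.51, "dutch": 1554.24, "french": 1076.58, "spanish": 917.68,
    "italian": 247.38, "portuguese": 160.96, "polish": 103.65,
}
TRAIN_SAMPLES = {
    "german": 469942, "dutch": 374287, "french": 258213, "spanish": 220701,
    "italian": 59623, "portuguese": 37533, "polish": 25043,
}


def mean_utterance_seconds(language: str) -> float:
    """Average train utterance length: total seconds divided by sample count."""
    return TRAIN_HOURS[language] * 3600 / TRAIN_SAMPLES[language]


non_english_hours = sum(TRAIN_HOURS.values())
mean_lengths = {lang: mean_utterance_seconds(lang) for lang in TRAIN_HOURS}

print(f"non-English train hours: {non_english_hours:.2f}")
for lang, seconds in mean_lengths.items():
    print(f"{lang:<11}{seconds:6.2f} s per utterance")
```

The non-English total comes to roughly 6,027 hours, in line with the "about 6K hours" stated in the summary, and the mean utterance length is close to 15 seconds for every listed language.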
### Contributions
Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten) and [@polinaeterna](https://github.com/polinaeterna) for adding this dataset.
Provider: maas
Created: 2025-05-20



