legacy-datasets/multilingual_librispeech

Name: legacy-datasets/multilingual_librispeech
Creator: legacy-datasets
Published: 2024-09-10 07:39:32
License: no description provided

Hugging Face · Updated 2024-09-10 · Indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/legacy-datasets/multilingual_librispeech

Resource overview:

---
pretty_name: MultiLingual LibriSpeech
annotations_creators:
- expert-generated
language_creators:
- crowdsourced
- expert-generated
language:
- de
- es
- fr
- it
- nl
- pl
- pt
license:
- cc-by-4.0
multilinguality:
- multilingual
paperswithcode_id: librispeech-1
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- automatic-speech-recognition
- audio-classification
task_ids:
- speaker-identification
dataset_info:
- config_name: polish
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 16136430
    num_examples: 25043
  - name: train.9h
    num_bytes: 1383232
    num_examples: 2173
  - name: train.1h
    num_bytes: 145411
    num_examples: 238
  - name: validation
    num_bytes: 318964
    num_examples: 512
  - name: test
    num_bytes: 332317
    num_examples: 520
  download_size: 6609569551
  dataset_size: 18316354
- config_name: german
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 277089334
    num_examples: 469942
  - name: train.9h
    num_bytes: 1325460
    num_examples: 2194
  - name: train.1h
    num_bytes: 145998
    num_examples: 241
  - name: validation
    num_bytes: 2160779
    num_examples: 3469
  - name: test
    num_bytes: 2131177
    num_examples: 3394
  download_size: 122944886305
  dataset_size: 282852748
- config_name: dutch
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 218648573
    num_examples: 374287
  - name: train.9h
    num_bytes: 1281951
    num_examples: 2153
  - name: train.1h
    num_bytes: 141672
    num_examples: 234
  - name: validation
    num_bytes: 1984165
    num_examples: 3095
  - name: test
    num_bytes: 1945428
    num_examples: 3075
  download_size: 92158429530
  dataset_size: 224001789
- config_name: french
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 162009691
    num_examples: 258213
  - name: train.9h
    num_bytes: 1347707
    num_examples: 2167
  - name: train.1h
    num_bytes: 146699
    num_examples: 241
  - name: validation
    num_bytes: 1482961
    num_examples: 2416
  - name: test
    num_bytes: 1539152
    num_examples: 2426
  download_size: 64474642518
  dataset_size: 166526210
- config_name: spanish
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 136743162
    num_examples: 220701
  - name: train.9h
    num_bytes: 1288180
    num_examples: 2110
  - name: train.1h
    num_bytes: 138734
    num_examples: 233
  - name: validation
    num_bytes: 1463115
    num_examples: 2408
  - name: test
    num_bytes: 1464565
    num_examples: 2385
  download_size: 53296894035
  dataset_size: 141097756
- config_name: italian
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 36008104
    num_examples: 59623
  - name: train.9h
    num_bytes: 1325927
    num_examples: 2173
  - name: train.1h
    num_bytes: 145006
    num_examples: 240
  - name: validation
    num_bytes: 732210
    num_examples: 1248
  - name: test
    num_bytes: 746977
    num_examples: 1262
  download_size: 15395281399
  dataset_size: 38958224
- config_name: portuguese
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 23036487
    num_examples: 37533
  - name: train.9h
    num_bytes: 1305698
    num_examples: 2116
  - name: train.1h
    num_bytes: 143781
    num_examples: 236
  - name: validation
    num_bytes: 512463
    num_examples: 826
  - name: test
    num_bytes: 549893
    num_examples: 871
  download_size: 9982803818
  dataset_size: 25548322
---

# Dataset Card for MultiLingual LibriSpeech

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [MultiLingual LibriSpeech ASR corpus](http://www.openslr.org/94)
- **Repository:** [Needs More Information]
- **Paper:** [MLS: A Large-Scale Multilingual Dataset for Speech Research](https://arxiv.org/abs/2012.03411)
- **Leaderboard:** [Paperswithcode Leaderboard](https://paperswithcode.com/dataset/multilingual-librispeech)

### Dataset Summary

<div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400">
  <p><b>Deprecated:</b> This legacy dataset doesn't support streaming and is not updated. Use "facebook/multilingual_librispeech" instead.</p>
</div>

Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages - English, German, Dutch, Spanish, French, Italian, Portuguese, Polish.

### Supported Tasks and Leaderboards

- `automatic-speech-recognition`, `audio-speaker-identification`: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe the audio file to written text. The most common evaluation metric is the word error rate (WER). The task has an active leaderboard which can be found at https://paperswithcode.com/dataset/multilingual-librispeech and ranks models based on their WER.

### Languages

The dataset is derived from read audiobooks from LibriVox and consists of 8 languages - English, German, Dutch, Spanish, French, Italian, Portuguese, Polish

## Dataset Structure

### Data Instances

A typical data point comprises the path to the audio file, usually called `file` and its transcription, called `text`. Some additional information about the speaker and the passage which contains the transcription is provided.

```
{'chapter_id': 141231,
 'file': '/home/patrick/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/dev_clean/1272/141231/1272-141231-0000.flac',
 'audio': {'path': '/home/patrick/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/dev_clean/1272/141231/1272-141231-0000.flac',
           'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
           'sampling_rate': 16000},
 'id': '1272-141231-0000',
 'speaker_id': 1272,
 'text': 'A MAN SAID TO THE UNIVERSE SIR I EXIST'}
```

### Data Fields

- file: A path to the downloaded audio file in .flac format.
- audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
- text: the transcription of the audio file.
- id: unique id of the data sample.
- speaker_id: unique id of the speaker. The same speaker id can be found for multiple data samples.
- chapter_id: id of the audiobook chapter which includes the transcription.

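The advice to index the row before the `"audio"` column can be illustrated with a toy model. The sketch below is not the `datasets` library implementation; `ToyAudioDataset` and `count_decodes` are hypothetical names, and the "decode" step is a placeholder that only counts calls. It assumes, as the field description states, that audio is decoded at access time.

```python
class ToyAudioDataset:
    """Toy stand-in for a dataset whose audio column is decoded on access."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # number of simulated decode operations

    def _decode(self, path):
        # Placeholder for reading and resampling a .flac file.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 16000}

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only the requested sample is decoded.
            path = self.paths[key]
            return {"file": path, "audio": self._decode(path)}
        if key == "audio":
            # Column access: every sample is decoded before indexing.
            return [self._decode(p) for p in self.paths]
        if key == "file":
            return list(self.paths)
        raise KeyError(key)


def count_decodes(access, n_files=1000):
    """Run one access pattern on a fresh toy dataset and count decodes."""
    ds = ToyAudioDataset(f"clip_{i}.flac" for i in range(n_files))
    access(ds)
    return ds.decode_calls


row_first = count_decodes(lambda ds: ds[0]["audio"])     # sample index first
column_first = count_decodes(lambda ds: ds["audio"][0])  # column first
print(row_first, column_first)
```

In this toy model the row-first pattern triggers a single decode, while the column-first pattern decodes every file before returning the first one.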
### Data Splits

|            | Train  | Train.9h | Train.1h | Dev  | Test |
| ---------- | ------ | -------- | -------- | ---- | ---- |
| german     | 469942 | 2194     | 241      | 3469 | 3394 |
| dutch      | 374287 | 2153     | 234      | 3095 | 3075 |
| french     | 258213 | 2167     | 241      | 2416 | 2426 |
| spanish    | 220701 | 2110     | 233      | 2408 | 2385 |
| italian    | 59623  | 2173     | 240      | 1248 | 1262 |
| portuguese | 37533  | 2116     | 236      | 826  | 871  |
| polish     | 25043  | 2173     | 238      | 512  | 520  |

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

Public Domain, Creative Commons Attribution 4.0 International Public License ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode))

### Citation Information

```
@article{Pratap2020MLSAL,
  title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.03411}
}
```

### Contributions

Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten) for adding this dataset.

Provider:

legacy-datasets

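Summary of original information

Dataset card: MultiLingual LibriSpeech

Dataset description

Dataset summary

The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and comprises 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.

Supported tasks and leaderboards

automatic-speech-recognition
audio-speaker-identification

The dataset can be used to train automatic speech recognition (ASR) models. The model receives an audio file and transcribes it to written text. The most common evaluation metric is the word error rate (WER). The task has an active leaderboard, available on the Paperswithcode Leaderboard, which ranks models by their WER.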
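
WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a model hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the leaderboard's scoring code; the function name `word_error_rate` is chosen here for illustration, and the reference sentence is the transcription from the data instance on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "A MAN SAID TO THE UNIVERSE SIR I EXIST"
hypothesis = "A MAN SAID TO THE UNIVERSE SIR EXIST"  # the word "I" was dropped
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.3f}")
```

With one deleted word out of nine reference words, the sketch yields a WER of 1/9. Real evaluations usually normalise casing and punctuation before scoring, which this example leaves out.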

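Languages

MLS as a whole covers 8 languages; the language tags and configurations listed for this repository cover the following 7 (English is not among them):

German (de)
Spanish (es)
French (fr)
Italian (it)
Dutch (nl)
Polish (pl)
Portuguese (pt)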

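Dataset structure

Data instances

A typical data point comprises the path to the audio file (usually called file) and its transcription (called text). Additional information about the speaker and the chapter containing the transcription is also provided.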

    {
      "chapter_id": 141231,
      "file": "/home/patrick/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/dev_clean/1272/141231/1272-141231-0000.flac",
      "audio": {
        "path": "/home/patrick/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/dev_clean/1272/141231/1272-141231-0000.flac",
        "array": [...],
        "sampling_rate": 16000
      },
      "id": "1272-141231-0000",
      "speaker_id": 1272,
      "text": "A MAN SAID TO THE UNIVERSE SIR I EXIST"
    }

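Data fields

file: path to the downloaded audio file, in .flac format.
audio: a dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
text: the transcription of the audio file.
id: unique identifier of the data sample.
speaker_id: unique identifier of the speaker.
chapter_id: identifier of the audiobook chapter that contains the transcription.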
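
In the single data instance shown on this page, the id field looks like the speaker_id, the chapter_id, and an utterance counter joined by hyphens. The sketch below parses an id under that assumption only; the pattern is inferred from one example rather than documented, other records may use a different separator or layout, and `SampleId` and `parse_sample_id` are hypothetical helper names.

```python
from typing import NamedTuple


class SampleId(NamedTuple):
    speaker_id: int
    chapter_id: int
    utterance: str  # kept as a string to preserve leading zeros


def parse_sample_id(sample_id: str) -> SampleId:
    """Split an id of the assumed form '<speaker>-<chapter>-<utterance>'."""
    parts = sample_id.split("-")
    if len(parts) != 3:
        raise ValueError(f"unexpected id layout: {sample_id!r}")
    speaker, chapter, utterance = parts
    return SampleId(int(speaker), int(chapter), utterance)


# Field values copied from the data instance shown on this page.
example = {"id": "1272-141231-0000", "speaker_id": 1272, "chapter_id": 141231}
parsed = parse_sample_id(example["id"])

# Check the assumed layout against the explicit fields of the same record.
consistent = (
    parsed.speaker_id == example["speaker_id"]
    and parsed.chapter_id == example["chapter_id"]
)
print(parsed, consistent)
```

Because the explicit speaker_id and chapter_id fields are always available, they are the safer source; parsing the id is mainly useful as a consistency check.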

Data splits

Language	Train	Train (9h)	Train (1h)	Validation	Test
German (de)	469942	2194	241	3469	3394
Dutch (nl)	374287	2153	234	3095	3075
French (fr)	258213	2167	241	2416	2426
Spanish (es)	220701	2110	233	2408	2385
Italian (it)	59623	2173	240	1248	1262
Portuguese (pt)	37533	2116	236	826	871
Polish (pl)	25043	2173	238	512	520
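
The table makes the imbalance between languages easy to quantify. As a small worked example, the sketch below copies the full training-set counts from the table and computes each language's share of the total; the dictionary name `TRAIN_EXAMPLES` is chosen here for illustration.

```python
# Number of examples in the full training split, copied from the table above.
TRAIN_EXAMPLES = {
    "german": 469942,
    "dutch": 374287,
    "french": 258213,
    "spanish": 220701,
    "italian": 59623,
    "portuguese": 37533,
    "polish": 25043,
}

total_train = sum(TRAIN_EXAMPLES.values())

# Share of the overall training examples contributed by each language.
train_share = {lang: n / total_train for lang, n in TRAIN_EXAMPLES.items()}

largest = max(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get)
smallest = min(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get)

for lang, share in sorted(train_share.items(), key=lambda kv: -kv[1]):
    print(f"{lang:<11}{TRAIN_EXAMPLES[lang]:>8}  {share:6.1%}")
print(f"total      {total_train:>8}")
```

German contributes the most training examples and Polish the fewest, whereas the 9-hour and 1-hour subsets are of similar size across languages, which makes them convenient for low-resource comparisons.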

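Dataset creation

Dataset information

config_name: name of the language configuration
features: data features
- file: file path, of type string
- audio: audio information, with sampling_rate: 16000
- text: transcription, of type string
- speaker_id: speaker identifier, of type int64
- chapter_id: chapter identifier, of type int64
- id: sample identifier, of type string
splits: data split information
- train: training set
- train.9h: 9-hour training set
- train.1h: 1-hour training set
- validation: validation set
- test: test set
download_size: download size
dataset_size: dataset size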

Polish (polish)

features:
- file: string
- audio: audio: sampling_rate: 16000
- text: string
- speaker_id: int64
- chapter_id: int64
- id: string
splits:
- train: num_bytes: 16136430, num_examples: 25043
- train.9h: num_bytes: 1383232, num_examples: 2173
- train.1h: num_bytes: 145411, num_examples: 238
- validation: num_bytes: 318964, num_examples: 512
- test: num_bytes: 332317, num_examples: 520
download_size: 6609569551
dataset_size: 18316354

German (german)

features:
- file: string
- audio: audio: sampling_rate: 16000
- text: string
- speaker_id: int64
- chapter_id: int64
- id: string
splits:
- train: num_bytes: 277089334, num_examples: 469942
- train.9h: num_bytes: 1325460, num_examples: 2194
- train.1h: num_bytes: 145998, num_examples: 241
- validation: num_bytes: 2160779, num_examples: 3469
- test: num_bytes: 2131177, num_examples: 3394
download_size: 122944886305
dataset_size: 282852748

Dutch (dutch)

features:
- file: string
- audio: audio: sampling_rate: 16000
- text: string
- speaker_id: int64
- chapter_id: int64
- id: string
splits:
- train: num_bytes: 218648573, num_examples: 374287
- train.9h: num_bytes: 1281951, num_examples: 2153
- train.1h: num_bytes: 141672, num_examples: 234
- validation: num_bytes: 1984165, num_examples: 3095
- test: num_bytes: 1945428, num_examples: 3075
download_size: 92158429530
dataset_size: 224001789

French (french)

features:
- file: string
- audio: audio: sampling_rate: 16000
- text: string
- speaker_id: int64
- chapter_id: int64
- id: string
splits:
- train: num_bytes: 162009691, num_examples: 258213
- train.9h: num_bytes: 1347707, num_examples: 2167
- train.1h: num_bytes: 146699, num_examples: 241
- validation: num_bytes: 1482961, num_examples: 2416
- test: num_bytes: 1539152, num_examples: 2426
download_size: 64474642518
dataset_size: 166526210

Spanish (spanish)

features:
- file: string
- audio: audio: sampling_rate: 16000
- text: string
- speaker_id: int64
- chapter_id: int64
- id: string
splits:
- train: num_bytes: 136743162, num_examples: 220701
- train.9h: num_bytes: 1288180, num_examples: 2110
- train.1h: num_bytes: 138734, num_examples: 233
- validation: num_bytes: 1463115, num_examples: 2408
- test: num_bytes: 1464565, num_examples: 2385
download_size: 53296894035
dataset_size: 141097756

Italian (italian)

features:
- file: string
- audio: audio: sampling_rate: 16000
- text: string
- speaker_id: int64
- chapter_id: int64
- id: string
splits:
- train: num_bytes: 36008104, num_examples: 59623
- train.9h: num_bytes: 1325927, num_examples: 2173
- train.1h: num_bytes: 145006, num_examples: 240
- validation: num_bytes: 732210, num_examples: 1248
- test: num_bytes: 746977, num_examples: 1262
download_size: 15395281399
dataset_size: 38958224

Portuguese (portuguese)

features:
- file: string
- audio: audio: sampling_rate: 16000
- text: string
- speaker_id: int64
- chapter_id: int64
- id: string
splits:
- train: num_bytes: 23036487, num_examples: 37533
- train.9h: num_bytes: 1305698, num_examples: 2116
- train.1h: num_bytes: 143781, num_examples: 236
- validation: num_bytes: 512463, num_examples: 826
- test: num_bytes: 549893, num_examples: 871
download_size: 9982803818
dataset_size: 25548322
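
The download_size values above are large raw integers, so a quick conversion helps when planning disk space. The sketch below copies them from the listings above and assumes they are byte counts, which the page does not state explicitly; `DOWNLOAD_SIZE` and `to_gib` are names chosen for illustration.

```python
# download_size per configuration, copied from the listings above.
# Assumption: the values are expressed in bytes.
DOWNLOAD_SIZE = {
    "polish": 6609569551,
    "german": 122944886305,
    "dutch": 92158429530,
    "french": 64474642518,
    "spanish": 53296894035,
    "italian": 15395281399,
    "portuguese": 9982803818,
}


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 1024**3 bytes)."""
    return n_bytes / 1024**3


total_download = sum(DOWNLOAD_SIZE.values())

for config, size in sorted(DOWNLOAD_SIZE.items(), key=lambda kv: kv[1]):
    print(f"{config:<11}{to_gib(size):8.1f} GiB")
print(f"{'all':<11}{to_gib(total_download):8.1f} GiB")
```

Under that assumption, downloading all seven configurations takes several hundred GiB, so fetching only the configuration you need is usually the practical choice.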

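Collected summary

Dataset introduction

Construction

MultiLingual LibriSpeech is a multilingual speech recognition corpus built from read audiobooks from LibriVox. It covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish. Building the dataset involved collecting and annotating the audio and text of the audiobooks, and decoding and resampling the audio files to suit different model-training needs.

Usage

To use the MultiLingual LibriSpeech dataset, first choose a configuration name, such as german or dutch. Then choose the split you need, such as train or validation. When working with audio, the decoded audio array can be accessed directly, but decoding and resampling can take considerable time. The dataset also includes transcriptions and speaker identifiers, which can be used for tasks such as speech recognition and speaker identification.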
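
The steps above can be sketched with the Hugging Face `datasets` library. This is a hedged sketch rather than a verified recipe: it assumes `datasets` is installed and that the installed version can still load this legacy repository (newer releases may refuse script-based repositories, and the dataset card itself recommends "facebook/multilingual_librispeech" instead). `check_selection` and `load_mls` are hypothetical helper names; the configuration and split names are the ones listed on this page.

```python
CONFIGS = ("polish", "german", "dutch", "french", "spanish", "italian", "portuguese")
SPLITS = ("train", "train.9h", "train.1h", "validation", "test")


def check_selection(config: str, split: str) -> tuple:
    """Validate a configuration/split pair against the names listed on this page."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    return config, split


def load_mls(config: str, split: str,
             repo: str = "legacy-datasets/multilingual_librispeech"):
    """Load one split of one language configuration.

    The import is deferred so the validation helper works without `datasets`.
    Calling this function triggers a multi-gigabyte download.
    """
    config, split = check_selection(config, split)
    from datasets import load_dataset  # third-party dependency
    return load_dataset(repo, config, split=split)


# Example usage (commented out because it downloads the Polish configuration):
# ds = load_mls("polish", "train.1h")
# sample = ds[0]              # index the row first ...
# audio = sample["audio"]     # ... then read the decoded audio
# print(sample["text"], audio["sampling_rate"])
```

Validating the names before calling the library keeps typos from turning into a long failed download, and indexing the row before the audio column avoids decoding every file in the split.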

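Background and challenges

Background overview

In speech recognition, a large-scale multilingual speech dataset is essential for improving the performance of automatic speech recognition (ASR) systems. The MultiLingual LibriSpeech (MLS) dataset was created to meet this need, giving researchers a platform for speech recognition research across multiple languages. The dataset was created by Vineel Pratap and colleagues and released in 2020. It is derived from read audiobooks in the LibriVox project and covers eight languages (English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish), providing rich material for training multilingual speech recognition models. MLS both broadens the data available for speech recognition research and supports the development of cross-lingual speech recognition techniques.

Current challenges

Although MLS plays an important role in multilingual speech recognition, building and applying it still involves several challenges. First, the problems the dataset targets include improving the accuracy and robustness of multilingual speech recognition systems and achieving generality across languages. Second, collecting, annotating, and validating the data requires substantial labor and time, and data quality and diversity must be ensured. Third, the dataset raises privacy and security concerns, such as ensuring that personal information in the data is not leaked and avoiding bias introduced during training. Finally, applications of the dataset need to take social impact into account, for example by avoiding gender or regional bias in speech recognition systems.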

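Common scenarios

Typical use cases

In speech recognition, the Multilingual LibriSpeech dataset provides a strong multilingual benchmark. It brings together speech samples in several languages, allowing researchers to train and evaluate speech recognition models that work across different languages. This matters for applications such as multilingual voice assistants, translation services, and voice search systems.

Academic problems addressed

The Multilingual LibriSpeech dataset addresses the language dependence that is common in multilingual speech recognition. Traditionally, speech recognition models have needed large amounts of monolingual training data; this dataset lets researchers study multilingual speech recognition within a single unified framework, which advances the field. It also supports research on cross-lingual speech recognition, helping with the transfer of recognition capability between languages.

Practical applications

In practice, the Multilingual LibriSpeech dataset is used in areas such as voice assistants, translation services, and voice search systems. For example, speech recognition models trained on it can power voice assistants that support several languages, so that users can interact in their native language. It is also used to build multilingual translation services that accept spoken input for cross-lingual translation, and voice search systems that let users search by speaking.

Recent research on the dataset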