sovitrath/librispeech_asr

Name: sovitrath/librispeech_asr
Creator: sovitrath
Published: 2026-04-19 05:17:55
License: (no description provided)

Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/sovitrath/librispeech_asr

Resource description:

---
pretty_name: LibriSpeech
annotations_creators:
- expert-generated
language_creators:
- crowdsourced
- expert-generated
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
paperswithcode_id: librispeech-1
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- automatic-speech-recognition
- audio-classification
task_ids:
- speaker-identification
dataset_info:
- config_name: clean
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: train.100
    num_bytes: 6619683041
    num_examples: 28539
  - name: train.360
    num_bytes: 23898214592
    num_examples: 104014
  - name: validation
    num_bytes: 359572231
    num_examples: 2703
  - name: test
    num_bytes: 367705423
    num_examples: 2620
  download_size: 30121377654
  dataset_size: 31245175287
- config_name: other
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: train.500
    num_bytes: 31810256902
    num_examples: 148688
  - name: validation
    num_bytes: 337283304
    num_examples: 2864
  - name: test
    num_bytes: 352396474
    num_examples: 2939
  download_size: 31236565377
  dataset_size: 32499936680
- config_name: all
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: train.clean.100
    num_bytes: 6627791685
    num_examples: 28539
  - name: train.clean.360
    num_bytes: 23927767570
    num_examples: 104014
  - name: train.other.500
    num_bytes: 31852502880
    num_examples: 148688
  - name: validation.clean
    num_bytes: 359505691
    num_examples: 2703
  - name: validation.other
    num_bytes: 337213112
    num_examples: 2864
  - name: test.clean
    num_bytes: 368449831
    num_examples: 2620
  - name: test.other
    num_bytes: 353231518
    num_examples: 2939
  download_size: 61357943031
  dataset_size: 63826462287
configs:
- config_name: clean
  data_files:
  - split: test
    path: "clean/test/*.parquet"
  - split: train.100
    path: "clean/train.100/*.parquet"
  - split: train.360
    path: "clean/train.360/*.parquet"
  - split: validation
    path: "clean/validation/*.parquet"
- config_name: other
  data_files:
  - split: test
    path: "other/test/*.parquet"
  - split: train.500
    path: "other/train.500/*.parquet"
  - split: validation
    path: "other/validation/*.parquet"
- config_name: all
  default: true
  data_files:
  - split: test.clean
    path: "all/test.clean/*.parquet"
  - split: test.other
    path: "all/test.other/*.parquet"
  - split: train.clean.100
    path: "all/train.clean.100/*.parquet"
  - split: train.clean.360
    path: "all/train.clean.360/*.parquet"
  - split: train.other.500
    path: "all/train.other.500/*.parquet"
  - split: validation.clean
    path: "all/validation.clean/*.parquet"
  - split: validation.other
    path: "all/validation.other/*.parquet"
---

# Dataset Card for librispeech_asr

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [LibriSpeech ASR corpus](http://www.openslr.org/12)
- **Repository:** [Needs More Information]
- **Paper:** [LibriSpeech: An ASR Corpus Based On Public Domain Audio Books](https://www.danielpovey.com/files/2015_icassp_librispeech.pdf)
- **Leaderboard:** [The 🤗 Speech Bench](https://huggingface.co/spaces/huggingface/hf-speech-bench)
- **Point of Contact:** [Daniel Povey](mailto:dpovey@gmail.com)

### Dataset Summary

LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.

### Supported Tasks and Leaderboards

- `automatic-speech-recognition`, `audio-speaker-identification`: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe it to written text. The most common evaluation metric is the word error rate (WER). The task has an active Hugging Face leaderboard, which can be found at https://huggingface.co/spaces/huggingface/hf-speech-bench. The leaderboard ranks models uploaded to the Hub based on their WER. An external leaderboard at https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-clean ranks the latest models from research and academia.

### Languages

The audio is in English. There are two quality-based configurations, `clean` and `other`, plus an `all` configuration (the default in this repository) that combines them. The speakers in the corpus were ranked according to the WER of the transcripts of a model trained on a different dataset, and were divided roughly in the middle, with the lower-WER speakers designated as "clean" and the higher-WER speakers designated as "other".
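The WER metric and the clean/other partition described above can be sketched in a few lines of Python. This is a minimal illustration, not the corpus authors' tooling: the function names `wer` and `split_speakers` are hypothetical, the per-speaker WER values in the usage note are made up, and cutting the ranked list exactly in half is an assumption (the card only says "roughly in the middle").

```python
from typing import Dict, List, Tuple


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def split_speakers(speaker_wer: Dict[int, float]) -> Tuple[List[int], List[int]]:
    """Rank speakers by WER and cut the ranking in half (assumed cut point)."""
    ranked = sorted(speaker_wer, key=lambda speaker: speaker_wer[speaker])
    mid = len(ranked) // 2
    return ranked[:mid], ranked[mid:]  # (lower-WER "clean", higher-WER "other")
```

For example, `wer("A MAN SAID TO THE UNIVERSE SIR I EXIST", "A MAN SAID TO THE UNIVERSE I EXIST")` counts one deleted word against nine reference words, and `split_speakers({1: 0.05, 2: 0.30, 3: 0.10, 4: 0.20})` puts the hypothetical speakers 1 and 3 in the lower-WER half.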
## Dataset Structure

### Data Instances

A typical data point comprises the path to the audio file, usually called `file`, and its transcription, called `text`. Some additional information about the speaker and the passage which contains the transcription is provided.

```
{'chapter_id': 141231,
 'file': '/home/albert/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/dev_clean/1272/141231/1272-141231-0000.flac',
 'audio': {
   'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
   'sampling_rate': 16000
 },
 'id': '1272-141231-0000',
 'speaker_id': 1272,
 'text': 'A MAN SAID TO THE UNIVERSE SIR I EXIST'}
```

### Data Fields

- `file`: A path to the downloaded audio file in .flac format.
- `audio`: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column with `dataset[0]["audio"]`, the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
- `text`: The transcription of the audio file.
- `id`: Unique id of the data sample.
- `speaker_id`: Unique id of the speaker. The same speaker id can be found for multiple data samples.
- `chapter_id`: Id of the audiobook chapter which includes the transcription.

### Data Splits

The size of the corpus makes it impractical, or at least inconvenient for some users, to distribute it as a single large archive. Thus the training portion of the corpus is split into three subsets, with approximate sizes of 100, 360 and 500 hours respectively.
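The three training subsets surface as split names inside this repository's `clean`, `other` and `all` configurations. The sketch below copies those names from the card's `configs` metadata and validates a request before handing it to the Hugging Face `datasets` library. The helpers `check_split` and `load_split` are hypothetical names, and the loading call assumes `datasets` is installed and the Hub (or a mirror) is reachable; it has not been verified against this particular repository.

```python
from typing import Dict, List

# Repository id as shown on this page.
REPO_ID = "sovitrath/librispeech_asr"

# Split names per configuration, copied from the card's `configs` metadata.
SPLITS: Dict[str, List[str]] = {
    "clean": ["train.100", "train.360", "validation", "test"],
    "other": ["train.500", "validation", "test"],
    "all": [
        "train.clean.100", "train.clean.360", "train.other.500",
        "validation.clean", "validation.other",
        "test.clean", "test.other",
    ],
}


def check_split(config: str, split: str) -> str:
    """Return `split` if it exists in `config`, otherwise raise ValueError."""
    if config not in SPLITS:
        raise ValueError(f"unknown config {config!r}; expected one of {sorted(SPLITS)}")
    if split not in SPLITS[config]:
        raise ValueError(f"config {config!r} has no split {split!r}; expected one of {SPLITS[config]}")
    return split


def load_split(config: str, split: str, streaming: bool = True):
    """Load one split; streaming avoids downloading tens of gigabytes up front."""
    # Imported lazily so the validation helpers work without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=check_split(config, split), streaming=streaming)
```

A call such as `load_split("clean", "train.100")` would then return an iterable dataset whose first row can be read with `next(iter(...))`, while a mismatched pair such as `("other", "train.100")` fails early with a `ValueError` instead of a remote lookup error.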
A simple automatic procedure was used to select the audio in the first two sets to be, on average, of higher recording quality and with accents closer to US English. An acoustic model was trained on WSJ's si-84 data subset and was used to recognize the audio in the corpus, using a bigram LM estimated on the text of the respective books. We computed the Word Error Rate (WER) of this automatic transcript relative to our reference transcripts obtained from the book texts. The speakers in the corpus were ranked according to the WER of the WSJ model's transcripts, and were divided roughly in the middle, with the lower-WER speakers designated as "clean" and the higher-WER speakers designated as "other".

For "clean", the data is split into train, validation, and test sets. The train set is further split into train.100 and train.360, accounting for 100h and 360h of the training data respectively. For "other", the data is split into train, validation, and test sets. The train set contains approximately 500h of recorded speech.

The table below gives the number of examples in each split.

|       | Train.500 | Train.360 | Train.100 | Valid | Test |
| ----- | --------- | --------- | --------- | ----- | ---- |
| clean | -         | 104014    | 28539     | 2703  | 2620 |
| other | 148688    | -         | -         | 2864  | 2939 |

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

The dataset consists of recordings from people who have donated their voices online. You agree to not attempt to determine the identity of speakers in this dataset.
## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

The dataset was initially created by Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur.

### Licensing Information

[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)

### Citation Information

```
@inproceedings{panayotov2015librispeech,
  title={Librispeech: an ASR corpus based on public domain audio books},
  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on},
  pages={5206--5210},
  year={2015},
  organization={IEEE}
}
```

### Contributions

Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten) for adding this dataset.

