librispeech_asr

ModelScope community. Updated 2026-05-15; indexed 2025-01-11.
Download link:
https://modelscope.cn/datasets/openslr/librispeech_asr
Description:
# Dataset Card for librispeech_asr

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [LibriSpeech ASR corpus](http://www.openslr.org/12)
- **Repository:** [Needs More Information]
- **Paper:** [LibriSpeech: An ASR Corpus Based On Public Domain Audio Books](https://www.danielpovey.com/files/2015_icassp_librispeech.pdf)
- **Leaderboard:** [The 🤗 Speech Bench](https://huggingface.co/spaces/huggingface/hf-speech-bench)
- **Point of Contact:** [Daniel Povey](mailto:dpovey@gmail.com)

### Dataset Summary

LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.

### Supported Tasks and Leaderboards

- `automatic-speech-recognition`, `audio-speaker-identification`: The dataset can be used to train a model for Automatic Speech Recognition (ASR).
  The model is presented with an audio file and asked to transcribe the audio file to written text. The most common evaluation metric is the word error rate (WER). The task has an active Hugging Face leaderboard which can be found at https://huggingface.co/spaces/huggingface/hf-speech-bench. The leaderboard ranks models uploaded to the Hub based on their WER. An external leaderboard at https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-clean ranks the latest models from research and academia.

### Languages

The audio is in English. There are two configurations: `clean` and `other`. The speakers in the corpus were ranked according to the WER of the transcripts of a model trained on a different dataset, and were divided roughly in the middle, with the lower-WER speakers designated as "clean" and the higher-WER speakers designated as "other".

## Dataset Structure

### Data Instances

A typical data point comprises the path to the audio file, usually called `file`, and its transcription, called `text`. Some additional information about the speaker and the passage which contains the transcription is provided.

```
{'chapter_id': 141231,
 'file': '/home/albert/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/dev_clean/1272/141231/1272-141231-0000.flac',
 'audio': {'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346,
                            0.00091553,  0.00085449], dtype=float32),
           'sampling_rate': 16000},
 'id': '1272-141231-0000',
 'speaker_id': 1272,
 'text': 'A MAN SAID TO THE UNIVERSE SIR I EXIST'}
```

### Data Fields

- file: A path to the downloaded audio file in .flac format.
- audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`.
  Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
- text: the transcription of the audio file.
- id: unique id of the data sample.
- speaker_id: unique id of the speaker. The same speaker id can be found for multiple data samples.
- chapter_id: id of the audiobook chapter which includes the transcription.

### Data Splits

The size of the corpus makes it impractical, or at least inconvenient for some users, to distribute it as a single large archive. Thus the training portion of the corpus is split into three subsets, with approximate size 100, 360 and 500 hours respectively.

A simple automatic procedure was used to select the audio in the first two sets to be, on average, of higher recording quality and with accents closer to US English. An acoustic model was trained on WSJ’s si-84 data subset and was used to recognize the audio in the corpus, using a bigram LM estimated on the text of the respective books. We computed the Word Error Rate (WER) of this automatic transcript relative to our reference transcripts obtained from the book texts.

The speakers in the corpus were ranked according to the WER of the WSJ model’s transcripts, and were divided roughly in the middle, with the lower-WER speakers designated as "clean" and the higher-WER speakers designated as "other".

For "clean", the data is split into train, validation, and test set. The train set is further split into train.100 and train.360 respectively accounting for 100h and 360h of the training data. For "other", the data is split into train, validation, and test set. The train set contains approximately 500h of recorded speech.
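
The speaker-ranking step described above can be illustrated with a small sketch. This is not the authors' pipeline (which used a WSJ-trained acoustic model and a bigram LM); it only shows the arithmetic, assuming reference and automatic transcripts are already available as strings. The function names `edit_distance`, `wer`, and `split_speakers`, the speaker id 1988, and the second toy transcript are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    # prev[j] is the distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the number of reference words."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def split_speakers(utterances):
    """Rank speakers by pooled WER and cut the ranking roughly in the middle.

    `utterances` is an iterable of (speaker_id, reference, hypothesis) tuples.
    Returns (lower_wer_speakers, higher_wer_speakers).
    """
    errors, words = defaultdict(int), defaultdict(int)
    for speaker, reference, hypothesis in utterances:
        ref = reference.split()
        errors[speaker] += edit_distance(ref, hypothesis.split())
        words[speaker] += len(ref)
    ranked = sorted(errors, key=lambda s: errors[s] / words[s])
    mid = len(ranked) // 2
    return ranked[:mid], ranked[mid:]


# Toy data: the second speaker and both hypotheses are made up.
toy = [
    (1272, "A MAN SAID TO THE UNIVERSE SIR I EXIST",
           "A MAN SAID TO THE UNIVERSE SIR I EXIST"),
    (1988, "SHE OPENED THE DOOR", "SHE OPEN DOOR"),
]
clean, other = split_speakers(toy)
print(clean, other)
```

Pooling errors and word counts per speaker, rather than averaging per-utterance WER, is a design choice of this sketch; the card does not say which aggregation was used.
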

Number of utterances per split:

|       | Train.500 | Train.360 | Train.100 | Valid | Test |
| ----- | --------- | --------- | --------- | ----- | ---- |
| clean | -         | 104014    | 28539     | 2703  | 2620 |
| other | 148688    | -         | -         | 2864  | 2939 |

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

The dataset consists of recordings from people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

The dataset was initially created by Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur.

### Licensing Information

[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)

### Citation Information

```
@inproceedings{panayotov2015librispeech,
  title={Librispeech: an ASR corpus based on public domain audio books},
  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on},
  pages={5206--5210},
  year={2015},
  organization={IEEE}
}
```

### Contributions

Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten) for adding this dataset.
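
As a closing illustration of the access-order note under Data Fields, the toy class below imitates a dataset whose `"audio"` column is decoded on access and counts how many files each pattern would decode. It is a simplified stand-in that mirrors the behavior the card describes, not the `datasets` library implementation: `LazyAudioDataset`, its `decode_calls` counter, and the generated file names are hypothetical.

```python
class LazyAudioDataset:
    """Toy stand-in for a dataset whose "audio" column is decoded on access."""

    def __init__(self, files):
        self.files = list(files)
        self.decode_calls = 0  # number of files "decoded" so far

    def _decode(self, path):
        # Placeholder for real decoding and resampling, which is the slow part.
        self.decode_calls += 1
        return {"path": path, "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only the requested sample is decoded.
            path = self.files[key]
            return {"file": path, "audio": self._decode(path)}
        if key == "audio":
            # Column access: every sample is decoded before indexing.
            return [self._decode(path) for path in self.files]
        if key == "file":
            return list(self.files)
        raise KeyError(key)


# Hypothetical file names, used only to populate the toy dataset.
files = [f"1272-141231-{i:04d}.flac" for i in range(1000)]

row_first = LazyAudioDataset(files)
_ = row_first[0]["audio"]
print(row_first.decode_calls)      # 1

column_first = LazyAudioDataset(files)
_ = column_first["audio"][0]
print(column_first.decode_calls)   # 1000
```

Both expressions return the same sample, but the column-first form pays the decoding cost for every file before the index is applied.
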

Provider: maas
Created: 2025-01-08