MLCommons/peoples_speech

Name: MLCommons/peoples_speech
Creator: MLCommons
Published: 2024-11-20 15:17:45
License: 暂无描述

Hugging Face2024-11-20 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/MLCommons/peoples_speech

下载链接

链接失效反馈

官方服务：

资源简介：

--- annotations_creators: - crowdsourced - machine-generated language_creators: - crowdsourced - machine-generated language: - en license: - cc-by-2.0 - cc-by-2.5 - cc-by-3.0 - cc-by-4.0 - cc-by-sa-3.0 - cc-by-sa-4.0 multilinguality: - monolingual size_categories: - 1T<n source_datasets: - original task_categories: - automatic-speech-recognition task_ids: [] pretty_name: People's Speech tags: - robust-speech-recognition - noisy-speech-recognition - speech-recognition dataset_info: - config_name: clean features: - name: id dtype: string - name: audio dtype: audio: sampling_rate: 16000 - name: duration_ms dtype: int32 - name: text dtype: string splits: - name: train num_bytes: 401733771186.124 num_examples: 1501271 - name: validation num_bytes: 2459781412.24 num_examples: 18622 - name: test num_bytes: 4324307722.96 num_examples: 34898 download_size: 398550700437 dataset_size: 408517860321.32404 - config_name: clean_sa features: - name: id dtype: string - name: audio dtype: audio: sampling_rate: 16000 - name: duration_ms dtype: int32 - name: text dtype: string splits: - name: train num_bytes: 75267509124.558 num_examples: 257093 - name: validation num_bytes: 2075929254.254 num_examples: 18622 - name: test num_bytes: 3894954757.41 num_examples: 34898 download_size: 72518549222 dataset_size: 81238393136.222 - config_name: dirty features: - name: id dtype: string - name: audio dtype: audio: sampling_rate: 16000 - name: duration_ms dtype: int32 - name: text dtype: string splits: - name: train num_bytes: 1569500875399.994 num_examples: 5476898 - name: validation num_bytes: 2641406179.2539997 num_examples: 18622 - name: test num_bytes: 5097236056.41 num_examples: 34898 download_size: 1496747948260 dataset_size: 1577239517635.6577 - config_name: dirty_sa features: - name: id dtype: string - name: audio dtype: audio: sampling_rate: 16000 - name: duration_ms dtype: int32 - name: text dtype: string splits: - name: train num_bytes: 163776914241.91 num_examples: 548014 - name: validation num_bytes: 2075929254.254 num_examples: 18622 - name: test num_bytes: 3894954757.41 num_examples: 34898 download_size: 149326092074 dataset_size: 169747798253.574 - config_name: microset features: - name: id dtype: string - name: audio dtype: audio: sampling_rate: 16000 - name: duration_ms dtype: int32 - name: text dtype: string splits: - name: train num_bytes: 92397066.0 num_examples: 336 download_size: 90204303 dataset_size: 92397066.0 - config_name: test features: - name: id dtype: string - name: audio dtype: audio: sampling_rate: 16000 - name: duration_ms dtype: int32 - name: text dtype: string splits: - name: test num_bytes: 3894954757.41 num_examples: 34898 download_size: 4087772459 dataset_size: 3894954757.41 - config_name: validation features: - name: id dtype: string - name: audio dtype: audio: sampling_rate: 16000 - name: duration_ms dtype: int32 - name: text dtype: string splits: - name: validation num_bytes: 2075929254.254 num_examples: 18622 download_size: 2335244149 dataset_size: 2075929254.254 configs: - config_name: clean data_files: - split: train path: clean/train-* - split: validation path: clean/validation-* - split: test path: clean/test-* - config_name: clean_sa data_files: - split: train path: clean_sa/train-* - split: validation path: clean_sa/validation-* - split: test path: clean_sa/test-* - config_name: dirty data_files: - split: train path: dirty/train-* - split: validation path: dirty/validation-* - split: test path: dirty/test-* - config_name: dirty_sa data_files: - split: train path: dirty_sa/train-* - split: validation path: dirty_sa/validation-* - split: test path: dirty_sa/test-* - config_name: microset data_files: - split: train path: microset/train-* - config_name: test data_files: - split: test path: test/test-* - config_name: validation data_files: - split: validation path: validation/validation-* --- # Dataset Card for People's Speech ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-instances) - [Data Splits](#data-instances) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) ## Dataset Description - **Homepage:** https://mlcommons.org/en/peoples-speech/ - **Repository:** https://github.com/mlcommons/peoples-speech - **Paper:** https://arxiv.org/abs/2111.09344 - **Leaderboard:** [Needs More Information] - **Point of Contact:** [datasets@mlcommons.org](mailto:datasets@mlcommons.org) ### Dataset Summary The People's Speech Dataset is among the world's largest English speech recognition corpus today that is licensed for academic and commercial usage under CC-BY-SA and CC-BY 4.0. It includes 30,000+ hours of transcribed speech in English languages with a diverse set of speakers. This open dataset is large enough to train speech-to-text systems and crucially is available with a permissive license. ### Supported Tasks and Leaderboards [Needs More Information] ### Languages English ## Dataset Structure ### Data Instances { "id": "gov_DOT_uscourts_DOT_scotus_DOT_19-161/gov_DOT_uscourts_DOT_scotus_DOT_19-161_DOT_2020-03-02_DOT_mp3_00002.flac", "audio": { "path": "gov_DOT_uscourts_DOT_scotus_DOT_19-161/gov_DOT_uscourts_DOT_scotus_DOT_19-161_DOT_2020-03-02_DOT_mp3_00002.flac" "array": array([-6.10351562e-05, ...]), "sampling_rate": 16000 } "duration_ms": 14490, "text": "contends that the suspension clause requires a [...]" } ### Data Fields { "id": datasets.Value("string"), "audio": datasets.Audio(sampling_rate=16_000), "duration_ms": datasets.Value("int32"), "text": datasets.Value("string"), } ### Data Splits We provide the following configurations for the dataset: `cc-by-clean` (`"clean"`), `cc-by-dirty` (`"dirty"`), `cc-by-sa-clean` (`"clean_sa"`), `cc-by-sa-dirty` (`"dirty_sa"`), and `microset` (`"microset"`). We also provide validation and test configurations, which are not only available as standalone configurations but are also included as validation and test splits within each of the above configurations for ease of use. Specifically: - Setting `data_dir="validation"` and `split="validation"` corresponds to the validation split of any of the configurations: `"clean"`, `"clean_sa"`, `"dirty"`, or `"dirty_sa"`. - Similarly, setting `data_dir="test"` and `split="test"` corresponds to the test split of these configurations. ``` ├── clean │ ├── train │ ├── validation │ └── test ├── clean_sa │ ├── train │ ├── validation │ └── test ├── dirty │ ├── train │ ├── validation │ └── test ├── dirty_sa │ ├── train │ ├── validation │ └── test ├── microset │ └── train ├── validation │ └── validation └── test └── test ``` ## Dataset Creation ### Curation Rationale See our [paper](https://arxiv.org/abs/2111.09344). ### Source Data #### Initial Data Collection and Normalization Data was downloaded via the archive.org API. No data inference was done. #### Who are the source language producers? [Needs More Information] ### Annotations #### Annotation process No manual annotation is done. We download only source audio with already existing transcripts. #### Who are the annotators? For the test and dev sets, we paid native American English speakers to do transcriptions. We do not know the identities of the transcriptionists for data in the training set. For the training set, we have noticed that some transcriptions are likely to be the output of automatic speech recognition systems. ### Personal and Sensitive Information Several of our sources are legal and government proceedings, spoken histories, speeches, and so on. Given that these were intended as public documents and licensed as such, it is natural that the involved individuals are aware of this. ## Considerations for Using the Data ### Social Impact of Dataset The dataset could be used for speech synthesis. However, this requires careful cleaning of the dataset, as background noise is not tolerable for speech synthesis. The dataset could be used for keyword spotting tasks as well. In particular, this is good use case for the non-English audio in the dataset. Our sincere hope is that the large breadth of sources our dataset incorporates reduces existing quality of service issues today, like speech recognition system’s poor understanding of non-native English accents. We cannot think of any unfair treatment that come from using this dataset at this time. ### Discussion of Biases Our data is downloaded from archive.org. As such, the data is biased towards whatever users decide to upload there. Almost all of our data is American accented English. ### Other Known Limitations As of version 1.0, a portion of data in the training, test, and dev sets is poorly aligned. Specifically, some words appear in the transcript, but not the audio, or some words appear in the audio, but not the transcript. We are working on it. ## Additional Information ### Dataset Curators [Needs More Information] ### Licensing Information We provide CC-BY and CC-BY-SA subsets of the dataset. ### Citation Information Please cite: ``` @article{DBLP:journals/corr/abs-2111-09344, author = {Daniel Galvez and Greg Diamos and Juan Ciro and Juan Felipe Cer{\'{o}}n and Keith Achorn and Anjali Gopi and David Kanter and Maximilian Lam and Mark Mazumder and Vijay Janapa Reddi}, title = {The People's Speech: {A} Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage}, journal = {CoRR}, volume = {abs/2111.09344}, year = {2021}, url = {https://arxiv.org/abs/2111.09344}, eprinttype = {arXiv}, eprint = {2111.09344}, timestamp = {Mon, 22 Nov 2021 16:44:07 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2111-09344.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} } ```

提供机构：

MLCommons

原始信息汇总

数据集卡片 for Peoples Speech

数据集描述

数据集摘要

Peoples Speech 数据集是当今世界上最大的英语语音识别语料库之一，适用于学术和商业用途，采用 CC-BY-SA 和 CC-BY 4.0 许可。它包含超过 30,000 小时的英语转录语音，涵盖多样化的说话者。这个开放数据集足够大，可以用于训练语音到文本系统，并且关键的是，它具有宽松的许可。

支持的任务和排行榜

[需要更多信息]

语言

英语

数据集结构

数据实例

json { "id": "gov_DOT_uscourts_DOT_scotus_DOT_19-161/gov_DOT_uscourts_DOT_scotus_DOT_19-161_DOT_2020-03-02_DOT_mp3_00002.flac", "audio": { "path": "gov_DOT_uscourts_DOT_scotus_DOT_19-161/gov_DOT_uscourts_DOT_scotus_DOT_19-161_DOT_2020-03-02_DOT_mp3_00002.flac", "array": array([-6.10351562e-05, ...]), "sampling_rate": 16000 }, "duration_ms": 14490, "text": "contends that the suspension clause requires a [...]" }

数据字段

json { "id": datasets.Value("string"), "audio": datasets.Audio(sampling_rate=16_000), "duration_ms": datasets.Value("int32"), "text": datasets.Value("string"), }

数据分割

我们为数据集提供了以下配置：cc-by-clean、cc-by-dirty、cc-by-sa-clean、cc-by-sa-dirty 和 microset。我们不为任何配置提供分割。

数据集创建

策划理由

参见我们的论文。

源数据

初始数据收集和规范化

数据通过 archive.org API 下载。没有进行数据推断。

源语言生产者是谁？

[需要更多信息]

注释

注释过程

没有进行手动注释。我们只下载已经存在转录的源音频。

注释者是谁？

对于测试和开发集，我们聘请了母语为美国英语的人进行转录。我们不知道训练集中转录者的身份。对于训练集，我们注意到一些转录很可能是自动语音识别系统的输出。

个人和敏感信息

我们的多个来源是法律和政府程序、口述历史、演讲等。鉴于这些文件旨在作为公共文档并以此许可，自然而然地，相关个人是知晓这一点的。

使用数据集的注意事项

数据集的社会影响

该数据集可用于语音合成。然而，这需要仔细清理数据集，因为背景噪音对于语音合成是不可容忍的。

该数据集也可用于关键词发现任务。特别是，这对于数据集中非英语音频是一个很好的用例。

我们真诚地希望，我们数据集所包含的大量来源能够减少当今存在的质量服务问题，例如语音识别系统对非母语英语口音的理解不佳。目前，我们无法想到使用此数据集会带来任何不公平的待遇。

偏见的讨论

我们的数据从 archive.org 下载。因此，数据偏向于用户决定上传到那里的内容。

几乎所有数据都是美国口音的英语。

其他已知限制

截至 1.0 版本，训练集、测试集和开发集中的一部分数据对齐不佳。具体来说，一些单词出现在转录中，但没有出现在音频中，或者一些单词出现在音频中，但没有出现在转录中。我们正在努力解决这个问题。

附加信息

数据集策展人

[需要更多信息]

许可信息

我们提供了 CC-BY 和 CC-BY-SA 子集的数据集。

引用信息

请引用：

@article{DBLP:journals/corr/abs-2111-09344, author = {Daniel Galvez and Greg Diamos and Juan Ciro and Juan Felipe Cer{{o}}n and Keith Achorn and Anjali Gopi and David Kanter and Maximilian Lam and Mark Mazumder and Vijay Janapa Reddi}, title = {The Peoples Speech: {A} Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage}, journal = {CoRR}, volume = {abs/2111.09344}, year = {2021}, url = {https://arxiv.org/abs/2111.09344}, eprinttype = {arXiv}, eprint = {2111.09344}, timestamp = {Mon, 22 Nov 2021 16:44:07 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2111-09344.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }

搜集汇总

数据集介绍

背景与挑战

背景概述

People's Speech数据集是世界上最大的英语语音识别语料库之一，包含超过30,000小时的转录语音，覆盖多种说话者和场景，适用于自动语音识别任务。该数据集采用CC-BY和CC-BY-SA许可，允许学术和商业使用，并包含多个子集和分割以支持模型训练和评估。

以上内容由遇见数据集搜集并总结生成

5,000+

优质数据集

54 个

任务类型

进入经典数据集