peoples_speech
Source: ModelScope community. Updated 2026-04-28; indexed 2025-02-15.
Download link: https://modelscope.cn/datasets/MLCommons/peoples_speech
Resource description:
# Dataset Card for People's Speech
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** https://mlcommons.org/en/peoples-speech/
- **Repository:** https://github.com/mlcommons/peoples-speech
- **Paper:** https://arxiv.org/abs/2111.09344
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [datasets@mlcommons.org](mailto:datasets@mlcommons.org)
### Dataset Summary
The People's Speech Dataset is among the world's largest English speech recognition corpora that are licensed for academic and commercial use under CC-BY-SA and CC-BY 4.0. It includes more than 30,000 hours of transcribed English speech from a diverse set of speakers. This open dataset is large enough to train speech-to-text systems and, crucially, is available under a permissive license.
### Supported Tasks and Leaderboards
[Needs More Information]
### Languages
English
## Dataset Structure
### Data Instances
```
{
    "id": "gov_DOT_uscourts_DOT_scotus_DOT_19-161/gov_DOT_uscourts_DOT_scotus_DOT_19-161_DOT_2020-03-02_DOT_mp3_00002.flac",
    "audio": {
        "path": "gov_DOT_uscourts_DOT_scotus_DOT_19-161/gov_DOT_uscourts_DOT_scotus_DOT_19-161_DOT_2020-03-02_DOT_mp3_00002.flac",
        "array": array([-6.10351562e-05, ...]),
        "sampling_rate": 16000
    },
    "duration_ms": 14490,
    "text": "contends that the suspension clause requires a [...]"
}
```
### Data Fields
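```
{
    "id": datasets.Value("string"),
    "audio": datasets.Audio(sampling_rate=16_000),
    "duration_ms": datasets.Value("int32"),
    "text": datasets.Value("string"),
}
```
- `id`: unique identifier of the example (string).
- `audio`: the audio object, holding the file path, the decoded sample array, and the sampling rate (16,000 Hz).
- `duration_ms`: length of the audio in milliseconds (32-bit integer).
- `text`: the transcript of the audio (string).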
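As a rough illustration of how these fields relate, the sketch below checks a record's `duration_ms` against the duration implied by its sample count and sampling rate. The card does not state how `duration_ms` was computed, so the relationship is an assumption, and the helper names `expected_duration_ms` and `check_record`, the tolerance, and the synthetic record are all hypothetical.

```python
def expected_duration_ms(num_samples, sampling_rate):
    """Duration implied by a sample count, in whole milliseconds."""
    return round(1000 * num_samples / sampling_rate)


def check_record(record, tolerance_ms=1):
    """Return True if `duration_ms` agrees with the decoded audio length.

    Assumption (not stated in the card): `duration_ms` is the length of
    `audio["array"]` divided by `audio["sampling_rate"]`.
    """
    audio = record["audio"]
    implied = expected_duration_ms(len(audio["array"]), audio["sampling_rate"])
    return abs(implied - record["duration_ms"]) <= tolerance_ms


# Synthetic record: one second of silence at 16 kHz (not real dataset content).
record = {
    "id": "example",
    "audio": {
        "path": "example.flac",
        "array": [0.0] * 16000,
        "sampling_rate": 16000,
    },
    "duration_ms": 1000,
    "text": "hypothetical transcript",
}

print(check_record(record))  # True
```

With real data the `array` would typically be a NumPy array rather than a list; `len()` works the same way for one-dimensional audio.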
### Data Splits
We provide the following configurations for the dataset: `cc-by-clean` (`"clean"`), `cc-by-dirty` (`"dirty"`), `cc-by-sa-clean` (`"clean_sa"`), `cc-by-sa-dirty` (`"dirty_sa"`), and `microset` (`"microset"`).
We also provide validation and test configurations, which are not only available as standalone configurations but are also included as validation and test splits within each of the above configurations for ease of use.
Specifically:
- Setting `data_dir="validation"` and `split="validation"` corresponds to the validation split of any of the configurations: `"clean"`, `"clean_sa"`, `"dirty"`, or `"dirty_sa"`.
- Similarly, setting `data_dir="test"` and `split="test"` corresponds to the test split of these configurations.
```
├── clean
│ ├── train
│ ├── validation
│ └── test
├── clean_sa
│ ├── train
│ ├── validation
│ └── test
├── dirty
│ ├── train
│ ├── validation
│ └── test
├── dirty_sa
│ ├── train
│ ├── validation
│ └── test
├── microset
│ └── train
├── validation
│ └── validation
└── test
└── test
```
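The layout above can be captured in a small lookup table that validates a configuration and split before anything is downloaded. This is only a sketch: it assumes the Hugging Face `datasets` library and that the hub identifier matches the `MLCommons/peoples_speech` path in the download link. The helper names `resolve_split` and `load_split` are hypothetical.

```python
# Configuration -> available splits, copied from the directory tree above.
SPLITS_BY_CONFIG = {
    "clean": ("train", "validation", "test"),
    "clean_sa": ("train", "validation", "test"),
    "dirty": ("train", "validation", "test"),
    "dirty_sa": ("train", "validation", "test"),
    "microset": ("train",),
    "validation": ("validation",),
    "test": ("test",),
}


def resolve_split(config, split):
    """Check a (config, split) pair against the tree and return it."""
    if config not in SPLITS_BY_CONFIG:
        raise ValueError(f"unknown configuration: {config!r}")
    if split not in SPLITS_BY_CONFIG[config]:
        raise ValueError(f"configuration {config!r} has no split {split!r}")
    return config, split


def load_split(config, split, streaming=True):
    """Load one split lazily; needs `pip install datasets` and network access.

    Assumption: the dataset id below is the Hugging Face hub id.
    """
    from datasets import load_dataset  # imported here so the table works offline

    config, split = resolve_split(config, split)
    return load_dataset(
        "MLCommons/peoples_speech", config, split=split, streaming=streaming
    )
```

Calling `load_split("microset", "train")` would be the smallest way to try the data; streaming is used by default here because the full training sets are very large.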
## Dataset Creation
### Curation Rationale
See our [paper](https://arxiv.org/abs/2111.09344).
### Source Data
#### Initial Data Collection and Normalization
Data was downloaded via the archive.org API. No data inference was done.
#### Who are the source language producers?
[Needs More Information]
### Annotations
#### Annotation process
No manual annotation is done. We download only source audio with already existing transcripts.
#### Who are the annotators?
For the test and dev sets, we paid native American English speakers to do transcriptions. We do not know the identities of the transcriptionists for data in the training set. For the training set, we have noticed that some transcriptions are likely to be the output of automatic speech recognition systems.
### Personal and Sensitive Information
Several of our sources are legal and government proceedings, spoken histories, speeches, and so on. Given that these were intended as public documents and licensed as such, it is natural that the involved individuals are aware of this.
## Considerations for Using the Data
### Social Impact of Dataset
The dataset could be used for speech synthesis. However, this requires careful cleaning of the dataset, as background noise is not tolerable for speech synthesis.
The dataset could be used for keyword spotting tasks as well. In particular, this is a good use case for the non-English audio in the dataset.
Our sincere hope is that the large breadth of sources our dataset incorporates reduces existing quality-of-service issues, such as speech recognition systems' poor understanding of non-native English accents. We cannot think of any unfair treatment that could come from using this dataset at this time.
### Discussion of Biases
Our data is downloaded from archive.org. As such, the data is biased towards whatever users decide to upload there.
Almost all of our data is American-accented English.
### Other Known Limitations
As of version 1.0, a portion of data in the training, test, and dev sets is poorly aligned. Specifically, some words appear in the transcript, but not the audio, or some words appear in the audio, but not the transcript. We are working on it.
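One crude way to flag possibly misaligned examples is to compare the transcript's word count with the audio duration. The sketch below is not the curators' method: it is a hypothetical heuristic, and the function names and the thresholds of 0.5 and 5.0 words per second are illustrative assumptions that would need tuning on real data.

```python
def words_per_second(text, duration_ms):
    """Speaking rate implied by a transcript and its audio duration."""
    if duration_ms <= 0:
        return float("inf")
    return len(text.split()) / (duration_ms / 1000)


def looks_misaligned(text, duration_ms, min_wps=0.5, max_wps=5.0):
    """Flag examples whose implied speaking rate is implausible.

    Too few words per second suggests speech missing from the transcript;
    too many suggests transcript words missing from the audio.
    The thresholds are illustrative assumptions, not values from the card.
    """
    rate = words_per_second(text, duration_ms)
    return rate < min_wps or rate > max_wps


ten_words = "one two three four five six seven eight nine ten"
print(looks_misaligned(ten_words, 4000))  # False: 2.5 words per second
print(looks_misaligned(ten_words, 500))   # True: 20 words per second
```

A rate check like this only catches gross mismatches; it cannot detect substitutions or small insertions and deletions.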
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
We provide CC-BY and CC-BY-SA subsets of the dataset.
### Citation Information
Please cite:
```
@article{DBLP:journals/corr/abs-2111-09344,
author = {Daniel Galvez and
Greg Diamos and
Juan Ciro and
Juan Felipe Cer{\'{o}}n and
Keith Achorn and
Anjali Gopi and
David Kanter and
Maximilian Lam and
Mark Mazumder and
Vijay Janapa Reddi},
title = {The People's Speech: {A} Large-Scale Diverse English Speech Recognition
Dataset for Commercial Usage},
journal = {CoRR},
volume = {abs/2111.09344},
year = {2021},
url = {https://arxiv.org/abs/2111.09344},
eprinttype = {arXiv},
eprint = {2111.09344},
timestamp = {Mon, 22 Nov 2021 16:44:07 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2111-09344.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
Provider: maas
Created: 2025-02-09



