speech_commands
Source: ModelScope community (updated 2026-05-06, indexed 2025-04-26)
Download link: https://modelscope.cn/datasets/google/speech_commands
# Dataset Card for SpeechCommands
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://www.tensorflow.org/datasets/catalog/speech_commands
- **Repository:** [More Information Needed]
- **Paper:** [Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition](https://arxiv.org/pdf/1804.03209.pdf)
- **Leaderboard:** [More Information Needed]
- **Point of Contact:** Pete Warden, petewarden@google.com
### Dataset Summary
This is a set of one-second .wav audio files, each containing a single spoken
English word or background noise. These words are from a small set of commands, and are spoken by a
variety of different speakers. This data set is designed to help train simple
machine learning models. It is covered in more detail at [https://arxiv.org/abs/1804.03209](https://arxiv.org/abs/1804.03209).
Version 0.01 of the data set (configuration `"v0.01"`) was released on August 3rd 2017 and contains
64,727 audio files.
Version 0.02 of the data set (configuration `"v0.02"`) was released on April 11th 2018 and
contains 105,829 audio files.
### Supported Tasks and Leaderboards
* `keyword-spotting`: the dataset can be used to train and evaluate keyword
spotting systems. The task is to detect preregistered keywords by classifying utterances
into a predefined set of words. It is usually performed on-device to keep
response times short, so accuracy, model size, and inference time are all crucial.
### Languages
The language data in SpeechCommands is in English (BCP-47 `en`).
## Dataset Structure
### Data Instances
Example of a core word (`"label"` is a word, `"is_unknown"` is `False`):
```python
{
  "file": "no/7846fd85_nohash_0.wav",
  "audio": {
    "path": "no/7846fd85_nohash_0.wav",
    "array": array([ -0.00021362, -0.00027466, -0.00036621, ...,  0.00079346,
          0.00091553,  0.00079346]),
    "sampling_rate": 16000
  },
  "label": 1,  # "no"
  "is_unknown": False,
  "speaker_id": "7846fd85",
  "utterance_id": 0
}
```
Example of an auxiliary word (`"label"` is a word, `"is_unknown"` is `True`):
```python
{
  "file": "tree/8b775397_nohash_0.wav",
  "audio": {
    "path": "tree/8b775397_nohash_0.wav",
    "array": array([ -0.00854492, -0.01339722, -0.02026367, ...,  0.00274658,
          0.00335693,  0.0005188]),
    "sampling_rate": 16000
  },
  "label": 28,  # "tree"
  "is_unknown": True,
  "speaker_id": "8b775397",
  "utterance_id": 0
}
```
Example of background noise (`_silence_`) class:
```python
{
  "file": "_silence_/doing_the_dishes.wav",
  "audio": {
    "path": "_silence_/doing_the_dishes.wav",
    "array": array([ 0.        ,  0.        ,  0.        , ..., -0.00592041,
         -0.00405884, -0.00253296]),
    "sampling_rate": 16000
  },
  "label": 30,  # "_silence_"
  "is_unknown": False,
  "speaker_id": "None",
  "utterance_id": 0  # doesn't make sense here
}
```
### Data Fields
* `file`: relative audio filename inside the original archive.
* `audio`: dictionary containing a relative audio filename,
  a decoded audio array, and the sampling rate. Note that when accessing
  the audio column via `dataset[0]["audio"]`, the audio is automatically decoded
  and resampled to `dataset.features["audio"].sampling_rate`.
  Decoding and resampling a large number of audio files can take a significant
  amount of time, so always index the sample first and the `"audio"` column second:
  prefer `dataset[0]["audio"]` over `dataset["audio"][0]`.
* `label`: integer id of the class, which is either the word spoken in the audio sample or the background noise (`_silence_`) class.
* `is_unknown`: whether the word is auxiliary. `False` for a core word or `_silence_`,
  `True` for an auxiliary word.
* `speaker_id`: unique id of the speaker. `None` if the label is `_silence_`.
* `utterance_id`: incremental id of a word utterance by the same speaker.
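The first example instance above suggests that the `file` value follows a `<word>/<speaker_id>_nohash_<utterance_id>.wav` pattern. As a minimal sketch, assuming that pattern holds for the word recordings (the helper name `parse_file_field` is hypothetical and not part of the dataset's loading code), the related fields could be recovered from the path like this:

```python
import re

# Assumed pattern, inferred from the example instances:
#   <word>/<speaker_id>_nohash_<utterance_id>.wav
_WORD_FILE = re.compile(
    r"^(?P<word>[^/]+)/(?P<speaker>[0-9a-f]+)_nohash_(?P<utt>\d+)\.wav$"
)


def parse_file_field(file):
    """Split a relative filename into (word, speaker_id, utterance_id).

    Files that do not match the assumed pattern (for example the
    `_silence_` clips) yield speaker_id=None and utterance_id=0.
    """
    match = _WORD_FILE.match(file)
    if match is None:
        word = file.split("/", 1)[0]
        return word, None, 0
    return match["word"], match["speaker"], int(match["utt"])


print(parse_file_field("no/7846fd85_nohash_0.wav"))
print(parse_file_field("_silence_/doing_the_dishes.wav"))
```

Returning `None` for the speaker of a `_silence_` clip mirrors the field description above; a real pipeline would normally read these fields directly from the dataset rather than re-deriving them.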
### Data Splits
The dataset has two versions (= configurations): `"v0.01"` and `"v0.02"`. `"v0.02"`
contains more words (see section [Source Data](#source-data) for more details).
| | train | validation | test |
|----- |------:|-----------:|-----:|
| v0.01 | 51093 | 6799 | 3081 |
| v0.02 | 84848 | 9982 | 4890 |
Note that in the train and validation sets, examples of the `_silence_` class are longer than 1 second.
You can use the following code to sample 1-second examples from the longer ones:
```python
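from random import randint


def sample_noise(example, silence_label):
    # Extract a random 1-second slice from a `_silence_` example,
    # e.g. inside `torch.utils.data.Dataset.__getitem__()`.
    # `silence_label` is the integer id of the `_silence_` class; if `label`
    # is a `ClassLabel` feature, it can be looked up with
    # `dataset.features["label"].str2int("_silence_")`.
    if example["label"] == silence_label:
        audio = example["audio"]
        n = audio["sampling_rate"]  # number of samples in one second
        random_offset = randint(0, len(audio["array"]) - n)
        audio["array"] = audio["array"][random_offset : random_offset + n]
    return example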
```
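To put the split sizes in the table earlier in this section into proportion, the short sketch below sums each configuration and prints every split's share. The numbers are copied from the table; the names `SPLITS` and `split_fractions` are illustrative only.

```python
# Split sizes copied from the Data Splits table.
SPLITS = {
    "v0.01": {"train": 51093, "validation": 6799, "test": 3081},
    "v0.02": {"train": 84848, "validation": 9982, "test": 4890},
}


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


for version, sizes in SPLITS.items():
    shares = split_fractions(sizes)
    summary = ", ".join(f"{name}: {share:.1%}" for name, share in shares.items())
    print(f"{version}: {sum(sizes.values())} examples ({summary})")
```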
## Dataset Creation
### Curation Rationale
The primary goal of the dataset is to provide a way to build and test small
models that can detect a single word from a set of target words and differentiate it
from background noise or unrelated speech with as few false positives as possible.
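One way to quantify the false positives mentioned above is the share of non-target clips (unrelated speech or background noise) that a model labels as one of the target words. The sketch below assumes references and predictions are given as plain strings and uses `_unknown_` as an assumed class name for unrelated speech; it is an illustration under those assumptions, not an official metric for this dataset.

```python
# Non-target class names; `_unknown_` is an assumed name for unrelated speech.
NON_TARGET = {"_unknown_", "_silence_"}


def false_positive_rate(references, predictions):
    """Fraction of non-target clips that were predicted as a target word."""
    non_target_preds = [
        pred for ref, pred in zip(references, predictions) if ref in NON_TARGET
    ]
    if not non_target_preds:
        return 0.0
    false_positives = sum(pred not in NON_TARGET for pred in non_target_preds)
    return false_positives / len(non_target_preds)


refs = ["yes", "_unknown_", "_silence_", "_unknown_", "no"]
preds = ["yes", "stop", "_silence_", "_unknown_", "no"]
# One of the three non-target clips ("_unknown_" predicted as "stop") is a false alarm.
print(false_positive_rate(refs, preds))
```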
### Source Data
#### Initial Data Collection and Normalization
The audio files were collected using crowdsourcing, see
[aiyprojects.withgoogle.com/open_speech_recording](https://github.com/petewarden/extract_loudest_section)
for some of the open source audio collection code that was used. The goal was to gather examples of
people speaking single-word commands, rather than conversational sentences, so
they were prompted for individual words over the course of a five-minute
session.
In version 0.01, thirty different words were recorded: "Yes", "No", "Up", "Down", "Left",
"Right", "On", "Off", "Stop", "Go", "Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine",
"Bed", "Bird", "Cat", "Dog", "Happy", "House", "Marvin", "Sheila", "Tree", "Wow".
In version 0.02, more words were added: "Backward", "Forward", "Follow", "Learn", "Visual".
In both versions, ten of the words are used as commands by convention: "Yes", "No", "Up", "Down", "Left",
"Right", "On", "Off", "Stop", "Go". The other words are considered auxiliary (in the current implementation
this is marked by an `"is_unknown"` value of `True`). Their function is to teach a model to distinguish core words
from unrecognized ones.
The `_silence_` label contains a set of longer audio clips that are either recordings or
a mathematical simulation of noise.
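The core/auxiliary distinction described above can be applied when preparing training targets. As a rough sketch, assuming each example carries the `label` and `is_unknown` fields from this card and that a list of class names indexed by label id is available (the `_unknown_` class name, the `label_names` argument, and the helper `target_class` are illustrative assumptions), all auxiliary words could be merged into a single class:

```python
CORE_WORDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]


def target_class(example, label_names):
    """Map an example to a core word, `_silence_`, or a shared `_unknown_` class."""
    if example["is_unknown"]:
        return "_unknown_"
    return label_names[example["label"]]


# Toy label list for illustration only; the real id-to-name mapping may differ.
toy_names = CORE_WORDS + ["tree", "_silence_"]
print(target_class({"label": 1, "is_unknown": False}, toy_names))   # no
print(target_class({"label": 10, "is_unknown": True}, toy_names))   # _unknown_
print(target_class({"label": 11, "is_unknown": False}, toy_names))  # _silence_
```

With this mapping a model sees twelve target classes: the ten command words, `_unknown_`, and `_silence_`.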
#### Who are the source language producers?
The audio files were collected using crowdsourcing.
### Annotations
#### Annotation process
The labels come from a list of words prepared in advance.
Speakers were prompted for individual words over the course of a five-minute
session.
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
The dataset consists of recordings from people who donated their voices online. You agree not to attempt to determine the identity of speakers in this dataset.
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
Creative Commons BY 4.0 License ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode)).
### Citation Information
```
@article{speechcommandsv2,
author = { {Warden}, P.},
title = "{Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition}",
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1804.03209},
primaryClass = "cs.CL",
keywords = {Computer Science - Computation and Language, Computer Science - Human-Computer Interaction},
year = 2018,
month = apr,
url = {https://arxiv.org/abs/1804.03209},
}
```
### Contributions
Thanks to [@polinaeterna](https://github.com/polinaeterna) for adding this dataset.
Provider: maas
Created: 2025-04-21



