speech_commands

Name: speech_commands
Creator: maas
Published: 2026-05-06 16:46:21
License: No description available

ModelScope community: updated 2026-05-06, indexed 2025-04-26

Download link:

https://modelscope.cn/datasets/google/speech_commands

Resource description:

# Dataset Card for SpeechCommands

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://www.tensorflow.org/datasets/catalog/speech_commands
- **Repository:** [More Information Needed]
- **Paper:** [Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition](https://arxiv.org/pdf/1804.03209.pdf)
- **Leaderboard:** [More Information Needed]
- **Point of Contact:** Pete Warden, petewarden@google.com

### Dataset Summary

This is a set of one-second .wav audio files, each containing a single spoken English word or background noise. These words are from a small set of commands, and are spoken by a variety of different speakers. This data set is designed to help train simple machine learning models. It is covered in more detail at [https://arxiv.org/abs/1804.03209](https://arxiv.org/abs/1804.03209).

Version 0.01 of the data set (configuration `"v0.01"`) was released on August 3rd 2017 and contains 64,727 audio files.
Version 0.02 of the data set (configuration `"v0.02"`) was released on April 11th 2018 and contains 105,829 audio files.

### Supported Tasks and Leaderboards

* `keyword-spotting`: the dataset can be used to train and evaluate keyword spotting systems. The task is to detect preregistered keywords by classifying utterances into a predefined set of words. The task is usually performed on-device for fast response time, so accuracy, model size, and inference time are all crucial.

### Languages

The language data in SpeechCommands is in English (BCP-47 `en`).

## Dataset Structure

### Data Instances

Example of a core word (`"label"` is a word, `"is_unknown"` is `False`):

```python
{
  "file": "no/7846fd85_nohash_0.wav",
  "audio": {
    "path": "no/7846fd85_nohash_0.wav",
    "array": array([-0.00021362, -0.00027466, -0.00036621, ..., 0.00079346, 0.00091553, 0.00079346]),
    "sampling_rate": 16000
  },
  "label": 1,  # "no"
  "is_unknown": False,
  "speaker_id": "7846fd85",
  "utterance_id": 0
}
```

Example of an auxiliary word (`"label"` is a word, `"is_unknown"` is `True`):

```python
{
  "file": "tree/8b775397_nohash_0.wav",
  "audio": {
    "path": "tree/8b775397_nohash_0.wav",
    "array": array([-0.00854492, -0.01339722, -0.02026367, ..., 0.00274658, 0.00335693, 0.0005188]),
    "sampling_rate": 16000
  },
  "label": 28,  # "tree"
  "is_unknown": True,
  "speaker_id": "1b88bf70",
  "utterance_id": 0
}
```

Example of the background noise (`_silence_`) class:

```python
{
  "file": "_silence_/doing_the_dishes.wav",
  "audio": {
    "path": "_silence_/doing_the_dishes.wav",
    "array": array([0., 0., 0., ..., -0.00592041, -0.00405884, -0.00253296]),
    "sampling_rate": 16000
  },
  "label": 30,  # "_silence_"
  "is_unknown": False,
  "speaker_id": "None",
  "utterance_id": 0  # doesn't make sense here
}
```

### Data Fields

* `file`: relative audio filename inside the original archive.
* `audio`: dictionary containing a relative audio filename, a decoded audio array, and the sampling rate.
  Note that when accessing the audio column with `dataset[0]["audio"]`, the audio is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the `"audio"` column: `dataset[0]["audio"]` should always be preferred over `dataset["audio"][0]`.
* `label`: either the word pronounced in an audio sample or the background noise (`_silence_`) class. Note that it is an integer value corresponding to the class name.
* `is_unknown`: whether a word is auxiliary. `False` if the word is a core word or `_silence_`, `True` if it is an auxiliary word.
* `speaker_id`: unique id of a speaker. `None` if the label is `_silence_`.
* `utterance_id`: incremental id of a word utterance within the same speaker.

### Data Splits

The dataset has two versions (= configurations): `"v0.01"` and `"v0.02"`. `"v0.02"` contains more words (see section [Source Data](#source-data) for more details).

|       | train | validation | test |
|-------|------:|-----------:|-----:|
| v0.01 | 51093 |       6799 | 3081 |
| v0.02 | 84848 |       9982 | 4890 |

Note that in the train and validation sets, examples of the `_silence_` class are longer than 1 second. You can use the following code to sample 1-second examples from the longer ones:

```python
from random import randint


def sample_noise(example, silence_label):
    # Use this function to extract random 1 sec slices of each _silence_ utterance,
    # e.g. inside `torch.utils.data.Dataset.__getitem__()`.
    # `silence_label` is the integer id of the `_silence_` class, for example
    # `dataset.features["label"].str2int("_silence_")`.
    if example["label"] == silence_label:
        audio = example["audio"]
        num_samples = audio["sampling_rate"]  # number of samples in 1 second
        random_offset = randint(0, len(audio["array"]) - num_samples)
        audio["array"] = audio["array"][random_offset : random_offset + num_samples]
    return example
```

## Dataset Creation

### Curation Rationale

The primary goal of the dataset is to provide a way to build and test small models that can detect a single word from a set of target words and differentiate it from background noise or unrelated speech with as few false positives as possible.

### Source Data

#### Initial Data Collection and Normalization

The audio files were collected using crowdsourcing; see [aiyprojects.withgoogle.com/open_speech_recording](https://github.com/petewarden/extract_loudest_section) for some of the open source audio collection code that was used. The goal was to gather examples of people speaking single-word commands, rather than conversational sentences, so they were prompted for individual words over the course of a five minute session.

In version 0.01 thirty different words were recorded: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go", "Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Bed", "Bird", "Cat", "Dog", "Happy", "House", "Marvin", "Sheila", "Tree", "Wow".

In version 0.02 more words were added: "Backward", "Forward", "Follow", "Learn", "Visual".

In both versions, ten of them are used as commands by convention: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go". Other words are considered to be auxiliary (in the current implementation they are marked by a `True` value of the `"is_unknown"` feature). Their function is to teach a model to distinguish core words from unrecognized ones.

The `_silence_` label contains a set of longer audio clips that are either recordings or a mathematical simulation of noise.

#### Who are the source language producers?
The audio files were collected using crowdsourcing.

### Annotations

#### Annotation process

Labels are the list of words prepared in advance. Speakers were prompted for individual words over the course of a five minute session.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

The dataset consists of recordings from people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Creative Commons BY 4.0 License ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode)).

### Citation Information

```
@article{speechcommandsv2,
  author = { {Warden}, P.},
  title = "{Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition}",
  journal = {ArXiv e-prints},
  archivePrefix = "arXiv",
  eprint = {1804.03209},
  primaryClass = "cs.CL",
  keywords = {Computer Science - Computation and Language, Computer Science - Human-Computer Interaction},
  year = 2018,
  month = apr,
  url = {https://arxiv.org/abs/1804.03209},
}
```

### Contributions

Thanks to [@polinaeterna](https://github.com/polinaeterna) for adding this dataset.
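
The core-word example under Data Instances suggests that a `file` path encodes the word, the speaker id, and the utterance id in the form `<word>/<speaker_id>_nohash_<utterance_id>.wav`. The following is a minimal sketch under that assumption. `parse_file_field` is a hypothetical helper for illustration, not part of any dataset loader, and it assumes that background-noise files under `_silence_/` carry no speaker or utterance information.

```python
from pathlib import PurePosixPath


def parse_file_field(file):
    """Split a relative `file` path into (word, speaker_id, utterance_id).

    Hypothetical helper: assumes `<word>/<speaker_id>_nohash_<utterance_id>.wav`.
    """
    path = PurePosixPath(file)
    word = path.parent.name
    speaker_id, sep, utterance = path.stem.partition("_nohash_")
    if not sep or not utterance.isdigit():
        # e.g. "_silence_/doing_the_dishes.wav": no speaker or utterance id.
        return word, None, None
    return word, speaker_id, int(utterance)


print(parse_file_field("no/7846fd85_nohash_0.wav"))
print(parse_file_field("_silence_/doing_the_dishes.wav"))
```

A helper like this can be useful for grouping clips by speaker when working from the raw archive, where only the file paths are available.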

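
The card describes the auxiliary words as a way to teach a model to tell core commands apart from unrecognized speech. One way this could be used for keyword spotting is to collapse every auxiliary word into a single target while keeping the core words and `_silence_` as they are. The sketch below assumes that approach; the `_unknown_` target name and the `kws_target` helper are illustrative assumptions, and `label_names` is a toy stand-in for the real integer-to-name mapping, which would come from the dataset's own label feature.

```python
def kws_target(example, label_names):
    """Return the keyword-spotting target name for one example.

    Auxiliary words (is_unknown == True) collapse into "_unknown_", an
    illustrative name; core words and `_silence_` keep their own names.
    """
    if example["is_unknown"]:
        return "_unknown_"
    return label_names[example["label"]]


# Toy label table for the demo only, using the ids shown in the card's examples.
label_names = {1: "no", 28: "tree", 30: "_silence_"}

examples = [
    {"label": 1, "is_unknown": False},
    {"label": 28, "is_unknown": True},
    {"label": 30, "is_unknown": False},
]
print([kws_target(e, label_names) for e in examples])
```

Keying on `is_unknown` rather than on a hard-coded word list keeps the sketch independent of which configuration (`"v0.01"` or `"v0.02"`) is loaded.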

Provider:

maas

Created:

2025-04-21
