
speech_commands

ModelScope community, updated 2026-05-06, indexed 2025-04-26
Download link:
https://modelscope.cn/datasets/google/speech_commands
# Dataset Card for SpeechCommands

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://www.tensorflow.org/datasets/catalog/speech_commands
- **Repository:** [More Information Needed]
- **Paper:** [Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition](https://arxiv.org/pdf/1804.03209.pdf)
- **Leaderboard:** [More Information Needed]
- **Point of Contact:** Pete Warden, petewarden@google.com

### Dataset Summary

This is a set of one-second .wav audio files, each containing a single spoken English word or background noise. These words are from a small set of commands, and are spoken by a variety of different speakers. This data set is designed to help train simple machine learning models. It is covered in more detail at [https://arxiv.org/abs/1804.03209](https://arxiv.org/abs/1804.03209).

Version 0.01 of the data set (configuration `"v0.01"`) was released on August 3rd 2017 and contains 64,727 audio files.
Version 0.02 of the data set (configuration `"v0.02"`) was released on April 11th 2018 and contains 105,829 audio files.

### Supported Tasks and Leaderboards

* `keyword-spotting`: the dataset can be used to train and evaluate keyword spotting systems. The task is to detect preregistered keywords by classifying utterances into a predefined set of words. The task is usually performed on-device for fast response times, so accuracy, model size, and inference time are all crucial.

### Languages

The language data in SpeechCommands is in English (BCP-47 `en`).

## Dataset Structure

### Data Instances

Example of a core word (`"label"` is a word, `"is_unknown"` is `False`):

```python
{
  "file": "no/7846fd85_nohash_0.wav",
  "audio": {
    "path": "no/7846fd85_nohash_0.wav",
    "array": array([ -0.00021362, -0.00027466, -0.00036621, ...,  0.00079346,  0.00091553,  0.00079346]),
    "sampling_rate": 16000
  },
  "label": 1,  # "no"
  "is_unknown": False,
  "speaker_id": "7846fd85",
  "utterance_id": 0
}
```

Example of an auxiliary word (`"label"` is a word, `"is_unknown"` is `True`):

```python
{
  "file": "tree/8b775397_nohash_0.wav",
  "audio": {
    "path": "tree/8b775397_nohash_0.wav",
    "array": array([ -0.00854492, -0.01339722, -0.02026367, ...,  0.00274658,  0.00335693,  0.0005188]),
    "sampling_rate": 16000
  },
  "label": 28,  # "tree"
  "is_unknown": True,
  "speaker_id": "1b88bf70",
  "utterance_id": 0
}
```

Example of the background noise (`_silence_`) class:

```python
{
  "file": "_silence_/doing_the_dishes.wav",
  "audio": {
    "path": "_silence_/doing_the_dishes.wav",
    "array": array([ 0.        ,  0.        ,  0.        , ..., -0.00592041, -0.00405884, -0.00253296]),
    "sampling_rate": 16000
  },
  "label": 30,  # "_silence_"
  "is_unknown": False,
  "speaker_id": "None",
  "utterance_id": 0  # doesn't make sense here
}
```

### Data Fields

* `file`: relative audio filename inside the original archive.
* `audio`: dictionary containing a relative audio filename, a decoded audio array, and the sampling rate.
  Note that when accessing the audio column, as in `dataset[0]["audio"]`, the audio is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the `"audio"` column: `dataset[0]["audio"]` should always be preferred over `dataset["audio"][0]`.
* `label`: either the word pronounced in an audio sample or the background noise (`_silence_`) class. Note that it is an integer value corresponding to the class name.
* `is_unknown`: whether a word is auxiliary. It is `False` if the word is a core word or `_silence_`, and `True` if the word is an auxiliary word.
* `speaker_id`: unique id of a speaker. It is `None` if the label is `_silence_`.
* `utterance_id`: incremental id of a word utterance within the same speaker.

### Data Splits

The dataset has two versions (= configurations): `"v0.01"` and `"v0.02"`. `"v0.02"` contains more words (see section [Source Data](#source-data) for more details).

|       | train | validation | test |
|-------|------:|-----------:|-----:|
| v0.01 | 51093 |       6799 | 3081 |
| v0.02 | 84848 |       9982 | 4890 |

Note that in the train and validation sets, examples of the `_silence_` class are longer than 1 second. You can use the following code to sample 1-second examples from the longer ones. It reads the `audio` and `label` fields described under Data Fields:

```python
from random import randint


def sample_noise(example, silence_label):
    # Use this function to extract random 1 sec slices of each _silence_ utterance,
    # e.g. inside `torch.utils.data.Dataset.__getitem__()`.
    # `silence_label` is the integer id of the `_silence_` class (30 in the
    # example above), because `label` is stored as an integer, not a string.
    if example["label"] == silence_label:
        audio = example["audio"]
        one_second = audio["sampling_rate"]  # number of samples in one second
        random_offset = randint(0, len(audio["array"]) - one_second - 1)
        audio["array"] = audio["array"][random_offset : random_offset + one_second]
    return example
```

## Dataset Creation

### Curation Rationale

The primary goal of the dataset is to provide a way to build and test small models that can detect a single word from a set of target words and differentiate it from background noise or unrelated speech with as few false positives as possible.

### Source Data

#### Initial Data Collection and Normalization

The audio files were collected through crowdsourcing; see [aiyprojects.withgoogle.com/open_speech_recording](https://github.com/petewarden/extract_loudest_section) for some of the open source audio collection code that was used. The goal was to gather examples of people speaking single-word commands, rather than conversational sentences, so they were prompted for individual words over the course of a five-minute session.

In version 0.01, thirty different words were recorded: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go", "Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Bed", "Bird", "Cat", "Dog", "Happy", "House", "Marvin", "Sheila", "Tree", "Wow".

In version 0.02, more words were added: "Backward", "Forward", "Follow", "Learn", "Visual".

In both versions, ten of them are used as commands by convention: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go". The other words are considered auxiliary (in the current implementation, this is marked by a `True` value of the `"is_unknown"` feature). Their function is to teach a model to distinguish core words from unrecognized ones.

The `_silence_` label contains a set of longer audio clips that are either recordings or a mathematical simulation of noise.

#### Who are the source language producers?
The audio files were collected using crowdsourcing.

### Annotations

#### Annotation process

The labels are a list of words prepared in advance. Speakers were prompted for individual words over the course of a five-minute session.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

The dataset consists of recordings from people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Creative Commons BY 4.0 License ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode)).

### Citation Information

```
@article{speechcommandsv2,
   author = { {Warden}, P.},
    title = "{Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition}",
  journal = {ArXiv e-prints},
  archivePrefix = "arXiv",
  eprint = {1804.03209},
  primaryClass = "cs.CL",
  keywords = {Computer Science - Computation and Language, Computer Science - Human-Computer Interaction},
    year = 2018,
    month = apr,
    url = {https://arxiv.org/abs/1804.03209},
}
```

### Contributions

Thanks to [@polinaeterna](https://github.com/polinaeterna) for adding this dataset.
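
To illustrate how the `label` and `is_unknown` fields under Data Fields relate to the word lists under Source Data, here is a minimal sketch. It assumes that integer labels follow the order in which the words are listed, with `_silence_` last. That assumption agrees with the three v0.01 examples under Data Instances (`no` = 1, `tree` = 28, `_silence_` = 30), but the card does not confirm it for every class or for v0.02. The names `CORE_WORDS`, `AUXILIARY_WORDS_V001`, `LABELS_V001`, `is_unknown`, and `describe` are hypothetical helpers, not part of any library.

```python
# Hypothetical reconstruction of the v0.01 label table. The ordering is an
# assumption inferred from the examples in this card, not an official mapping.
CORE_WORDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]
AUXILIARY_WORDS_V001 = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "bed", "bird", "cat", "dog", "happy", "house", "marvin", "sheila", "tree", "wow",
]
SILENCE = "_silence_"
LABELS_V001 = CORE_WORDS + AUXILIARY_WORDS_V001 + [SILENCE]


def is_unknown(word: str) -> bool:
    """Return True for auxiliary words, False for core words and _silence_."""
    return word in AUXILIARY_WORDS_V001


def describe(word: str) -> dict:
    """Build the label-related fields of an example for a given class name."""
    return {"label": LABELS_V001.index(word), "is_unknown": is_unknown(word)}


if __name__ == "__main__":
    for name in ("no", "tree", SILENCE):
        print(name, describe(name))
```

When working with the actual dataset object, reading the class names from its own features is safer than relying on a hand-written list like this one.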

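
The `file`, `speaker_id`, and `utterance_id` fields appear to be related through the file naming visible in the Data Instances examples, `<word>/<speaker>_nohash_<utterance>.wav`. The sketch below parses such a path. The function name `parse_filename`, the `ParsedName` type, and the regular expression are assumptions for illustration only. Background-noise files such as `_silence_/doing_the_dishes.wav` do not follow the pattern, so the sketch returns `None` as the speaker for them. Note also that the `tree` example in the card lists a `speaker_id` that differs from its filename, so treat the filename-derived value as a guess rather than ground truth.

```python
import re
from typing import NamedTuple, Optional

# Assumed naming pattern: "<word>/<speaker>_nohash_<utterance>.wav"
_PATTERN = re.compile(
    r"^(?P<word>[^/]+)/(?P<speaker>[0-9a-f]+)_nohash_(?P<utt>\d+)\.wav$"
)


class ParsedName(NamedTuple):
    word: str
    speaker_id: Optional[str]
    utterance_id: int


def parse_filename(path: str) -> ParsedName:
    """Split a relative path into word, speaker id, and utterance id."""
    match = _PATTERN.match(path)
    if match is None:
        # Files that do not follow the pattern (e.g. background-noise clips):
        # keep the folder name as the word and report no speaker.
        word = path.split("/", 1)[0]
        return ParsedName(word, None, 0)
    return ParsedName(match["word"], match["speaker"], int(match["utt"]))


if __name__ == "__main__":
    print(parse_filename("no/7846fd85_nohash_0.wav"))
    print(parse_filename("_silence_/doing_the_dishes.wav"))
```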
Provider: maas
Created: 2025-04-21