renumics/speech_commands_enriched
收藏Hugging Face2023-09-27 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/renumics/speech_commands_enriched
下载链接
链接失效反馈官方服务:
资源简介:
---
annotations_creators:
- other
language_creators:
- crowdsourced
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
- 10K<n<100K
source_datasets:
- extended|speech_commands
task_categories:
- audio-classification
task_ids:
- keyword-spotting
pretty_name: SpeechCommands
config_names:
- v0.01
- v0.02
tags:
- spotlight
- enriched
- renumics
- enhanced
- audio
- classification
- extended
---
# Dataset Card for SpeechCommands
## Dataset Description
- **Homepage:** [Renumics Homepage](https://renumics.com/?hf-dataset-card=speech-commands-enriched)
- **GitHub** [Spotlight](https://github.com/Renumics/spotlight)
- **Dataset Homepage** [tensorflow.org/datasets](https://www.tensorflow.org/datasets/catalog/speech_commands)
- **Paper:** [Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition](https://arxiv.org/pdf/1804.03209.pdf)
- **Leaderboard:** [More Information Needed]
### Dataset Summary
📊 [Data-centric AI](https://datacentricai.org) principles have become increasingly important for real-world use cases.
At [Renumics](https://renumics.com/?hf-dataset-card=speech-commands-enriched) we believe that classical benchmark datasets and competitions should be extended to reflect this development.
🔍 This is why we are publishing benchmark datasets with application-specific enrichments (e.g. embeddings, baseline results, uncertainties, label error scores). We hope this helps the ML community in the following ways:
1. Enable new researchers to quickly develop a profound understanding of the dataset.
2. Popularize data-centric AI principles and tooling in the ML community.
3. Encourage the sharing of meaningful qualitative insights in addition to traditional quantitative metrics.
📚 This dataset is an enriched version of the [SpeechCommands Dataset](https://huggingface.co/datasets/speech_commands).
### Explore the Dataset

The enrichments allow you to quickly gain insights into the dataset. The open source data curation tool [Renumics Spotlight](https://github.com/Renumics/spotlight) enables that with just a few lines of code:
Install datasets and Spotlight via [pip](https://packaging.python.org/en/latest/key_projects/#pip):
```python
!pip install renumics-spotlight datasets[audio]
```
> **_Notice:_** On Linux, non-Python dependency on libsndfile package must be installed manually. See [Datasets - Installation](https://huggingface.co/docs/datasets/installation#audio) for more information.
Load the dataset from huggingface in your notebook:
```python
import datasets
dataset = datasets.load_dataset("renumics/speech_commands_enriched", "v0.01")
```
[//]: <> (TODO: Update this!)
Start exploring with a simple view:
```python
from renumics import spotlight
df = dataset.to_pandas()
df_show = df.drop(columns=['audio'])
spotlight.show(df_show, port=8000, dtype={"file": spotlight.Audio})
```
You can use the UI to interactively configure the view on the data. Depending on the concrete tasks (e.g. model comparison, debugging, outlier detection) you might want to leverage different enrichments and metadata.
### SpeechCommands Dataset
This is a set of one-second .wav audio files, each containing a single spoken
English word or background noise. These words are from a small set of commands, and are spoken by a
variety of different speakers. This data set is designed to help train simple
machine learning models. It is covered in more detail at [https://arxiv.org/abs/1804.03209](https://arxiv.org/abs/1804.03209).
Version 0.01 of the data set (configuration `"v0.01"`) was released on August 3rd 2017 and contains
64,727 audio files.
Version 0.02 of the data set (configuration `"v0.02"`) was released on April 11th 2018 and
contains 105,829 audio files.
### Supported Tasks and Leaderboards
* `keyword-spotting`: the dataset can be used to train and evaluate keyword
spotting systems. The task is to detect preregistered keywords by classifying utterances
into a predefined set of words. The task is usually performed on-device for the
fast response time. Thus, accuracy, model size, and inference time are all crucial.
### Languages
The language data in SpeechCommands is in English (BCP-47 `en`).
## Dataset Structure
### Data Instances
Example of a core word (`"label"` is a word, `"is_unknown"` is `False`):
```python
{
"file": "no/7846fd85_nohash_0.wav",
"audio": {
"path": "no/7846fd85_nohash_0.wav",
"array": array([ -0.00021362, -0.00027466, -0.00036621, ..., 0.00079346,
0.00091553, 0.00079346]),
"sampling_rate": 16000
},
"label": 1, # "no"
"is_unknown": False,
"speaker_id": "7846fd85",
"utterance_id": 0
}
```
Example of an auxiliary word (`"label"` is a word, `"is_unknown"` is `True`)
```python
{
"file": "tree/8b775397_nohash_0.wav",
"audio": {
"path": "tree/8b775397_nohash_0.wav",
"array": array([ -0.00854492, -0.01339722, -0.02026367, ..., 0.00274658,
0.00335693, 0.0005188]),
"sampling_rate": 16000
},
"label": 28, # "tree"
"is_unknown": True,
"speaker_id": "1b88bf70",
"utterance_id": 0
}
```
Example of background noise (`_silence_`) class:
```python
{
"file": "_silence_/doing_the_dishes.wav",
"audio": {
"path": "_silence_/doing_the_dishes.wav",
"array": array([ 0. , 0. , 0. , ..., -0.00592041,
-0.00405884, -0.00253296]),
"sampling_rate": 16000
},
"label": 30, # "_silence_"
"is_unknown": False,
"speaker_id": "None",
"utterance_id": 0 # doesn't make sense here
}
```
### Data Fields
* `file`: relative audio filename inside the original archive.
* `audio`: dictionary containing a relative audio filename,
a decoded audio array, and the sampling rate. Note that when accessing
the audio column: `dataset[0]["audio"]` the audio is automatically decoded
and resampled to `dataset.features["audio"].sampling_rate`.
Decoding and resampling of a large number of audios might take a significant
amount of time. Thus, it is important to first query the sample index before
the `"audio"` column, i.e. `dataset[0]["audio"]` should always be preferred
over `dataset["audio"][0]`.
* `label`: either word pronounced in an audio sample or background noise (`_silence_`) class.
Note that it's an integer value corresponding to the class name.
* `is_unknown`: if a word is auxiliary. Equals to `False` if a word is a core word or `_silence_`,
`True` if a word is an auxiliary word.
* `speaker_id`: unique id of a speaker. Equals to `None` if label is `_silence_`.
* `utterance_id`: incremental id of a word utterance within the same speaker.
### Data Splits
The dataset has two versions (= configurations): `"v0.01"` and `"v0.02"`. `"v0.02"`
contains more words (see section [Source Data](#source-data) for more details).
| | train | validation | test |
|----- |------:|-----------:|-----:|
| v0.01 | 51093 | 6799 | 3081 |
| v0.02 | 84848 | 9982 | 4890 |
Note that in train and validation sets examples of `_silence_` class are longer than 1 second.
You can use the following code to sample 1-second examples from the longer ones:
```python
def sample_noise(example):
# Use this function to extract random 1 sec slices of each _silence_ utterance,
# e.g. inside `torch.utils.data.Dataset.__getitem__()`
from random import randint
if example["label"] == "_silence_":
random_offset = randint(0, len(example["speech"]) - example["sample_rate"] - 1)
example["speech"] = example["speech"][random_offset : random_offset + example["sample_rate"]]
return example
```
## Dataset Creation
### Curation Rationale
The primary goal of the dataset is to provide a way to build and test small
models that can detect a single word from a set of target words and differentiate it
from background noise or unrelated speech with as few false positives as possible.
### Source Data
#### Initial Data Collection and Normalization
The audio files were collected using crowdsourcing, see
[aiyprojects.withgoogle.com/open_speech_recording](https://github.com/petewarden/extract_loudest_section)
for some of the open source audio collection code that was used. The goal was to gather examples of
people speaking single-word commands, rather than conversational sentences, so
they were prompted for individual words over the course of a five minute
session.
In version 0.01 thirty different words were recoded: "Yes", "No", "Up", "Down", "Left",
"Right", "On", "Off", "Stop", "Go", "Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine",
"Bed", "Bird", "Cat", "Dog", "Happy", "House", "Marvin", "Sheila", "Tree", "Wow".
In version 0.02 more words were added: "Backward", "Forward", "Follow", "Learn", "Visual".
In both versions, ten of them are used as commands by convention: "Yes", "No", "Up", "Down", "Left",
"Right", "On", "Off", "Stop", "Go". Other words are considered to be auxiliary (in current implementation
it is marked by `True` value of `"is_unknown"` feature). Their function is to teach a model to distinguish core words
from unrecognized ones.
The `_silence_` label contains a set of longer audio clips that are either recordings or
a mathematical simulation of noise.
#### Who are the source language producers?
The audio files were collected using crowdsourcing.
### Annotations
#### Annotation process
Labels are the list of words prepared in advances.
Speakers were prompted for individual words over the course of a five minute
session.
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
Creative Commons BY 4.0 License ((CC-BY-4.0)[https://creativecommons.org/licenses/by/4.0/legalcode]).
### Citation Information
```
@article{speechcommandsv2,
author = { {Warden}, P.},
title = "{Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition}",
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1804.03209},
primaryClass = "cs.CL",
keywords = {Computer Science - Computation and Language, Computer Science - Human-Computer Interaction},
year = 2018,
month = apr,
url = {https://arxiv.org/abs/1804.03209},
}
```
### Contributions
[More Information Needed]
提供机构:
renumics
原始信息汇总
数据集概述
名称: SpeechCommands
版本: v0.01, v0.02
语言: 英语 (en)
许可证: cc-by-4.0
多语言性: 单语种
大小:
- 10K<n<100K (v0.01: 64,727 files)
- 100K<n<1M (v0.02: 105,829 files)
源数据集: 扩展自 speech_commands
任务类别: 音频分类
任务ID: keyword-spotting
配置名称: v0.01, v0.02
标签:
- spotlight
- enriched
- renumics
- enhanced
- audio
- classification
- extended
数据集描述
摘要: 该数据集包含一秒钟的.wav音频文件,每个文件包含一个单独的英语单词或背景噪音。这些单词来自一组小命令,由各种不同的说话者说出。数据集旨在帮助训练简单的机器学习模型。
结构:
- 数据实例: 包含核心词、辅助词和背景噪音的示例。
- 数据字段:
file,audio,label,is_unknown,speaker_id,utterance_id。 - 数据分割: 数据集有两个版本,每个版本包含训练、验证和测试集。
使用场景
支持的任务: 关键字检测 (keyword-spotting)
应用: 用于训练和评估关键字检测系统,任务是在预注册的关键字中分类语音。
数据集创建
采集理由: 主要目标是提供一种构建和测试小型模型的方法,该模型可以从一组目标单词中检测单个单词,并将其与背景噪音或无关语音区分开来,尽可能减少误报。
源数据:
- 初始数据收集和规范化: 音频文件通过众包收集。
- 语言生产者: 众包。
注释:
- 注释过程: 标签是预先准备好的单词列表。
- 注释者: 未详细说明。
注意事项
个人和敏感信息: 数据集包含在线捐赠的语音。用户同意不尝试确定此数据集中说话者的身份。
附加信息
许可证: Creative Commons BY 4.0 License (CC-BY-4.0)
引用信息: 请参阅提供的引用信息以正确引用此数据集。



