
sil-ai/audio-keyword-spotting

Hugging Face · Updated 2023-07-24 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/sil-ai/audio-keyword-spotting
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- other
language:
- eng
- en
- spa
- es
- ind
- id
license: cc-by-4.0
multilinguality:
- multilingual
source_datasets:
- extended|common_voice
- MLCommons/ml_spoken_words
task_categories:
- automatic-speech-recognition
task_ids: []
pretty_name: Audio Keyword Spotting
tags:
- other-keyword-spotting
---

# Dataset Card for Audio Keyword Spotting

## Table of Contents
- [Table of Contents](#table-of-contents)

## Dataset Description
- **Homepage:** https://sil.ai.org
- **Point of Contact:** [SIL AI email](mailto:idx_aqua@sil.org)
- **Source Data:** [MLCommons/ml_spoken_words](https://huggingface.co/datasets/MLCommons/ml_spoken_words), [trabina GitHub](https://github.com/wswu/trabina)

![sil-ai logo](https://s3.amazonaws.com/moonup/production/uploads/1661440873726-6108057a823007eaf0c7bd10.png)

## Dataset Summary
The initial version of this dataset is a subset of [MLCommons/ml_spoken_words](https://huggingface.co/datasets/MLCommons/ml_spoken_words), which is derived from Common Voice, designed for easier loading. Specifically, the subset consists of `ml_spoken_words` files filtered by the names and placenames transliterated in Bible translations, as found in [trabina](https://github.com/wswu/trabina). For our initial experiment, we have focused only on English, Spanish, and Indonesian, three languages whose name spellings are frequently used in other translations. We anticipate growing this dataset in the future to include additional keywords and other languages as the experiment progresses.

### Data Fields
* file: string, relative audio path inside the archive
* is_valid: whether the sample is valid
* language: language of the instance
* speaker_id: unique ID of the speaker. Can be "NA" if the instance is invalid
* gender: speaker gender. Can be one of `["MALE", "FEMALE", "OTHER", "NAN"]`
* keyword: word spoken in the sample
* audio: a dictionary containing the relative path to the audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column with `dataset[0]["audio"]`, the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the "audio" column: `dataset[0]["audio"]` should always be preferred over `dataset["audio"][0]`.

### Data Splits
The data for each language is split into train / validation / test parts.

## Supported Tasks
Keyword spotting and spoken term search

### Personal and Sensitive Information
The dataset consists of recordings from people who have donated their voices online. You agree not to attempt to determine the identity of speakers.

### Licensing Information
The dataset is licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) and can be used for academic research and commercial applications in keyword spotting and spoken term search.
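The access-order advice under Data Fields can be illustrated without downloading any audio. The following is a minimal sketch, not the `datasets` library itself: `LazyAudioDataset` is a hypothetical stand-in that counts how many files get "decoded" under each access pattern, and the file names and the 16000 Hz sampling rate are placeholder assumptions.

```python
class LazyAudioDataset:
    """Hypothetical stand-in for a dataset whose audio column is decoded lazily."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # number of files decoded so far

    def _decode(self, path):
        # Placeholder for the real decoding and resampling work.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 16000}  # assumed rate

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only the requested sample is decoded.
            path = self.paths[key]
            return {"file": path, "audio": self._decode(path)}
        if key == "audio":
            # Column access: every sample in the dataset is decoded.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


def count_decodes(access, n_files=1000):
    """Run one access pattern on a fresh dataset and report decode calls."""
    ds = LazyAudioDataset(f"clip_{i}.wav" for i in range(n_files))
    access(ds)
    return ds.decode_calls


row_first = count_decodes(lambda ds: ds[0]["audio"])     # index, then column
column_first = count_decodes(lambda ds: ds["audio"][0])  # column, then index
print(row_first, column_first)
```

In this toy model the row-first pattern decodes a single file, while the column-first pattern decodes all of them before the index is applied, which is the cost the card warns about.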
Provider:
sil-ai
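Original Information Summary

Dataset Overview

  • Name: Audio Keyword Spotting
  • Languages: English (eng, en), Spanish (spa, es), Indonesian (ind, id)
  • License: CC-BY-4.0
  • Multilinguality: multilingual
  • Source datasets: extended from Common Voice; MLCommons/ml_spoken_words
  • Task category: automatic speech recognition
  • Task name: keyword spotting
  • Tags: other-keyword-spotting

Dataset Details

  • Description: The initial version of this dataset is a subset of MLCommons/ml_spoken_words, designed for easier loading. The subset consists of ml_spoken_words files filtered by the names and place names transliterated in Bible translations. It currently covers English, Spanish, and Indonesian, with plans to add more keywords and other languages in the future.
  • Data fields:
    • file: relative path of the audio file inside the archive
    • is_valid: whether the sample is valid
    • language: language of the instance
    • speaker_id: unique ID of the speaker; "NA" if the instance is invalid
    • gender: speaker gender, one of ["MALE", "FEMALE", "OTHER", "NAN"]
    • keyword: word spoken in the sample
    • audio: a dictionary containing the relative path to the audio file, the decoded audio array, and the sampling rate
  • Data splits: The data for each language is split into train / validation / test parts.
  • Supported tasks: keyword spotting and spoken term search
  • Licensing information: The dataset is licensed under CC-BY 4.0 and can be used for academic research and commercial applications in keyword spotting and spoken term search.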
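The metadata fields listed above lend themselves to simple filtering before any audio is touched. The following is a minimal sketch over plain dictionaries: the records are invented for illustration, and the boolean type of `is_valid` and the two-letter `language` values are assumptions rather than something the card specifies.

```python
from collections import Counter

# Gender values as listed in the card.
GENDERS = {"MALE", "FEMALE", "OTHER", "NAN"}

# Invented example records that follow the documented field names.
samples = [
    {"file": "a.wav", "is_valid": True, "language": "en",
     "speaker_id": "spk1", "gender": "FEMALE", "keyword": "david"},
    {"file": "b.wav", "is_valid": False, "language": "en",
     "speaker_id": "NA", "gender": "NAN", "keyword": "david"},
    {"file": "c.wav", "is_valid": True, "language": "es",
     "speaker_id": "spk2", "gender": "MALE", "keyword": "maria"},
    {"file": "d.wav", "is_valid": True, "language": "en",
     "speaker_id": "spk3", "gender": "OTHER", "keyword": "maria"},
]


def valid_samples(records, language):
    """Keep valid records for one language, checking the gender vocabulary."""
    kept = []
    for record in records:
        if record["gender"] not in GENDERS:
            raise ValueError(f"unexpected gender value: {record['gender']!r}")
        if record["is_valid"] and record["language"] == language:
            kept.append(record)
    return kept


def keyword_counts(records):
    """Count how many records exist per keyword."""
    return Counter(record["keyword"] for record in records)


english = valid_samples(samples, "en")
counts = keyword_counts(english)
print([r["file"] for r in english], dict(counts))
```

Filtering on metadata first keeps the set of samples small before the comparatively expensive audio decoding step.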