
ml_spoken_words

ModelScope community. Updated 2025-11-25. Indexed 2025-02-15.
Download link:
https://modelscope.cn/datasets/MLCommons/ml_spoken_words
Resource description:
# Dataset Card for Multilingual Spoken Words

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://mlcommons.org/en/multilingual-spoken-words/
- **Repository:** https://github.com/harvard-edge/multilingual_kws
- **Paper:** https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/fe131d7f5a6b38b23cc967316c13dae2-Paper-round2.pdf
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages collectively spoken by over 5 billion people, for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation.
This dataset is generated by applying forced alignment on crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset.

Data is provided in two formats: `wav` (16 kHz) and `opus` (48 kHz). Default configurations are named `"{lang}_{format}"`, so to load, for example, Tatar in wav format:

```python
from datasets import load_dataset

ds = load_dataset("MLCommons/ml_spoken_words", "tt_wav")
```

To download multiple languages in a single dataset, pass a list of languages to the `languages` argument:

```python
ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt", "br"])
```

To download a specific format, pass it to the `format` argument (the default format is `wav`):

```python
ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt", "br"], format="opus")
```

Note that each time you provide a different set of languages, examples are generated from scratch, even if you have already loaded one or several of those languages before, because a custom configuration is created each time (the data is **not** redownloaded, though).

### Supported Tasks and Leaderboards

Keyword spotting, Spoken term search

### Languages

The dataset is multilingual.
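Because each default configuration name combines a language code with an audio format, as described in the summary, a small helper can build and validate those names before they are passed to `load_dataset`. The following is a minimal sketch: `config_name` and `SUPPORTED_FORMATS` are hypothetical names introduced here for illustration and are not part of the dataset loader. The helper does not check whether a language code is actually available.

```python
# Hypothetical helper, not part of the dataset loader.
# Formats and sampling rates (in Hz) are taken from the dataset card.
SUPPORTED_FORMATS = {"wav": 16_000, "opus": 48_000}


def config_name(lang: str, audio_format: str = "wav") -> str:
    """Build a default configuration name of the form "{lang}_{format}"."""
    if audio_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported format {audio_format!r}; "
            f"expected one of {sorted(SUPPORTED_FORMATS)}"
        )
    return f"{lang}_{audio_format}"


if __name__ == "__main__":
    print(config_name("tt"))          # tt_wav
    print(config_name("br", "opus"))  # br_opus
```

The returned string would then be used as the second argument, for example `load_dataset("MLCommons/ml_spoken_words", config_name("tt"))`.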
To specify several languages to download, pass a list of them to the `languages` argument:

```python
ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt", "br"])
```

The dataset contains data for the following languages:

Low-resourced (<10 hours):

* Arabic (0.1G, 7.6h)
* Assamese (0.9M, 0.1h)
* Breton (69M, 5.6h)
* Chuvash (28M, 2.1h)
* Chinese (zh-CN) (42M, 3.1h)
* Dhivehi (0.7M, 0.04h)
* Frisian (0.1G, 9.6h)
* Georgian (20M, 1.4h)
* Guarani (0.7M, 1.3h)
* Greek (84M, 6.7h)
* Hakha Chin (26M, 0.1h)
* Hausa (90M, 1.0h)
* Interlingua (58M, 4.0h)
* Irish (38M, 3.2h)
* Latvian (51M, 4.2h)
* Lithuanian (21M, 0.46h)
* Maltese (88M, 7.3h)
* Oriya (0.7M, 0.1h)
* Romanian (59M, 4.5h)
* Sakha (42M, 3.3h)
* Slovenian (43M, 3.0h)
* Slovak (31M, 1.9h)
* Sursilvan (61M, 4.8h)
* Tamil (8.8M, 0.6h)
* Vallader (14M, 1.2h)
* Vietnamese (1.2M, 0.1h)

Medium-resourced (>10 & <100 hours):

* Czech (0.3G, 24h)
* Dutch (0.8G, 70h)
* Estonian (0.2G, 19h)
* Esperanto (1.3G, 77h)
* Indonesian (0.1G, 11h)
* Kyrgyz (0.1G, 12h)
* Mongolian (0.1G, 12h)
* Portuguese (0.7G, 58h)
* Swedish (0.1G, 12h)
* Tatar (4G, 30h)
* Turkish (1.3G, 29h)
* Ukrainian (0.2G, 18h)

High-resourced (>100 hours):

* Basque (1.7G, 118h)
* Catalan (8.7G, 615h)
* English (26G, 1957h)
* French (9.3G, 754h)
* German (14G, 1083h)
* Italian (2.2G, 155h)
* Kinyarwanda (6.1G, 422h)
* Persian (4.5G, 327h)
* Polish (1.8G, 130h)
* Russian (2.1G, 137h)
* Spanish (4.9G, 349h)
* Welsh (4.5G, 108h)

## Dataset Structure

### Data Instances

```python
{'file': 'абзар_common_voice_tt_17737010.opus',
 'is_valid': True,
 'language': 0,
 'speaker_id': '687025afd5ce033048472754c8d2cb1cf8a617e469866bbdb3746e2bb2194202094a715906f91feb1c546893a5d835347f4869e7def2e360ace6616fb4340e38',
 'gender': 0,
 'keyword': 'абзар',
 'audio': {'path': 'абзар_common_voice_tt_17737010.opus',
           'array': array([2.03458695e-34, 2.03458695e-34, 2.03458695e-34, ...,
                           2.03458695e-34, 2.03458695e-34, 2.03458695e-34]),
           'sampling_rate': 48000}}
```

### Data Fields

* file: string, relative audio path inside the archive
* is_valid: whether a sample is valid
* language: language of an instance. Meaningful only when multiple languages are passed to the dataset loader (for example, `load_dataset("ml_spoken_words", languages=["ar", "tt"])`)
* speaker_id: unique id of a speaker. Can be "NA" if an instance is invalid
* gender: speaker gender. Can be one of `["MALE", "FEMALE", "OTHER", "NAN"]`
* keyword: the word spoken in the sample
* audio: a dictionary containing the relative path to the audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column with `dataset[0]["audio"]`, the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the "audio" column: `dataset[0]["audio"]` should always be preferred over `dataset["audio"][0]`

### Data Splits

The data for each language is split into train / validation / test parts.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

The data comes from the Common Voice dataset.

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

The dataset consists of recordings from people who have donated their voice online. You agree to not attempt to determine the identity of speakers.
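Returning to the access-order advice in the Data Fields section: the cost difference between `dataset[0]["audio"]` and `dataset["audio"][0]` can be illustrated with a toy model. The sketch below assumes, as the card describes, that audio is decoded on access. `ToyAudioDataset` and `count_decodes` are hypothetical stand-ins written for this illustration; they are not the `datasets` library implementation, and real behavior may differ between library versions.

```python
class ToyAudioDataset:
    """Hypothetical stand-in for a dataset whose audio is decoded on access."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # how many files have been "decoded" so far

    def _decode(self, path):
        # Pretend to decode and resample one file.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 48000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only the requested example is decoded.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: every example is decoded before indexing.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


def count_decodes(access, n_files=1000):
    """Run one access pattern on a fresh toy dataset and count decodes."""
    ds = ToyAudioDataset(f"clip_{i}.opus" for i in range(n_files))
    access(ds)
    return ds.decode_calls


row_first = count_decodes(lambda ds: ds[0]["audio"])
column_first = count_decodes(lambda ds: ds["audio"][0])
print(row_first, column_first)  # 1 1000
```

In this toy model, indexing the row first touches a single file, while indexing the column first decodes every file just to return one of them.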
## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset is licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) and can be used for academic research and commercial applications in keyword spotting and spoken term search.

### Citation Information

```
@inproceedings{mazumder2021multilingual,
  title={Multilingual Spoken Words Corpus},
  author={Mazumder, Mark and Chitlangia, Sharad and Banbury, Colby and Kang, Yiping and Ciro, Juan Manuel and Achorn, Keith and Galvez, Daniel and Sabini, Mark and Mattson, Peter and Kanter, David and others},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
  year={2021}
}
```

### Contributions

Thanks to [@polinaeterna](https://github.com/polinaeterna) for adding this dataset.

Provider: maas
Created: 2025-02-08