ml_spoken_words | Multilingual Processing Dataset | Speech Recognition Dataset

ModelScope Community · Updated 2025-06-13 · Indexed 2025-02-15
Multilingual Processing
Speech Recognition
Download link:
https://modelscope.cn/datasets/MLCommons/ml_spoken_words
Resource description:
# Dataset Card for Multilingual Spoken Words

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://mlcommons.org/en/multilingual-spoken-words/
- **Repository:** https://github.com/harvard-edge/multilingual_kws
- **Paper:** https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/fe131d7f5a6b38b23cc967316c13dae2-Paper-round2.pdf
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages collectively spoken by over 5 billion people, intended for academic research and commercial applications in keyword spotting and spoken term search, and licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). Its use cases range from voice-enabled consumer devices to call center automation. The dataset is generated by applying forced alignment to crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset.

Data is provided in two formats: `wav` (16 kHz) and `opus` (48 kHz). Default configuration names follow the pattern `"{lang}_{format}"`, so to load, for example, Tatar in wav format do:

```python
ds = load_dataset("MLCommons/ml_spoken_words", "tt_wav")
```

To download multiple languages into a single dataset, pass a list of languages to the `languages` argument:

```python
ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt", "br"])
```

To download a specific format, pass it to the `format` argument (the default format is `wav`):

```python
ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt", "br"], format="opus")
```

Note that each time you provide a different set of languages, the examples are generated from scratch, even if you have already loaded some of those languages before, because a custom configuration is created for each combination (the data is **not** redownloaded, though).
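For reference, here is a minimal sketch of working with the two sampling rates. It assumes the Hugging Face `datasets` library used in the examples above and the `{lang}_{format}` configuration naming, and lazily resamples the 48 kHz opus audio to 16 kHz with the `datasets.Audio` feature:

```python
from datasets import Audio, load_dataset

# Load the Tatar opus configuration (48 kHz audio), train split only.
ds = load_dataset("MLCommons/ml_spoken_words", "tt_opus", split="train")

# Cast the audio column so clips are decoded and resampled to 16 kHz on access;
# the stored opus files themselves are not modified.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

print(ds[0]["audio"]["sampling_rate"])  # 16000
```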
### Supported Tasks and Leaderboards

Keyword spotting, Spoken term search

### Languages

The dataset is multilingual. To specify several languages to download, pass a list of them to the `languages` argument:

```python
ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt", "br"])
```

The dataset contains data for the following languages:

High-resourced (>100 hours):

* Basque (1.7G, 118h)
* Catalan (8.7G, 615h)
* English (26G, 1957h)
* French (9.3G, 754h)
* German (14G, 1083h)
* Italian (2.2G, 155h)
* Kinyarwanda (6.1G, 422h)
* Persian (4.5G, 327h)
* Polish (1.8G, 130h)
* Russian (2.1G, 137h)
* Spanish (4.9G, 349h)
* Welsh (4.5G, 108h)

## Dataset Structure

### Data Instances

```python
{'file': 'абзар_common_voice_tt_17737010.opus',
 'is_valid': True,
 'language': 0,
 'speaker_id': '687025afd5ce033048472754c8d2cb1cf8a617e469866bbdb3746e2bb2194202094a715906f91feb1c546893a5d835347f4869e7def2e360ace6616fb4340e38',
 'gender': 0,
 'keyword': 'абзар',
 'audio': {'path': 'абзар_common_voice_tt_17737010.opus',
           'array': array([2.03458695e-34, 2.03458695e-34, 2.03458695e-34, ...,
                           2.03458695e-34, 2.03458695e-34, 2.03458695e-34]),
           'sampling_rate': 48000}}
```

### Data Fields

* file: string, relative audio path inside the archive
* is_valid: whether a sample is valid
* language: language of an instance. Only meaningful when providing multiple languages to the dataset loader (for example, `load_dataset("ml_spoken_words", languages=["ar", "tt"])`)
* speaker_id: unique id of a speaker. Can be "NA" if an instance is invalid
* gender: speaker gender. Can be one of `["MALE", "FEMALE", "OTHER", "NAN"]`
* keyword: word spoken in the current sample
* audio: a dictionary containing the relative path to the audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column (`dataset[0]["audio"]`), the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the `"audio"` column: `dataset[0]["audio"]` should always be preferred over `dataset["audio"][0]`.

### Data Splits

The data for each language is split into train / validation / test parts.
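To make the access pattern described under Data Fields concrete, here is a short sketch. It assumes the Hugging Face `datasets` library, the `tt_wav` configuration used in the examples above, and the split names listed in Data Splits; it indexes the sample before touching the `audio` column and filters out invalid samples up front:

```python
from datasets import load_dataset

# Load one language/format configuration; each is split into train / validation / test.
ds = load_dataset("MLCommons/ml_spoken_words", "tt_wav")
train = ds["train"]

# Index the sample first, then access "audio": only this one clip is decoded.
clip = train[0]["audio"]
print(train[0]["keyword"], clip["sampling_rate"], clip["array"].shape)

# Drop invalid samples (their speaker_id is "NA") before any large-scale decoding.
valid_train = train.filter(lambda example: example["is_valid"])
print(f"{len(valid_train)} valid of {len(train)} training examples")
```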
## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

The data comes from the Common Voice dataset.

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset is licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) and can be used for academic research and commercial applications in keyword spotting and spoken term search.

### Citation Information

```
@inproceedings{mazumder2021multilingual,
  title={Multilingual Spoken Words Corpus},
  author={Mazumder, Mark and Chitlangia, Sharad and Banbury, Colby and Kang, Yiping and Ciro, Juan Manuel and Achorn, Keith and Galvez, Daniel and Sabini, Mark and Mattson, Peter and Kanter, David and others},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
  year={2021}
}
```

### Contributions

Thanks to [@polinaeterna](https://github.com/polinaeterna) for adding this dataset.
Provider:
maas
Created:
2025-02-08