MLCommons/unsupervised_peoples_speech

Name: MLCommons/unsupervised_peoples_speech
Creator: MLCommons
Published: 2025-02-27 18:26:32
License: No description provided

Source: Hugging Face. Updated 2025-02-27. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/MLCommons/unsupervised_peoples_speech

Summary:

An unlabeled dataset of over one million hours of audio obtained from Archive.org, licensed for academic and commercial use under CC-BY and CC-BY-SA. It includes audio from a diverse set of speakers. The audio is stored in tar files that average 5 GB each. Most recordings are between 1 and 10 minutes long, and the sample rate is predominantly 44.1 kHz. The dataset has not been preprocessed or annotated, and it is biased towards American-accented English.
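Because the audio ships as large tar files, it can help to inspect a shard as a stream instead of extracting it. The following is a minimal sketch using only the Python standard library. The member names and file extensions in the demo are hypothetical, since the listing does not say how files inside a shard are named or which audio formats they use.

```python
import io
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_tar(fileobj):
    """Count regular files in a tar archive by lowercase file extension."""
    counts = Counter()
    # Mode "r|*" reads the archive as a forward-only stream, so a
    # multi-gigabyte shard is never loaded into memory at once.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if member.isfile():
                counts[PurePosixPath(member.name).suffix.lower()] += 1
    return counts


def make_demo_tar(names):
    """Build a tiny in-memory tar with placeholder bytes (not real audio)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in names:
            payload = b"\x00" * 16
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


# Hypothetical member names, used only to exercise the function.
demo = make_demo_tar(["item_a/talk.mp3", "item_a/talk.flac", "item_b/hearing.MP3"])
print(summarize_tar(demo))
```

For a real shard, the same function could be called on a file opened in binary mode, for example open(path, "rb"), where path points to a downloaded tar file.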

Provider:

MLCommons

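Original information summary

Dataset card: Unsupervised People's Speech

Dataset Description

Dataset Summary

Unsupervised People's Speech is a collection of audio files extracted from Archive.org, licensed for academic and commercial use under CC-BY and CC-BY-SA. It contains over one million hours of audio from a diverse set of speakers.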

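Dataset Structure

Audio folders

Folders containing the raw audio. Because Hugging Face does not support more than 10,000 files in a single directory, we split the audio into two directories.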
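The directory split described above amounts to capping the number of files per directory. The sketch below illustrates that idea under stated assumptions: the directory prefix "audio", the numbering scheme, and the shard names are hypothetical and are not taken from the dataset itself.

```python
def assign_directories(file_names, max_per_dir=10_000, prefix="audio"):
    """Map each file name to a directory so that no directory exceeds the cap."""
    if max_per_dir < 1:
        raise ValueError("max_per_dir must be at least 1")
    layout = {}
    for index, name in enumerate(sorted(file_names)):
        # Integer division groups consecutive files into numbered directories.
        directory = f"{prefix}{index // max_per_dir + 1}"
        layout.setdefault(directory, []).append(name)
    return layout


# 15,000 hypothetical shard names fit into two directories under a 10,000 cap.
shards = [f"{i:06d}.tar" for i in range(15_000)]
layout = assign_directories(shards)
print({directory: len(names) for directory, names in layout.items()})
# prints {'audio1': 10000, 'audio2': 5000}
```

Sorting the names first keeps the assignment deterministic, so the same file always lands in the same directory.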

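Dataset Creation

Source Data

Initial Data Collection and Normalization

The data was downloaded through the archive.org API. No data inference was performed.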
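The card does not document which archive.org endpoints or queries were used. As a rough illustration only, the sketch below builds request URLs for archive.org's public advanced search endpoint and its per-item download path. The query string, the choice of returned fields, and the item and file names are assumptions made for this example, not the authors' method. No network request is made.

```python
from urllib.parse import quote, urlencode

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"
DOWNLOAD_ENDPOINT = "https://archive.org/download"

# Illustrative query only; the filter actually used for the dataset is not documented.
ASSUMED_QUERY = (
    'mediatype:audio AND licenseurl:"http://creativecommons.org/licenses/by-sa/4.0/"'
)


def build_search_url(query, rows=50, page=1):
    """Return a search URL that asks for item identifiers and license URLs as JSON."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("fl[]", "licenseurl"),
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def build_download_url(identifier, file_name):
    """Return the download URL for one file inside one archive.org item."""
    return f"{DOWNLOAD_ENDPOINT}/{quote(identifier)}/{quote(file_name)}"


print(build_search_url(ASSUMED_QUERY))
# "some_item" and "talk 1.mp3" are hypothetical names.
print(build_download_url("some_item", "talk 1.mp3"))
```

Keeping URL construction separate from the HTTP call makes it easy to review the requested license filter before any data is fetched.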

Preprocessing

No preprocessing was performed.

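Annotations

Annotation process

No manual annotation was performed. We downloaded only the source audio. In particular, no "forced alignment" or "segmentation" was applied.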

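Personal and Sensitive Information

Some of our sources are legal and government proceedings, spoken stories, speeches, and similar recordings. Because these files were intended as public documents and licensed as such, the individuals involved are presumably aware of this.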

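Considerations for Using the Data

Discussion of Biases

Our data was downloaded from archive.org, so it is biased towards whatever users decide to upload there. Almost all of the data is American-accented English.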

Additional Information

Licensing Information

The source data contains material under CC-BY-SA and CC-BY licenses. We license this dataset under https://creativecommons.org/licenses/by-sa/4.0/

Citation Information

Please cite:

@article{USP,
  author = {Daniel Galvez and Ryan Hileman and Rafael Mosquera and Juan Ciro and Kurt Bollacker and Peter Mattson and David Kanter},
  title = {Unsupervised Peoples Speech (The Million Hour Audio Dataset)},
  year = {2023},
  url = {https://huggingface.co/datasets/MLCommons/peoples_speech},
}
