espnet/yodas

Name: espnet/yodas
Creator: espnet
Published: 2024-06-10 02:11:54
License: no description provided

Hugging Face, updated 2024-06-10, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/espnet/yodas

Resource description:

---
license: cc-by-3.0
---

Updates

- 2024/07/09: we also uploaded a new version of YODAS as [YODAS2](https://huggingface.co/datasets/espnet/yodas2); it provides unsegmented audio and a higher sampling rate (24k).

## README

This is the YODAS manual/automatic subset of our YODAS dataset. It has 369,510 hours of speech.

This dataset contains audio utterances and corresponding captions (manual or automatic) from YouTube. Note that a manual caption only indicates that it was uploaded by a user, not necessarily that it was transcribed by a human.

For more details about the YODAS dataset, please refer to [our paper](https://arxiv.org/abs/2406.00899).

## Usage

Considering the extremely large size of the entire dataset, we support two dataset loading modes:

**standard mode**: each subset is downloaded to the local disk before the first iteration.

```python
from datasets import load_dataset

# Note: this will take a very long time to download and preprocess.
# You can try a small subset for testing purposes.
ds = load_dataset('espnet/yodas', 'en000')
print(next(iter(ds['train'])))
```

**streaming mode**: most of the files are streamed instead of downloaded to your local device. It can be used to inspect this dataset quickly.

```python
from datasets import load_dataset

# this streaming loading will finish quickly
ds = load_dataset('espnet/yodas', 'en000', streaming=True)

#{'id': '9774', 'utt_id': 'YoRjzEnRcqu-00000-00000716-00000819', 'audio': {'path': None, 'array': array([-0.009552  , -0.01086426, -0.012146  , ..., -0.01992798,
#       -0.01885986, -0.01074219]), 'sampling_rate': 16000}, 'text': 'There is a saying'}
print(next(iter(ds['train'])))
```

## Subsets/Shards

There are 149 languages in this dataset. Each language is split into at least one shard to simplify processing and uploading. The raw data of each shard is at most 500 GB. Statistics for each shard can be found in the last section.

We distinguish the manual caption subset from the automatic caption subset by the first digit in each shard's name. The first digit is 0 if the shard contains manual captions and 1 if it contains automatic captions. For example, `en000` to `en005` are the English shards containing manual captions, and `en100` to `en127` contain the automatic captions.

## Reference

```
@inproceedings{li2023yodas,
  title={Yodas: Youtube-Oriented Dataset for Audio and Speech},
  author={Li, Xinjian and Takamichi, Shinnosuke and Saeki, Takaaki and Chen, William and Shiota, Sayaka and Watanabe, Shinji},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  pages={1--8},
  year={2023},
  organization={IEEE}
}
```

## Contact

If you have any questions, feel free to contact us at the following email address.

We made sure that our dataset only consisted of videos with CC licenses during our downloading. But in case you find your video unintentionally included in our dataset and would like to delete it, you can send a delete request to the following email.

Remove the parentheses `()` from the following email address:

`(lixinjian)(1217)@gmail.com`

## Statistics

Note that there is no overlap across subsets: each audio can be included in the dataset at most once.

| Subset name | Hours |
|------|--------|
|aa000|0.171472|
|ab000|0.358342|
|af000|0.880497|
|ak000|0.250858|
|am000|0.924708|
|ar000|289.707|
|as000|0.548239|
|ay000|0.0342722|
|az000|3.8537|
|ba000|0.0210556|
|be000|48.1537|
|bg000|46.8375|
|bh000|0.0127111|
|bi000|0.0125556|
|bm000|0.00214722|
|bn000|27.064|
|bo000|0.746211|
|br000|0.729914|
|bs000|9.36959|
|ca000|74.1909|
|co000|0.0418639|
|cr000|0.00584167|
|cs000|167.604|
|cy000|5.20017|
|da000|27.4345|
|de000|3063.81|
|de100|4998.11|
|de101|4995.08|
|de102|955.389|
|dz000|0.06365|
|ee000|0.0411722|
|el000|126.75|
|en000|4999.73|
|en001|5032.69|
|en002|5039.9|
|en003|5001.4|
|en004|5054.66|
|en005|4027.02|
|en100|5147.07|
|en101|5123.05|
|en102|5117.68|
|en103|5127.3|
|en104|5126.33|
|en105|5097.65|
|en106|5131.47|
|en107|5135.6|
|en108|5136.84|
|en109|5112.94|
|en110|5109|
|en111|5118.69|
|en112|5122.57|
|en113|5122.31|
|en114|5112.36|
|en115|5112.27|
|en116|5123.77|
|en117|5117.31|
|en118|5117.94|
|en119|5133.05|
|en120|5127.79|
|en121|5129.08|
|en122|5130.22|
|en123|5097.56|
|en124|5116.59|
|en125|5109.76|
|en126|5136.21|
|en127|2404.89|
|eo000|12.6874|
|es000|3737.86|
|es100|5125.25|
|es101|5130.44|
|es102|5145.66|
|es103|5138.26|
|es104|5139.57|
|es105|5138.95|
|es106|2605.26|
|et000|14.4129|
|eu000|19.6356|
|fa000|42.6734|
|ff000|0.0394972|
|fi000|212.899|
|fj000|0.0167806|
|fo000|0.183244|
|fr000|2423.7|
|fr100|5074.93|
|fr101|5057.79|
|fr102|5094.14|
|fr103|3222.95|
|fy000|0.0651667|
|ga000|1.49252|
|gd000|0.01885|
|gl000|9.52575|
|gn000|0.181356|
|gu000|1.99355|
|ha000|0.102931|
|hi000|480.79|
|hi100|2.74865|
|ho000|0.0562194|
|hr000|25.9171|
|ht000|1.07494|
|hu000|181.763|
|hy000|1.64412|
|ia000|0.0856056|
|id000|1420.09|
|id100|4902.79|
|id101|3560.82|
|ie000|0.134603|
|ig000|0.086875|
|ik000|0.00436667|
|is000|5.07075|
|it000|1454.98|
|it100|4989.62|
|it101|4242.87|
|iu000|0.0584278|
|iw000|161.373|
|ja000|1094.18|
|ja100|2929.94|
|jv000|1.08701|
|ka000|26.9727|
|ki000|0.000555556|
|kk000|3.72081|
|kl000|0.00575556|
|km000|3.98273|
|kn000|2.36041|
|ko000|2774.28|
|ko100|5018.29|
|ko101|5048.49|
|ko102|5018.27|
|ko103|2587.85|
|ks000|0.0150444|
|ku000|1.93419|
|ky000|14.3917|
|la000|7.26088|
|lb000|0.1115|
|lg000|0.00386111|
|ln000|0.188739|
|lo000|0.230986|
|lt000|17.6507|
|lv000|2.47671|
|mg000|0.169653|
|mi000|1.10089|
|mk000|5.54236|
|ml000|13.2386|
|mn000|2.0232|
|mr000|7.11602|
|ms000|28.0219|
|my000|2.35663|
|na000|0.0397056|
|nd000|0.00111111|
|ne000|2.34936|
|nl000|413.044|
|nl100|2490.13|
|no000|129.183|
|nv000|0.00319444|
|oc000|0.166108|
|om000|0.148478|
|or000|0.421436|
|pa000|1.58188|
|pl000|757.986|
|ps000|0.9871|
|pt000|1631.44|
|pt100|5044.57|
|pt101|5038.33|
|pt102|5041.59|
|pt103|3553.28|
|qu000|0.748772|
|rm000|0.192933|
|rn000|0.00401111|
|ro000|99.9175|
|ru000|4968.37|
|ru001|627.679|
|ru100|5098.3|
|ru101|5098|
|ru102|5119.43|
|ru103|5107.29|
|ru104|5121.73|
|ru105|5088.05|
|ru106|3393.44|
|rw000|0.640825|
|sa000|0.354139|
|sc000|0.00801111|
|sd000|0.0768722|
|sg000|0.000472222|
|sh000|0.250914|
|si000|4.2634|
|sk000|30.0155|
|sl000|22.9366|
|sm000|0.102333|
|sn000|0.0134722|
|so000|3.36819|
|sq000|3.48276|
|sr000|15.2849|
|st000|0.00324167|
|su000|0.0404639|
|sv000|127.411|
|sw000|1.93409|
|ta000|59.4805|
|te000|5.66794|
|tg000|0.272386|
|th000|497.14|
|th100|1.87429|
|ti000|0.343897|
|tk000|0.0651806|
|tn000|0.112181|
|to000|0.000555556|
|tr000|588.698|
|tr100|4067.68|
|ts000|0.00111111|
|tt000|0.0441194|
|ug000|0.0905|
|uk000|396.598|
|uk100|450.411|
|ur000|22.4373|
|uz000|5.29325|
|ve000|0.00355278|
|vi000|779.854|
|vi100|4963.77|
|vi101|4239.37|
|vo000|0.209436|
|wo000|0.0801528|
|xh000|0.126628|
|yi000|0.0810111|
|yo000|0.322206|
|zh000|299.368|
|zu000|0.139931|
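
The shard naming rule from the Subsets/Shards section can be sketched in a few lines of Python. This is a minimal illustration, not part of the YODAS tooling: the helper names `parse_shard` and `hours_by_caption_type` are hypothetical, and the name layout (two-letter language code, one caption-type digit, two-digit shard index) is an assumption inferred from the shard names listed in the statistics table. The sample hours are copied from that table.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the shard names in the statistics table:
# two-letter language code + caption-type digit (0 or 1) + two-digit index.
SHARD_RE = re.compile(r"^([a-z]{2})([01])(\d{2})$")


def parse_shard(name):
    """Split a shard name such as 'en103' into (language, caption type, index)."""
    match = SHARD_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected shard name: {name!r}")
    lang, kind, index = match.groups()
    # First digit: 0 means manual captions, 1 means automatic captions.
    caption_type = "manual" if kind == "0" else "automatic"
    return lang, caption_type, int(index)


def hours_by_caption_type(hours):
    """Sum per-shard hours into totals keyed by (language, caption type)."""
    totals = defaultdict(float)
    for name, value in hours.items():
        lang, caption_type, _ = parse_shard(name)
        totals[(lang, caption_type)] += value
    return dict(totals)


# A few rows copied from the statistics table.
sample_hours = {
    "hi000": 480.79,
    "hi100": 2.74865,
    "uk000": 396.598,
    "uk100": 450.411,
}
totals = hours_by_caption_type(sample_hours)
for key in sorted(totals):
    print(key, totals[key])
```

Grouping by the parsed caption type makes it easy to pick only manual-caption shards for a language, or to compare how much manual versus automatic data a language has before downloading anything.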

Provider:

espnet

Summary of original information

Dataset overview

Dataset name

YODAS manual/automatic subset

Data volume

Contains 369,510 hours of speech data.

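Collected summary

Dataset introduction

Construction

In speech recognition and audio processing, building large-scale multilingual datasets is essential to improving model performance. The YODAS dataset systematically collects publicly available videos from YouTube and filters them by Creative Commons license, which keeps the data sources legally usable and diverse. Construction involves automated audio extraction and text alignment, turning raw video streams into standardized samples that pair speech segments with their captions. The dataset is sharded by language and caption type, with each shard capped at 500 GB to ease distributed processing and storage, yielding a corpus that covers 149 languages and totals roughly 369,000 hours.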

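Features

As a dataset dedicated to audio and speech research, YODAS stands out for its scale and broad language coverage. It contains both user-uploaded manual captions and platform-generated automatic captions, and the shard naming rule clearly distinguishes the two text sources. Samples are not duplicated across shards, which keeps the dataset clean and statistically independent. The audio is stored at a 16 kHz sampling rate, balancing speech fidelity against storage cost, and provides rich training material for multilingual speech recognition, speech synthesis, and cross-lingual modeling.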

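Usage

Because of the dataset's size, YODAS offers two loading modes, standard and streaming. Standard mode downloads the selected shard in full to local disk and suits offline training that accesses the data repeatedly. Streaming mode reads the data as a stream without storing all files locally, which is convenient for quickly inspecting the dataset or running lightweight experiments. Users select a shard through the Hugging Face `datasets` library by passing its identifier, for example `'en000'` for an English manual-caption shard. This design balances ease of access against resource consumption and fits research at different scales.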
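
For quick inspection in streaming mode, a small helper can take the first few records and report each utterance's duration. The sketch below is illustrative only: `summarize`, `preview`, and `open_stream` are hypothetical names, and the record fields (`utt_id`, `text`, `audio` with `array` and `sampling_rate`) are assumed to match the sample record shown in the README. The `datasets` import is deferred into `open_stream` so that the rest of the code can be tried offline on a hand-made record.

```python
from itertools import islice


def summarize(sample):
    """Reduce one record to its id, caption, and audio duration in seconds."""
    audio = sample["audio"]
    seconds = len(audio["array"]) / audio["sampling_rate"]
    return {
        "utt_id": sample["utt_id"],
        "text": sample["text"],
        "seconds": round(seconds, 3),
    }


def preview(stream, n=3):
    """Summarize the first n records of any iterable of records."""
    return [summarize(sample) for sample in islice(stream, n)]


def open_stream(config="en000"):
    """Open one shard in streaming mode (needs network access and `datasets`)."""
    from datasets import load_dataset  # deferred so the helpers above work offline

    return load_dataset("espnet/yodas", config, streaming=True)["train"]


# Offline check with a hand-made record shaped like the README sample.
fake_record = {
    "id": "0",
    "utt_id": "example-utt",
    "audio": {"path": None, "array": [0.0] * 16000, "sampling_rate": 16000},
    "text": "hello",
}
result = preview(iter([fake_record, fake_record]), n=1)
print(result)
```

With network access, `preview(open_stream("en000"))` would apply the same summary to real records. Because `islice` stops after `n` items, only the beginning of the stream is read.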

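Background and challenges

Background overview

In speech recognition and audio processing, large-scale multilingual speech datasets are essential to advancing end-to-end models. The YODAS dataset was created in 2023 by a research team from Carnegie Mellon University and other institutions. It collects large volumes of speech audio and the corresponding captions from YouTube to address the data scarcity that limits multilingual speech recognition, speech synthesis, and related tasks. The dataset covers 149 languages and more than 360,000 hours in total. By separating manual and automatically generated caption subsets, it supplies rich supervision for model training and broadens the language coverage of speech technology.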

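Current challenges

YODAS targets a fundamental challenge in multilingual speech recognition: uneven data distribution and inconsistent quality. For smaller languages and low-resource settings in particular, the accuracy of speech-text alignment is hard to guarantee. During construction, the team faced the engineering problem of collecting, storing, and processing massive amounts of audio, which called for an efficient streaming loading mechanism to relieve local storage pressure. Ensuring copyright compliance and handling the noise and errors in user-provided captions are also key parts of quality control.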

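Common scenarios

Classic use cases

In speech technology, the scarcity of large-scale multilingual speech data has long limited how well models generalize and perform. With 149 languages and more than 360,000 hours in total, YODAS is a valuable resource for research on speech recognition, speech synthesis, and cross-lingual speech processing. By combining user-uploaded YouTube audio with the corresponding captions, it forms a highly diverse speech corpus. It is especially suited to training end-to-end automatic speech recognition systems, because it captures real-world speech variation, background noise, and multi-speaker conditions, which helps make speech technology more practical and robust.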

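Derived related work

Since its release, the YODAS dataset has led to a range of follow-up research. This work mainly uses its scale and multilingual coverage to pretrain speech foundation models; for example, general-purpose speech encoders trained on YODAS are reported to transfer well to a variety of downstream tasks. The dataset has also encouraged methodological research on data cleaning, quality assessment, and efficient use of the data. Some studies further use its caption information to explore related areas such as speech-text alignment and audio-visual multimodal learning. This body of work both confirms the value of the YODAS data and moves the speech community toward a paradigm driven by larger and higher-quality data.

Recent research on the dataset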