
suryatmodulus/yodas2_sidon

Hugging Face · Updated 2025-12-08 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/suryatmodulus/yodas2_sidon
Resource description:
--- license: cc-by-3.0 task_categories: - text-to-speech - automatic-speech-recognition language: - aa - ab - af - ak - am - ar - as - ay - az - ba - be - bg - bh - bi - bm - bn - bo - br - bs - ca - co - cr - cs - cy - da - de - dz - ee - el - en - eo - es - et - eu - fa - ff - fi - fj - fo - fr - fy - ga - gd - gl - gn - gu - ha - hi - ho - hr - ht - hu - hy - ia - id - ie - ig - ik - is - it - iu - iw - ja - jv - ka - ki - kk - kl - km - kn - ko - ks - ku - ky - la - lb - lg - ln - lo - lt - lv - mg - mi - mk - ml - mn - mr - ms - my - na - nd - ne - nl - no - nv - oc - om - or - pa - pl - ps - pt - qu - rm - rn - ro - ru - rw - sa - sc - sd - sg - sh - si - sk - sl - sm - sn - so - sq - sr - st - su - sv - sw - ta - te - tg - th - ti - tk - tn - to - tr - ts - tt - ug - uk - ur - uz - ve - vi - vo - wo - xh - yi - yo - zh - zu tags: - speech - synthetic - youtube - yodas size_categories: - 100M<n<1B configs: - config_name: aa000 data_files: - split: train path: aa000/*.tar.gz - config_name: ab000 data_files: - split: train path: ab000/*.tar.gz - config_name: af000 data_files: - split: train path: af000/*.tar.gz - config_name: ak000 data_files: - split: train path: ak000/*.tar.gz - config_name: am000 data_files: - split: train path: am000/*.tar.gz - config_name: ar000 data_files: - split: train path: ar000/*.tar.gz - config_name: as000 data_files: - split: train path: as000/*.tar.gz - config_name: ay000 data_files: - split: train path: ay000/*.tar.gz - config_name: az000 data_files: - split: train path: az000/*.tar.gz - config_name: ba000 data_files: - split: train path: ba000/*.tar.gz - config_name: be000 data_files: - split: train path: be000/*.tar.gz - config_name: bg000 data_files: - split: train path: bg000/*.tar.gz - config_name: bh000 data_files: - split: train path: bh000/*.tar.gz - config_name: bi000 data_files: - split: train path: bi000/*.tar.gz - config_name: bm000 data_files: - split: train path: bm000/*.tar.gz - config_name: bn000 data_files: - 
split: train path: bn000/*.tar.gz - config_name: bo000 data_files: - split: train path: bo000/*.tar.gz - config_name: br000 data_files: - split: train path: br000/*.tar.gz - config_name: bs000 data_files: - split: train path: bs000/*.tar.gz - config_name: ca000 data_files: - split: train path: ca000/*.tar.gz - config_name: co000 data_files: - split: train path: co000/*.tar.gz - config_name: cr000 data_files: - split: train path: cr000/*.tar.gz - config_name: cs000 data_files: - split: train path: cs000/*.tar.gz - config_name: cy000 data_files: - split: train path: cy000/*.tar.gz - config_name: da000 data_files: - split: train path: da000/*.tar.gz - config_name: de000 data_files: - split: train path: de000/*.tar.gz - config_name: de100 data_files: - split: train path: de100/*.tar.gz - config_name: de101 data_files: - split: train path: de101/*.tar.gz - config_name: de102 data_files: - split: train path: de102/*.tar.gz - config_name: dz000 data_files: - split: train path: dz000/*.tar.gz - config_name: ee000 data_files: - split: train path: ee000/*.tar.gz - config_name: el000 data_files: - split: train path: el000/*.tar.gz - config_name: en000 data_files: - split: train path: en000/*.tar.gz - config_name: en001 data_files: - split: train path: en001/*.tar.gz - config_name: en002 data_files: - split: train path: en002/*.tar.gz - config_name: en003 data_files: - split: train path: en003/*.tar.gz - config_name: en004 data_files: - split: train path: en004/*.tar.gz - config_name: en005 data_files: - split: train path: en005/*.tar.gz - config_name: en100 data_files: - split: train path: en100/*.tar.gz - config_name: en101 data_files: - split: train path: en101/*.tar.gz - config_name: en102 data_files: - split: train path: en102/*.tar.gz - config_name: en103 data_files: - split: train path: en103/*.tar.gz - config_name: en104 data_files: - split: train path: en104/*.tar.gz - config_name: en105 data_files: - split: train path: en105/*.tar.gz - config_name: en106 data_files: 
- split: train path: en106/*.tar.gz - config_name: en107 data_files: - split: train path: en107/*.tar.gz - config_name: en108 data_files: - split: train path: en108/*.tar.gz - config_name: en109 data_files: - split: train path: en109/*.tar.gz - config_name: en110 data_files: - split: train path: en110/*.tar.gz - config_name: en111 data_files: - split: train path: en111/*.tar.gz - config_name: en112 data_files: - split: train path: en112/*.tar.gz - config_name: en113 data_files: - split: train path: en113/*.tar.gz - config_name: en114 data_files: - split: train path: en114/*.tar.gz - config_name: en115 data_files: - split: train path: en115/*.tar.gz - config_name: en116 data_files: - split: train path: en116/*.tar.gz - config_name: en117 data_files: - split: train path: en117/*.tar.gz - config_name: en118 data_files: - split: train path: en118/*.tar.gz - config_name: en119 data_files: - split: train path: en119/*.tar.gz - config_name: en120 data_files: - split: train path: en120/*.tar.gz - config_name: en121 data_files: - split: train path: en121/*.tar.gz - config_name: en122 data_files: - split: train path: en122/*.tar.gz - config_name: en123 data_files: - split: train path: en123/*.tar.gz - config_name: en124 data_files: - split: train path: en124/*.tar.gz - config_name: en125 data_files: - split: train path: en125/*.tar.gz - config_name: en126 data_files: - split: train path: en126/*.tar.gz - config_name: en127 data_files: - split: train path: en127/*.tar.gz - config_name: eo000 data_files: - split: train path: eo000/*.tar.gz - config_name: es000 data_files: - split: train path: es000/*.tar.gz - config_name: es100 data_files: - split: train path: es100/*.tar.gz - config_name: es101 data_files: - split: train path: es101/*.tar.gz - config_name: es102 data_files: - split: train path: es102/*.tar.gz - config_name: es103 data_files: - split: train path: es103/*.tar.gz - config_name: es104 data_files: - split: train path: es104/*.tar.gz - config_name: es105 
data_files: - split: train path: es105/*.tar.gz - config_name: es106 data_files: - split: train path: es106/*.tar.gz - config_name: et000 data_files: - split: train path: et000/*.tar.gz - config_name: eu000 data_files: - split: train path: eu000/*.tar.gz - config_name: fa000 data_files: - split: train path: fa000/*.tar.gz - config_name: ff000 data_files: - split: train path: ff000/*.tar.gz - config_name: fi000 data_files: - split: train path: fi000/*.tar.gz - config_name: fj000 data_files: - split: train path: fj000/*.tar.gz - config_name: fo000 data_files: - split: train path: fo000/*.tar.gz - config_name: fr000 data_files: - split: train path: fr000/*.tar.gz - config_name: fr100 data_files: - split: train path: fr100/*.tar.gz - config_name: fr101 data_files: - split: train path: fr101/*.tar.gz - config_name: fr102 data_files: - split: train path: fr102/*.tar.gz - config_name: fr103 data_files: - split: train path: fr103/*.tar.gz - config_name: fy000 data_files: - split: train path: fy000/*.tar.gz - config_name: ga000 data_files: - split: train path: ga000/*.tar.gz - config_name: gd000 data_files: - split: train path: gd000/*.tar.gz - config_name: gl000 data_files: - split: train path: gl000/*.tar.gz - config_name: gn000 data_files: - split: train path: gn000/*.tar.gz - config_name: gu000 data_files: - split: train path: gu000/*.tar.gz - config_name: ha000 data_files: - split: train path: ha000/*.tar.gz - config_name: hi000 data_files: - split: train path: hi000/*.tar.gz - config_name: hi100 data_files: - split: train path: hi100/*.tar.gz - config_name: ho000 data_files: - split: train path: ho000/*.tar.gz - config_name: hr000 data_files: - split: train path: hr000/*.tar.gz - config_name: ht000 data_files: - split: train path: ht000/*.tar.gz - config_name: hu000 data_files: - split: train path: hu000/*.tar.gz - config_name: hy000 data_files: - split: train path: hy000/*.tar.gz - config_name: ia000 data_files: - split: train path: ia000/*.tar.gz - config_name: 
id000 data_files: - split: train path: id000/*.tar.gz - config_name: id100 data_files: - split: train path: id100/*.tar.gz - config_name: id101 data_files: - split: train path: id101/*.tar.gz - config_name: ie000 data_files: - split: train path: ie000/*.tar.gz - config_name: ig000 data_files: - split: train path: ig000/*.tar.gz - config_name: ik000 data_files: - split: train path: ik000/*.tar.gz - config_name: is000 data_files: - split: train path: is000/*.tar.gz - config_name: it000 data_files: - split: train path: it000/*.tar.gz - config_name: it100 data_files: - split: train path: it100/*.tar.gz - config_name: it101 data_files: - split: train path: it101/*.tar.gz - config_name: iu000 data_files: - split: train path: iu000/*.tar.gz - config_name: iw000 data_files: - split: train path: iw000/*.tar.gz - config_name: ja000 data_files: - split: train path: ja000/*.tar.gz - config_name: ja100 data_files: - split: train path: ja100/*.tar.gz - config_name: jv000 data_files: - split: train path: jv000/*.tar.gz - config_name: ka000 data_files: - split: train path: ka000/*.tar.gz - config_name: ki000 data_files: - split: train path: ki000/*.tar.gz - config_name: kk000 data_files: - split: train path: kk000/*.tar.gz - config_name: kl000 data_files: - split: train path: kl000/*.tar.gz - config_name: km000 data_files: - split: train path: km000/*.tar.gz - config_name: kn000 data_files: - split: train path: kn000/*.tar.gz - config_name: ko000 data_files: - split: train path: ko000/*.tar.gz - config_name: ko100 data_files: - split: train path: ko100/*.tar.gz - config_name: ko101 data_files: - split: train path: ko101/*.tar.gz - config_name: ko102 data_files: - split: train path: ko102/*.tar.gz - config_name: ko103 data_files: - split: train path: ko103/*.tar.gz - config_name: ks000 data_files: - split: train path: ks000/*.tar.gz - config_name: ku000 data_files: - split: train path: ku000/*.tar.gz - config_name: ky000 data_files: - split: train path: ky000/*.tar.gz - 
config_name: la000 data_files: - split: train path: la000/*.tar.gz - config_name: lb000 data_files: - split: train path: lb000/*.tar.gz - config_name: lg000 data_files: - split: train path: lg000/*.tar.gz - config_name: ln000 data_files: - split: train path: ln000/*.tar.gz - config_name: lo000 data_files: - split: train path: lo000/*.tar.gz - config_name: lt000 data_files: - split: train path: lt000/*.tar.gz - config_name: lv000 data_files: - split: train path: lv000/*.tar.gz - config_name: mg000 data_files: - split: train path: mg000/*.tar.gz - config_name: mi000 data_files: - split: train path: mi000/*.tar.gz - config_name: mk000 data_files: - split: train path: mk000/*.tar.gz - config_name: ml000 data_files: - split: train path: ml000/*.tar.gz - config_name: mn000 data_files: - split: train path: mn000/*.tar.gz - config_name: mr000 data_files: - split: train path: mr000/*.tar.gz - config_name: ms000 data_files: - split: train path: ms000/*.tar.gz - config_name: my000 data_files: - split: train path: my000/*.tar.gz - config_name: na000 data_files: - split: train path: na000/*.tar.gz - config_name: nd000 data_files: - split: train path: nd000/*.tar.gz - config_name: ne000 data_files: - split: train path: ne000/*.tar.gz - config_name: nl000 data_files: - split: train path: nl000/*.tar.gz - config_name: nl100 data_files: - split: train path: nl100/*.tar.gz - config_name: no000 data_files: - split: train path: no000/*.tar.gz - config_name: nv000 data_files: - split: train path: nv000/*.tar.gz - config_name: oc000 data_files: - split: train path: oc000/*.tar.gz - config_name: om000 data_files: - split: train path: om000/*.tar.gz - config_name: or000 data_files: - split: train path: or000/*.tar.gz - config_name: pa000 data_files: - split: train path: pa000/*.tar.gz - config_name: pl000 data_files: - split: train path: pl000/*.tar.gz - config_name: ps000 data_files: - split: train path: ps000/*.tar.gz - config_name: pt000 data_files: - split: train path: pt000/*.tar.gz 
- config_name: pt100 data_files: - split: train path: pt100/*.tar.gz - config_name: pt101 data_files: - split: train path: pt101/*.tar.gz - config_name: pt102 data_files: - split: train path: pt102/*.tar.gz - config_name: pt103 data_files: - split: train path: pt103/*.tar.gz - config_name: qu000 data_files: - split: train path: qu000/*.tar.gz - config_name: rm000 data_files: - split: train path: rm000/*.tar.gz - config_name: rn000 data_files: - split: train path: rn000/*.tar.gz - config_name: ro000 data_files: - split: train path: ro000/*.tar.gz - config_name: ru000 data_files: - split: train path: ru000/*.tar.gz - config_name: ru001 data_files: - split: train path: ru001/*.tar.gz - config_name: ru100 data_files: - split: train path: ru100/*.tar.gz - config_name: ru101 data_files: - split: train path: ru101/*.tar.gz - config_name: ru102 data_files: - split: train path: ru102/*.tar.gz - config_name: ru103 data_files: - split: train path: ru103/*.tar.gz - config_name: ru104 data_files: - split: train path: ru104/*.tar.gz - config_name: ru105 data_files: - split: train path: ru105/*.tar.gz - config_name: ru106 data_files: - split: train path: ru106/*.tar.gz - config_name: rw000 data_files: - split: train path: rw000/*.tar.gz - config_name: sa000 data_files: - split: train path: sa000/*.tar.gz - config_name: sc000 data_files: - split: train path: sc000/*.tar.gz - config_name: sd000 data_files: - split: train path: sd000/*.tar.gz - config_name: sg000 data_files: - split: train path: sg000/*.tar.gz - config_name: sh000 data_files: - split: train path: sh000/*.tar.gz - config_name: si000 data_files: - split: train path: si000/*.tar.gz - config_name: sk000 data_files: - split: train path: sk000/*.tar.gz - config_name: sl000 data_files: - split: train path: sl000/*.tar.gz - config_name: sm000 data_files: - split: train path: sm000/*.tar.gz - config_name: sn000 data_files: - split: train path: sn000/*.tar.gz - config_name: so000 data_files: - split: train path: 
so000/*.tar.gz - config_name: sq000 data_files: - split: train path: sq000/*.tar.gz - config_name: sr000 data_files: - split: train path: sr000/*.tar.gz - config_name: st000 data_files: - split: train path: st000/*.tar.gz - config_name: su000 data_files: - split: train path: su000/*.tar.gz - config_name: sv000 data_files: - split: train path: sv000/*.tar.gz - config_name: sw000 data_files: - split: train path: sw000/*.tar.gz - config_name: ta000 data_files: - split: train path: ta000/*.tar.gz - config_name: te000 data_files: - split: train path: te000/*.tar.gz - config_name: tg000 data_files: - split: train path: tg000/*.tar.gz - config_name: th000 data_files: - split: train path: th000/*.tar.gz - config_name: th100 data_files: - split: train path: th100/*.tar.gz - config_name: ti000 data_files: - split: train path: ti000/*.tar.gz - config_name: tk000 data_files: - split: train path: tk000/*.tar.gz - config_name: tn000 data_files: - split: train path: tn000/*.tar.gz - config_name: to000 data_files: - split: train path: to000/*.tar.gz - config_name: tr000 data_files: - split: train path: tr000/*.tar.gz - config_name: tr100 data_files: - split: train path: tr100/*.tar.gz - config_name: ts000 data_files: - split: train path: ts000/*.tar.gz - config_name: tt000 data_files: - split: train path: tt000/*.tar.gz - config_name: ug000 data_files: - split: train path: ug000/*.tar.gz - config_name: uk000 data_files: - split: train path: uk000/*.tar.gz - config_name: uk100 data_files: - split: train path: uk100/*.tar.gz - config_name: ur000 data_files: - split: train path: ur000/*.tar.gz - config_name: uz000 data_files: - split: train path: uz000/*.tar.gz - config_name: ve000 data_files: - split: train path: ve000/*.tar.gz - config_name: vi000 data_files: - split: train path: vi000/*.tar.gz - config_name: vi100 data_files: - split: train path: vi100/*.tar.gz - config_name: vo000 data_files: - split: train path: vo000/*.tar.gz - config_name: wo000 data_files: - split: train 
path: wo000/*.tar.gz - config_name: xh000 data_files: - split: train path: xh000/*.tar.gz - config_name: yi000 data_files: - split: train path: yi000/*.tar.gz - config_name: yo000 data_files: - split: train path: yo000/*.tar.gz - config_name: zh000 data_files: - split: train path: zh000/*.tar.gz - config_name: zu000 data_files: - split: train path: zu000/*.tar.gz ---

# YODAS2-Sidon

## Overview

This dataset is a **cleansed version of YODAS-2**, processed with the **Sidon** speech restoration model for **Speech Synthesis** and **Spoken Language Modeling**.
YODAS-2 is a massive, multilingual YouTube-derived dataset. We applied the Sidon restoration model to remove background noise and enhance audio quality, making the data suitable for high-quality generation tasks.
We resampled the original Sidon output to 24 kHz due to storage constraints.

The dataset is provided in **[WebDataset](https://github.com/webdataset/webdataset) format** for efficient large-scale training.

- **Source**: [YODAS-2 (YouTube-Oriented Dataset for Audio and Speech)](https://huggingface.co/datasets/espnet/yodas2)
- **Format**: WebDataset (`.tar.gz` shards)
- **License**: [CC-BY-3.0](https://creativecommons.org/licenses/by/3.0/)

---

## Dataset Structure

Each sample in the dataset contains:

- **`flac`**: audio file (24 kHz, single channel, restored)
- **`metadata.json`** *(optional)*: metadata including language, YouTube video ID, and transcription

Example (inside a `.tar.gz` shard):

```
000001.flac
000001.metadata.json
000002.flac
000002.metadata.json
...
```

---

## How to Use

### With 🤗 Datasets

You can load the WebDataset directly with Hugging Face's `datasets` library:

```python
from datasets import load_dataset
from huggingface_hub import list_repo_files
from IPython.display import Audio  # only needed for notebook playback

repo_id = "sarulab-speech/yodas2_sidon"
subset = "en000"

# Collect the URLs of every shard that belongs to the chosen subset.
all_files = list_repo_files(repo_id, repo_type="dataset")
urls = [
    f"https://huggingface.co/datasets/{repo_id}/resolve/main/{f}"
    for f in sorted(all_files)
    if f.endswith(".tar.gz") and f.startswith(subset)
]
print(f"Found {len(urls)} shards.")

# Stream the shards instead of downloading the whole subset up front.
dataset = load_dataset(
    "webdataset",
    data_files={"train": urls},
    streaming=True,
)["train"]

# Inspect the first sample and play it back in a notebook.
sample = next(iter(dataset))
audio = sample["flac"]
print(sample["metadata.json"])
Audio(audio["array"], rate=audio["sampling_rate"])
```

Replace `subset` with the name of the desired subset (for example `ja000` or `en100`).

-----

## Citation

If you use this dataset, please cite Sidon and the original YODAS paper:

```
@misc{nakata2025sidonfastrobustopensource,
  title={Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing},
  author={Wataru Nakata and Yuki Saito and Yota Ueda and Hiroshi Saruwatari},
  year={2025},
  eprint={2509.17052},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2509.17052},
}
```

```
@inproceedings{li2023yodas,
  title={Yodas: Youtube-Oriented Dataset for Audio and Speech},
  author={Li, Xinjian and Takamichi, Shinnosuke and Saeki, Takaaki and Chen, William and Shiota, Sayaka and Watanabe, Shinji},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  pages={1--8},
  year={2023},
  organization={IEEE}
}
```

-----

## License

This dataset is released under [CC-BY-3.0](https://creativecommons.org/licenses/by/3.0/).

-----

## Acknowledgements

* **Original data**: [YODAS2](https://huggingface.co/datasets/espnet/yodas2)
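
The shard layout described under Dataset Structure (files that share a numeric prefix form one sample, and `metadata.json` may be absent) can also be inspected without any dataset library. The following is a minimal sketch using only the Python standard library. The helper names `group_shard_members` and `build_demo_shard`, the placeholder bytes, and the `language` metadata key are assumptions made for illustration; the actual metadata field names are not specified in this card. It also assumes flat member names with no directory component.

```python
import io
import json
import tarfile
from collections import defaultdict


def group_shard_members(fileobj):
    """Group the files in a .tar.gz shard by sample key.

    The key is taken as the member name up to the first dot, so
    "000001.flac" and "000001.metadata.json" land in the same sample.
    """
    samples = defaultdict(dict)
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


def build_demo_shard():
    """Build a tiny in-memory shard with placeholder bytes (not real audio)."""
    entries = {
        "000001.flac": b"placeholder-audio",
        # Hypothetical metadata key, chosen only for this demo.
        "000001.metadata.json": json.dumps({"language": "en"}).encode(),
        "000002.flac": b"placeholder-audio",  # metadata is optional
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in entries.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


samples = group_shard_members(build_demo_shard())
for key, parts in sorted(samples.items()):
    # Fall back to None when a sample ships without metadata.
    meta = json.loads(parts["metadata.json"]) if "metadata.json" in parts else None
    print(key, sorted(parts), meta)
```

To run the same grouping on a downloaded shard, pass an opened file instead of the demo buffer, for example `group_shard_members(open(path, "rb"))`, where `path` points at a local `.tar.gz` file.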
Provided by:
suryatmodulus