suryatmodulus/yodas2_sidon
Hugging Face · Updated 2025-12-08 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/suryatmodulus/yodas2_sidon
Description:
---
license: cc-by-3.0
task_categories:
- text-to-speech
- automatic-speech-recognition
language:
- aa
- ab
- af
- ak
- am
- ar
- as
- ay
- az
- ba
- be
- bg
- bh
- bi
- bm
- bn
- bo
- br
- bs
- ca
- co
- cr
- cs
- cy
- da
- de
- dz
- ee
- el
- en
- eo
- es
- et
- eu
- fa
- ff
- fi
- fj
- fo
- fr
- fy
- ga
- gd
- gl
- gn
- gu
- ha
- hi
- ho
- hr
- ht
- hu
- hy
- ia
- id
- ie
- ig
- ik
- is
- it
- iu
- iw
- ja
- jv
- ka
- ki
- kk
- kl
- km
- kn
- ko
- ks
- ku
- ky
- la
- lb
- lg
- ln
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- my
- na
- nd
- ne
- nl
- no
- nv
- oc
- om
- or
- pa
- pl
- ps
- pt
- qu
- rm
- rn
- ro
- ru
- rw
- sa
- sc
- sd
- sg
- sh
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- ti
- tk
- tn
- to
- tr
- ts
- tt
- ug
- uk
- ur
- uz
- ve
- vi
- vo
- wo
- xh
- yi
- yo
- zh
- zu
tags:
- speech
- synthetic
- youtube
- yodas
size_categories:
- 100M<n<1B
configs:
- config_name: aa000
data_files:
- split: train
path: aa000/*.tar.gz
- config_name: ab000
data_files:
- split: train
path: ab000/*.tar.gz
- config_name: af000
data_files:
- split: train
path: af000/*.tar.gz
- config_name: ak000
data_files:
- split: train
path: ak000/*.tar.gz
- config_name: am000
data_files:
- split: train
path: am000/*.tar.gz
- config_name: ar000
data_files:
- split: train
path: ar000/*.tar.gz
- config_name: as000
data_files:
- split: train
path: as000/*.tar.gz
- config_name: ay000
data_files:
- split: train
path: ay000/*.tar.gz
- config_name: az000
data_files:
- split: train
path: az000/*.tar.gz
- config_name: ba000
data_files:
- split: train
path: ba000/*.tar.gz
- config_name: be000
data_files:
- split: train
path: be000/*.tar.gz
- config_name: bg000
data_files:
- split: train
path: bg000/*.tar.gz
- config_name: bh000
data_files:
- split: train
path: bh000/*.tar.gz
- config_name: bi000
data_files:
- split: train
path: bi000/*.tar.gz
- config_name: bm000
data_files:
- split: train
path: bm000/*.tar.gz
- config_name: bn000
data_files:
- split: train
path: bn000/*.tar.gz
- config_name: bo000
data_files:
- split: train
path: bo000/*.tar.gz
- config_name: br000
data_files:
- split: train
path: br000/*.tar.gz
- config_name: bs000
data_files:
- split: train
path: bs000/*.tar.gz
- config_name: ca000
data_files:
- split: train
path: ca000/*.tar.gz
- config_name: co000
data_files:
- split: train
path: co000/*.tar.gz
- config_name: cr000
data_files:
- split: train
path: cr000/*.tar.gz
- config_name: cs000
data_files:
- split: train
path: cs000/*.tar.gz
- config_name: cy000
data_files:
- split: train
path: cy000/*.tar.gz
- config_name: da000
data_files:
- split: train
path: da000/*.tar.gz
- config_name: de000
data_files:
- split: train
path: de000/*.tar.gz
- config_name: de100
data_files:
- split: train
path: de100/*.tar.gz
- config_name: de101
data_files:
- split: train
path: de101/*.tar.gz
- config_name: de102
data_files:
- split: train
path: de102/*.tar.gz
- config_name: dz000
data_files:
- split: train
path: dz000/*.tar.gz
- config_name: ee000
data_files:
- split: train
path: ee000/*.tar.gz
- config_name: el000
data_files:
- split: train
path: el000/*.tar.gz
- config_name: en000
data_files:
- split: train
path: en000/*.tar.gz
- config_name: en001
data_files:
- split: train
path: en001/*.tar.gz
- config_name: en002
data_files:
- split: train
path: en002/*.tar.gz
- config_name: en003
data_files:
- split: train
path: en003/*.tar.gz
- config_name: en004
data_files:
- split: train
path: en004/*.tar.gz
- config_name: en005
data_files:
- split: train
path: en005/*.tar.gz
- config_name: en100
data_files:
- split: train
path: en100/*.tar.gz
- config_name: en101
data_files:
- split: train
path: en101/*.tar.gz
- config_name: en102
data_files:
- split: train
path: en102/*.tar.gz
- config_name: en103
data_files:
- split: train
path: en103/*.tar.gz
- config_name: en104
data_files:
- split: train
path: en104/*.tar.gz
- config_name: en105
data_files:
- split: train
path: en105/*.tar.gz
- config_name: en106
data_files:
- split: train
path: en106/*.tar.gz
- config_name: en107
data_files:
- split: train
path: en107/*.tar.gz
- config_name: en108
data_files:
- split: train
path: en108/*.tar.gz
- config_name: en109
data_files:
- split: train
path: en109/*.tar.gz
- config_name: en110
data_files:
- split: train
path: en110/*.tar.gz
- config_name: en111
data_files:
- split: train
path: en111/*.tar.gz
- config_name: en112
data_files:
- split: train
path: en112/*.tar.gz
- config_name: en113
data_files:
- split: train
path: en113/*.tar.gz
- config_name: en114
data_files:
- split: train
path: en114/*.tar.gz
- config_name: en115
data_files:
- split: train
path: en115/*.tar.gz
- config_name: en116
data_files:
- split: train
path: en116/*.tar.gz
- config_name: en117
data_files:
- split: train
path: en117/*.tar.gz
- config_name: en118
data_files:
- split: train
path: en118/*.tar.gz
- config_name: en119
data_files:
- split: train
path: en119/*.tar.gz
- config_name: en120
data_files:
- split: train
path: en120/*.tar.gz
- config_name: en121
data_files:
- split: train
path: en121/*.tar.gz
- config_name: en122
data_files:
- split: train
path: en122/*.tar.gz
- config_name: en123
data_files:
- split: train
path: en123/*.tar.gz
- config_name: en124
data_files:
- split: train
path: en124/*.tar.gz
- config_name: en125
data_files:
- split: train
path: en125/*.tar.gz
- config_name: en126
data_files:
- split: train
path: en126/*.tar.gz
- config_name: en127
data_files:
- split: train
path: en127/*.tar.gz
- config_name: eo000
data_files:
- split: train
path: eo000/*.tar.gz
- config_name: es000
data_files:
- split: train
path: es000/*.tar.gz
- config_name: es100
data_files:
- split: train
path: es100/*.tar.gz
- config_name: es101
data_files:
- split: train
path: es101/*.tar.gz
- config_name: es102
data_files:
- split: train
path: es102/*.tar.gz
- config_name: es103
data_files:
- split: train
path: es103/*.tar.gz
- config_name: es104
data_files:
- split: train
path: es104/*.tar.gz
- config_name: es105
data_files:
- split: train
path: es105/*.tar.gz
- config_name: es106
data_files:
- split: train
path: es106/*.tar.gz
- config_name: et000
data_files:
- split: train
path: et000/*.tar.gz
- config_name: eu000
data_files:
- split: train
path: eu000/*.tar.gz
- config_name: fa000
data_files:
- split: train
path: fa000/*.tar.gz
- config_name: ff000
data_files:
- split: train
path: ff000/*.tar.gz
- config_name: fi000
data_files:
- split: train
path: fi000/*.tar.gz
- config_name: fj000
data_files:
- split: train
path: fj000/*.tar.gz
- config_name: fo000
data_files:
- split: train
path: fo000/*.tar.gz
- config_name: fr000
data_files:
- split: train
path: fr000/*.tar.gz
- config_name: fr100
data_files:
- split: train
path: fr100/*.tar.gz
- config_name: fr101
data_files:
- split: train
path: fr101/*.tar.gz
- config_name: fr102
data_files:
- split: train
path: fr102/*.tar.gz
- config_name: fr103
data_files:
- split: train
path: fr103/*.tar.gz
- config_name: fy000
data_files:
- split: train
path: fy000/*.tar.gz
- config_name: ga000
data_files:
- split: train
path: ga000/*.tar.gz
- config_name: gd000
data_files:
- split: train
path: gd000/*.tar.gz
- config_name: gl000
data_files:
- split: train
path: gl000/*.tar.gz
- config_name: gn000
data_files:
- split: train
path: gn000/*.tar.gz
- config_name: gu000
data_files:
- split: train
path: gu000/*.tar.gz
- config_name: ha000
data_files:
- split: train
path: ha000/*.tar.gz
- config_name: hi000
data_files:
- split: train
path: hi000/*.tar.gz
- config_name: hi100
data_files:
- split: train
path: hi100/*.tar.gz
- config_name: ho000
data_files:
- split: train
path: ho000/*.tar.gz
- config_name: hr000
data_files:
- split: train
path: hr000/*.tar.gz
- config_name: ht000
data_files:
- split: train
path: ht000/*.tar.gz
- config_name: hu000
data_files:
- split: train
path: hu000/*.tar.gz
- config_name: hy000
data_files:
- split: train
path: hy000/*.tar.gz
- config_name: ia000
data_files:
- split: train
path: ia000/*.tar.gz
- config_name: id000
data_files:
- split: train
path: id000/*.tar.gz
- config_name: id100
data_files:
- split: train
path: id100/*.tar.gz
- config_name: id101
data_files:
- split: train
path: id101/*.tar.gz
- config_name: ie000
data_files:
- split: train
path: ie000/*.tar.gz
- config_name: ig000
data_files:
- split: train
path: ig000/*.tar.gz
- config_name: ik000
data_files:
- split: train
path: ik000/*.tar.gz
- config_name: is000
data_files:
- split: train
path: is000/*.tar.gz
- config_name: it000
data_files:
- split: train
path: it000/*.tar.gz
- config_name: it100
data_files:
- split: train
path: it100/*.tar.gz
- config_name: it101
data_files:
- split: train
path: it101/*.tar.gz
- config_name: iu000
data_files:
- split: train
path: iu000/*.tar.gz
- config_name: iw000
data_files:
- split: train
path: iw000/*.tar.gz
- config_name: ja000
data_files:
- split: train
path: ja000/*.tar.gz
- config_name: ja100
data_files:
- split: train
path: ja100/*.tar.gz
- config_name: jv000
data_files:
- split: train
path: jv000/*.tar.gz
- config_name: ka000
data_files:
- split: train
path: ka000/*.tar.gz
- config_name: ki000
data_files:
- split: train
path: ki000/*.tar.gz
- config_name: kk000
data_files:
- split: train
path: kk000/*.tar.gz
- config_name: kl000
data_files:
- split: train
path: kl000/*.tar.gz
- config_name: km000
data_files:
- split: train
path: km000/*.tar.gz
- config_name: kn000
data_files:
- split: train
path: kn000/*.tar.gz
- config_name: ko000
data_files:
- split: train
path: ko000/*.tar.gz
- config_name: ko100
data_files:
- split: train
path: ko100/*.tar.gz
- config_name: ko101
data_files:
- split: train
path: ko101/*.tar.gz
- config_name: ko102
data_files:
- split: train
path: ko102/*.tar.gz
- config_name: ko103
data_files:
- split: train
path: ko103/*.tar.gz
- config_name: ks000
data_files:
- split: train
path: ks000/*.tar.gz
- config_name: ku000
data_files:
- split: train
path: ku000/*.tar.gz
- config_name: ky000
data_files:
- split: train
path: ky000/*.tar.gz
- config_name: la000
data_files:
- split: train
path: la000/*.tar.gz
- config_name: lb000
data_files:
- split: train
path: lb000/*.tar.gz
- config_name: lg000
data_files:
- split: train
path: lg000/*.tar.gz
- config_name: ln000
data_files:
- split: train
path: ln000/*.tar.gz
- config_name: lo000
data_files:
- split: train
path: lo000/*.tar.gz
- config_name: lt000
data_files:
- split: train
path: lt000/*.tar.gz
- config_name: lv000
data_files:
- split: train
path: lv000/*.tar.gz
- config_name: mg000
data_files:
- split: train
path: mg000/*.tar.gz
- config_name: mi000
data_files:
- split: train
path: mi000/*.tar.gz
- config_name: mk000
data_files:
- split: train
path: mk000/*.tar.gz
- config_name: ml000
data_files:
- split: train
path: ml000/*.tar.gz
- config_name: mn000
data_files:
- split: train
path: mn000/*.tar.gz
- config_name: mr000
data_files:
- split: train
path: mr000/*.tar.gz
- config_name: ms000
data_files:
- split: train
path: ms000/*.tar.gz
- config_name: my000
data_files:
- split: train
path: my000/*.tar.gz
- config_name: na000
data_files:
- split: train
path: na000/*.tar.gz
- config_name: nd000
data_files:
- split: train
path: nd000/*.tar.gz
- config_name: ne000
data_files:
- split: train
path: ne000/*.tar.gz
- config_name: nl000
data_files:
- split: train
path: nl000/*.tar.gz
- config_name: nl100
data_files:
- split: train
path: nl100/*.tar.gz
- config_name: no000
data_files:
- split: train
path: no000/*.tar.gz
- config_name: nv000
data_files:
- split: train
path: nv000/*.tar.gz
- config_name: oc000
data_files:
- split: train
path: oc000/*.tar.gz
- config_name: om000
data_files:
- split: train
path: om000/*.tar.gz
- config_name: or000
data_files:
- split: train
path: or000/*.tar.gz
- config_name: pa000
data_files:
- split: train
path: pa000/*.tar.gz
- config_name: pl000
data_files:
- split: train
path: pl000/*.tar.gz
- config_name: ps000
data_files:
- split: train
path: ps000/*.tar.gz
- config_name: pt000
data_files:
- split: train
path: pt000/*.tar.gz
- config_name: pt100
data_files:
- split: train
path: pt100/*.tar.gz
- config_name: pt101
data_files:
- split: train
path: pt101/*.tar.gz
- config_name: pt102
data_files:
- split: train
path: pt102/*.tar.gz
- config_name: pt103
data_files:
- split: train
path: pt103/*.tar.gz
- config_name: qu000
data_files:
- split: train
path: qu000/*.tar.gz
- config_name: rm000
data_files:
- split: train
path: rm000/*.tar.gz
- config_name: rn000
data_files:
- split: train
path: rn000/*.tar.gz
- config_name: ro000
data_files:
- split: train
path: ro000/*.tar.gz
- config_name: ru000
data_files:
- split: train
path: ru000/*.tar.gz
- config_name: ru001
data_files:
- split: train
path: ru001/*.tar.gz
- config_name: ru100
data_files:
- split: train
path: ru100/*.tar.gz
- config_name: ru101
data_files:
- split: train
path: ru101/*.tar.gz
- config_name: ru102
data_files:
- split: train
path: ru102/*.tar.gz
- config_name: ru103
data_files:
- split: train
path: ru103/*.tar.gz
- config_name: ru104
data_files:
- split: train
path: ru104/*.tar.gz
- config_name: ru105
data_files:
- split: train
path: ru105/*.tar.gz
- config_name: ru106
data_files:
- split: train
path: ru106/*.tar.gz
- config_name: rw000
data_files:
- split: train
path: rw000/*.tar.gz
- config_name: sa000
data_files:
- split: train
path: sa000/*.tar.gz
- config_name: sc000
data_files:
- split: train
path: sc000/*.tar.gz
- config_name: sd000
data_files:
- split: train
path: sd000/*.tar.gz
- config_name: sg000
data_files:
- split: train
path: sg000/*.tar.gz
- config_name: sh000
data_files:
- split: train
path: sh000/*.tar.gz
- config_name: si000
data_files:
- split: train
path: si000/*.tar.gz
- config_name: sk000
data_files:
- split: train
path: sk000/*.tar.gz
- config_name: sl000
data_files:
- split: train
path: sl000/*.tar.gz
- config_name: sm000
data_files:
- split: train
path: sm000/*.tar.gz
- config_name: sn000
data_files:
- split: train
path: sn000/*.tar.gz
- config_name: so000
data_files:
- split: train
path: so000/*.tar.gz
- config_name: sq000
data_files:
- split: train
path: sq000/*.tar.gz
- config_name: sr000
data_files:
- split: train
path: sr000/*.tar.gz
- config_name: st000
data_files:
- split: train
path: st000/*.tar.gz
- config_name: su000
data_files:
- split: train
path: su000/*.tar.gz
- config_name: sv000
data_files:
- split: train
path: sv000/*.tar.gz
- config_name: sw000
data_files:
- split: train
path: sw000/*.tar.gz
- config_name: ta000
data_files:
- split: train
path: ta000/*.tar.gz
- config_name: te000
data_files:
- split: train
path: te000/*.tar.gz
- config_name: tg000
data_files:
- split: train
path: tg000/*.tar.gz
- config_name: th000
data_files:
- split: train
path: th000/*.tar.gz
- config_name: th100
data_files:
- split: train
path: th100/*.tar.gz
- config_name: ti000
data_files:
- split: train
path: ti000/*.tar.gz
- config_name: tk000
data_files:
- split: train
path: tk000/*.tar.gz
- config_name: tn000
data_files:
- split: train
path: tn000/*.tar.gz
- config_name: to000
data_files:
- split: train
path: to000/*.tar.gz
- config_name: tr000
data_files:
- split: train
path: tr000/*.tar.gz
- config_name: tr100
data_files:
- split: train
path: tr100/*.tar.gz
- config_name: ts000
data_files:
- split: train
path: ts000/*.tar.gz
- config_name: tt000
data_files:
- split: train
path: tt000/*.tar.gz
- config_name: ug000
data_files:
- split: train
path: ug000/*.tar.gz
- config_name: uk000
data_files:
- split: train
path: uk000/*.tar.gz
- config_name: uk100
data_files:
- split: train
path: uk100/*.tar.gz
- config_name: ur000
data_files:
- split: train
path: ur000/*.tar.gz
- config_name: uz000
data_files:
- split: train
path: uz000/*.tar.gz
- config_name: ve000
data_files:
- split: train
path: ve000/*.tar.gz
- config_name: vi000
data_files:
- split: train
path: vi000/*.tar.gz
- config_name: vi100
data_files:
- split: train
path: vi100/*.tar.gz
- config_name: vo000
data_files:
- split: train
path: vo000/*.tar.gz
- config_name: wo000
data_files:
- split: train
path: wo000/*.tar.gz
- config_name: xh000
data_files:
- split: train
path: xh000/*.tar.gz
- config_name: yi000
data_files:
- split: train
path: yi000/*.tar.gz
- config_name: yo000
data_files:
- split: train
path: yo000/*.tar.gz
- config_name: zh000
data_files:
- split: train
path: zh000/*.tar.gz
- config_name: zu000
data_files:
- split: train
path: zu000/*.tar.gz
---
# YODAS2-Sidon
## Overview
This dataset is a **cleansed version of YODAS-2**, processed with the **Sidon** speech restoration model and intended for **speech synthesis** and **spoken language modeling**.
YODAS-2 is a massive, multilingual YouTube-derived dataset. We applied the Sidon restoration model to remove background noise and enhance audio quality, making the audio suitable for high-quality generation tasks.
We resampled the original Sidon output to 24 kHz due to storage constraints.
The dataset is provided in **[WebDataset](https://github.com/webdataset/webdataset) format** for efficient large-scale training.
- **Source**: [YODAS-2 (YouTube-Oriented Dataset for Audio and Speech)](https://huggingface.co/datasets/espnet/yodas2)
- **Format**: WebDataset (`.tar.gz` shards)
- **License**: [CC-BY-3.0](https://creativecommons.org/licenses/by/3.0/)
---
## Dataset Structure
Each sample in the dataset contains:
- **`flac`** — audio file (24 kHz, single channel, restored)
- **`metadata.json`** *(optional)* — metadata including language, YouTube video ID, and transcription
Example (inside a `.tar.gz` shard):
```
000001.flac
000001.metadata.json
000002.flac
000002.metadata.json
...
```
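The layout above means that files sharing the part of the name before the first dot belong to one sample. As a minimal sketch, assuming you have a shard available locally, the standard-library `tarfile` module can group the members by that key. The helper names (`group_shard_members`, `_demo_shard`), the placeholder bytes, and the `language` metadata field are illustrative assumptions, not part of the dataset's tooling.

```python
import io
import json
import tarfile
from collections import defaultdict


def group_shard_members(fileobj):
    """Group the files of one .tar.gz shard by sample key.

    Returns {key: {extension: raw_bytes}}.
    """
    samples = defaultdict(dict)
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split the base name at its first dot:
            # "000001.metadata.json" -> key "000001", extension "metadata.json"
            name = member.name.rsplit("/", 1)[-1]
            key, _, ext = name.partition(".")
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


def _demo_shard():
    """Build a tiny in-memory shard with placeholder bytes (not real audio)."""
    files = {
        "000001.flac": b"fake-audio-1",
        # Hypothetical metadata field name, for illustration only.
        "000001.metadata.json": json.dumps({"language": "en"}).encode(),
        "000002.flac": b"fake-audio-2",  # metadata is optional
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


samples = group_shard_members(_demo_shard())
for key, parts in sorted(samples.items()):
    meta = json.loads(parts["metadata.json"]) if "metadata.json" in parts else None
    print(key, sorted(parts), meta)
```

For a real shard, pass `open(path, "rb")` instead of the demo buffer. Because `metadata.json` is optional, code that consumes the grouped samples should check for the key before reading it.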
---
## How to Use
### With 🤗 Datasets
You can load the WebDataset directly with Hugging Face’s `datasets` library:
```python
from datasets import load_dataset
from huggingface_hub import list_repo_files
from IPython.display import Audio  # only needed for notebook playback

repo_id = "sarulab-speech/yodas2_sidon"
subset = "en000"

# Collect the URLs of every shard stored under the chosen subset directory.
all_files = list_repo_files(repo_id, repo_type="dataset")
urls = [
    f"https://huggingface.co/datasets/{repo_id}/resolve/main/{f}"
    for f in sorted(all_files)
    if f.endswith(".tar.gz") and f.startswith(f"{subset}/")
]
print(f"Found {len(urls)} shards.")

# Stream the shards instead of downloading the whole subset up front.
dataset = load_dataset(
    "webdataset",
    data_files={"train": urls},
    streaming=True,
)["train"]

# Inspect the first sample: print its metadata (optional) and play the audio.
sample = next(iter(dataset))
audio = sample["flac"]
print(sample.get("metadata.json"))
Audio(audio["array"], rate=audio["sampling_rate"])
```
Set `subset` to the name of the configuration you want to load, for example `en000` or `ja100`.
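To see which subsets exist before choosing one, the file listing can be grouped by language. The sketch below assumes that every subset directory is named with a two-letter language code followed by a three-digit index, as in the `configs` list of this card. The helper names and the sample file names are hypothetical, and real shard file names may differ.

```python
from collections import defaultdict


def shards_for_subset(files, subset):
    """Return the sorted .tar.gz paths stored directly under `subset/`."""
    prefix = subset + "/"
    return sorted(f for f in files if f.startswith(prefix) and f.endswith(".tar.gz"))


def subsets_by_language(files):
    """Map each language code to its subset names, based on the directory part of each path."""
    grouped = defaultdict(set)
    for f in files:
        subset, sep, _ = f.partition("/")
        if sep and f.endswith(".tar.gz"):
            # Assumption: the first two characters are the language code.
            grouped[subset[:2]].add(subset)
    return {lang: sorted(names) for lang, names in grouped.items()}


# Hypothetical listing, standing in for the result of list_repo_files(...).
files = [
    "README.md",
    "en000/00000000.tar.gz",
    "en000/00000001.tar.gz",
    "en100/00000000.tar.gz",
    "ja000/00000000.tar.gz",
]
print(shards_for_subset(files, "en000"))
print(subsets_by_language(files))
```

Matching on `subset + "/"` rather than on the bare subset name keeps a partial name such as `en0` from selecting shards of several subsets at once.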
-----
## Citation
If you use this dataset, please cite Sidon and the original YODAS paper:
```
@misc{nakata2025sidonfastrobustopensource,
title={Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing},
author={Wataru Nakata and Yuki Saito and Yota Ueda and Hiroshi Saruwatari},
year={2025},
eprint={2509.17052},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2509.17052},
}
```
```
@inproceedings{li2023yodas,
title={Yodas: Youtube-Oriented Dataset for Audio and Speech},
author={Li, Xinjian and Takamichi, Shinnosuke and Saeki, Takaaki and Chen, William and Shiota, Sayaka and Watanabe, Shinji},
booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
pages={1--8},
year={2023},
organization={IEEE}
}
```
-----
## License
This dataset is released under [CC-BY-3.0](https://creativecommons.org/licenses/by/3.0/).
-----
## Acknowledgements
* **Original data**: [YODAS2](https://huggingface.co/datasets/espnet/yodas2)
Provided by:
suryatmodulus



