
TTS-AGI/commonvoice22-sidon-dacvae

Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/TTS-AGI/commonvoice22-sidon-dacvae
Dataset description:
---
dataset_info:
  source: sarulab-speech/commonvoice22_sidon
  format: WebDataset tar shards with DAC VAE latents
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
- text-to-speech
---

# CommonVoice 22 (Sidon-enhanced) converted to DAC VAE latents

## Source

[sarulab-speech/commonvoice22_sidon](https://huggingface.co/datasets/sarulab-speech/commonvoice22_sidon)

## Format

Each tar shard (~2 GB) stores three files per sample:

```
{sample_key}.audio.flac      # Original audio (FLAC, original sample rate)
{sample_key}.dacvae.npy      # DAC VAE latent [T_latent, 128] numpy float32
{sample_key}.metadata.json   # All metadata + duration_seconds + chars_per_second
```

### DAC VAE Latent Format

- **Model**: [mrfakename/dacvae-watermarked](https://huggingface.co/mrfakename/dacvae-watermarked) (Facebook DACVAE)
- **Input sample rate**: 48,000 Hz (audio is resampled before encoding)
- **Latent shape**: `[T_latent, 128]` where `T_latent = ceil(audio_samples / 1920)`
- **Latent rate**: 25 frames/second (48,000 / 1,920)
- **Storage**: numpy float32

### Shard Naming

`{LANG}-{split}-{index:05d}.tar` (e.g., `EN-train-00000.tar`, `DE-train-00001.tar`)

## Loading

### With WebDataset

```python
import io
import json

import numpy as np
import soundfile as sf
import webdataset as wds

url = "https://huggingface.co/datasets/TTS-AGI/commonvoice22-sidon-dacvae/resolve/main/EN-train-00000.tar"

# Iterate over raw bytes. Calling .decode() here would already parse the
# .npy and .json entries, and the manual parsing below would then fail.
dataset = wds.WebDataset(url)

for sample in dataset:
    audio_bytes = sample["audio.flac"]
    audio, sample_rate = sf.read(io.BytesIO(audio_bytes))  # waveform at the original sample rate
    latent = np.load(io.BytesIO(sample["dacvae.npy"]))  # [T_latent, 128]
    meta = json.loads(sample["metadata.json"])
    print(f"Text: {meta['text']}, Duration: {meta['duration_seconds']}s, CPS: {meta['chars_per_second']}")
```

### Decoding Latents Back to Audio

```python
import numpy as np
import torch
from dacvae import DACVAE
from huggingface_hub import hf_hub_download

model = DACVAE.load(hf_hub_download("mrfakename/dacvae-watermarked", "weights.pth")).cuda().eval()

latent = np.load("sample.dacvae.npy")  # [T_latent, 128]
z = torch.from_numpy(latent.T).unsqueeze(0).cuda()  # [1, 128, T_latent]

# Inference only, so skip gradient tracking.
with torch.no_grad():
    audio_48k = model.decode(z).squeeze(0).cpu()  # [1, T_audio] at 48 kHz
```

## Current Status

**Shards uploaded**: 935

### Progress by Language

| Language and split | Samples |
|----------|---------|
| AB_train | 21,037 |
| AF_train | 139 |
| AM_train | 523 |
| AR_train | 28,531 |
| AS_train | 1,386 |
| AZ_train | 157 |
| BA_train | 121,197 |
| BE_train | 347,672 |
| BG_train | 4,952 |
| BN_train | 21,514 |
| BR_train | 3,510 |
| CA_train | 1,158,926 |
| CK_train | 7,878 |
| CN_train | 818 |
| CS_train | 21,731 |
| CV_train | 1,456 |
| CY_train | 8,014 |
| DA_train | 5,699 |
| DE_train | 607,871 |
| DY_train | 88 |
| EL_train | 1,934 |
| EN_train | 1,138,759 |
| EO_train | 128,103 |
| ES_train | 353,699 |
| ET_train | 3,402 |
| EU_train | 130,043 |
| FA_train | 29,789 |
| FI_train | 2,093 |
| FR_train | 593,066 |
| FY_train | 3,924 |
| GA_train | 546 |
| GL_train | 70,039 |
| GN_train | 1,641 |
| HA_train | 1,908 |
| HE_train | 1,011 |
| HI_train | 4,869 |
| HS_train | 809 |
| HT_train | 11 |
| HU_train | 39,270 |
| HY_train | 9,302 |
| IA_train | 4,909 |
| ID_train | 4,973 |
| IG_train | 9 |
| IS_train | 17 |
| IT_train | 172,828 |
| JA_train | 15,425 |
| KA_train | 215,015 |
| KK_train | 605 |
| KL_train | 11,064 |
| KO_train | 519 |
| KY_train | 1,790 |
| LG_train | 64,144 |
| LI_train | 2,304 |
| LO_train | 98 |
| LT_train | 12,895 |
| LU_train | 4,498 |
| LV_train | 4,410 |
| MD_train | 175 |
| MH_train | 186,565 |
| MK_train | 2,049 |
| ML_train | 1,235 |
| MN_train | 2,193 |
| MR_train | 16,514 |
| MT_train | 1,910 |
| MY_train | 1,241 |
| NA_train | 11,608 |
| NB_train | 227 |
| NE_train | 353 |
| NH_train | 23 |
| NL_train | 43,458 |
| NN_train | 464 |
| NS_train | 2 |
| OC_train | 304 |
| OR_train | 2,151 |
| OS_train | 414 |
| PA_train | 800 |
| PL_train | 24,173 |
| PS_train | 4,611 |
| PT_train | 22,923 |
| QU_train | 26 |
| RM_train | 2,148 |
| RO_train | 5,178 |
| RU_train | 26,654 |
| RW_train | 1,003,029 |
| SA_train | 2,528 |
| SC_train | 925 |
| SD_train | 271 |
| SK_train | 8,910 |
| SL_train | 1,469 |
| SQ_train | 2,658 |
| SR_train | 2,336 |
| SV_train | 8,150 |
| SW_train | 46,534 |
| TA_train | 46,390 |
| TE_train | 69 |
| TG_train | 123 |
| TH_train | 32,959 |
| TI_train | 2,010 |
| TK_train | 741 |
| TN_train | 1,078 |
| TO_train | 2,630 |
| TR_train | 40,377 |
| TT_train | 8,871 |
| TW_train | 205 |
| UG_train | 107,646 |
| UK_train | 26,773 |
| UR_train | 7,326 |
| UZ_train | 48,733 |
| VI_train | 2,104 |
| VO_train | 96 |
| XH_train | 7 |
| YI_train | 320 |
| YO_train | 1,404 |
| YU_train | 7,419 |
| ZG_train | 842 |
| ZH_train | 45,246 |
| ZU_train | 12 |
| ZZ_train | 734 |

## Metadata Fields

Each `metadata.json` contains:

- `dataset`: Source dataset name
- `language`: Language code
- `split`: Data split (train/dev/test)
- `sample_id`: Original sample identifier
- `text`: Transcript
- `duration_seconds`: Audio duration in seconds
- `chars_per_second`: Text characters per second of audio
- `original_sample_rate`: Original audio sample rate
- `dacvae_sample_rate`: 48000 (DAC VAE input rate)
- `latent_frames`: Number of latent time frames
- Plus all original dataset-specific fields

---

Generated with [Claude Code](https://claude.com/claude-code)
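The frame-count rule and the shard naming pattern stated in the card can be sketched in a few lines of Python. This is a minimal sketch, not the dataset's own tooling: the helper names `expected_latent_frames`, `shard_name`, and `shard_url` are hypothetical, and it assumes that `audio_samples` in the card's formula counts samples after resampling to 48 kHz and that language codes are upper-case in file names, as in the card's examples.

```python
DACVAE_SAMPLE_RATE = 48_000  # encoder input rate stated in the card
SAMPLES_PER_FRAME = 1_920    # divisor in T_latent = ceil(audio_samples / 1920)

# Base URL taken from the loading example in the card.
BASE_URL = "https://huggingface.co/datasets/TTS-AGI/commonvoice22-sidon-dacvae/resolve/main"


def expected_latent_frames(duration_seconds: float) -> int:
    """Estimate T_latent for a clip (hypothetical helper).

    Assumes the sample count is measured after resampling to 48 kHz.
    """
    audio_samples = round(duration_seconds * DACVAE_SAMPLE_RATE)
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-audio_samples // SAMPLES_PER_FRAME)


def shard_name(lang: str, split: str, index: int) -> str:
    """Build a shard file name following {LANG}-{split}-{index:05d}.tar."""
    return f"{lang.upper()}-{split}-{index:05d}.tar"


def shard_url(lang: str, split: str, index: int) -> str:
    """Build a download URL for one shard (hypothetical helper)."""
    return f"{BASE_URL}/{shard_name(lang, split, index)}"


if __name__ == "__main__":
    print(expected_latent_frames(1.0))
    print(shard_name("en", "train", 0))
    print(shard_url("DE", "train", 1))
```

If the assumption about resampling holds, comparing `expected_latent_frames(meta["duration_seconds"])` with `meta["latent_frames"]` and with the first dimension of the loaded `.dacvae.npy` array gives a quick consistency check; small off-by-one differences could arise from how the duration was rounded.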
Provided by:
TTS-AGI