TTS-AGI/commonvoice22-sidon-dacvae
Hugging Face · Updated 2026-03-22 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/TTS-AGI/commonvoice22-sidon-dacvae
Resource description:
---
dataset_info:
  source: sarulab-speech/commonvoice22_sidon
  format: WebDataset tar shards with DAC VAE latents
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
- text-to-speech
---
# CommonVoice 22 (Sidon-enhanced) converted to DAC VAE latents
## Source
[sarulab-speech/commonvoice22_sidon](https://huggingface.co/datasets/sarulab-speech/commonvoice22_sidon)
## Format
Each tar shard (~2 GB) stores three files per sample:
```
{sample_key}.audio.flac # Original audio (FLAC, original sample rate)
{sample_key}.dacvae.npy # DAC VAE latent [T_latent, 128] numpy float32
{sample_key}.metadata.json # All metadata + duration_seconds + chars_per_second
```
### DAC VAE Latent Format
- **Model**: [mrfakename/dacvae-watermarked](https://huggingface.co/mrfakename/dacvae-watermarked) (Facebook DACVAE)
- **Input sample rate**: 48,000 Hz (audio resampled before encoding)
- **Latent shape**: `[T_latent, 128]` where `T_latent = ceil(audio_samples / 1920)` and `audio_samples` is the sample count at 48 kHz
- **Latent rate**: 25 frames/second (48,000 / 1,920)
- **Storage**: numpy float32
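The frame arithmetic in this list can be checked with a few lines of Python. This is a minimal sketch: the helper name `latent_frames` is hypothetical, and it assumes the sample count is `round(duration * 48000)` after resampling, so the real encoder may differ by a frame at clip boundaries.

```python
import math

DACVAE_SAMPLE_RATE = 48_000  # input rate stated in this card
HOP_SAMPLES = 1_920          # 48,000 / 1,920 = 25 latent frames per second


def latent_frames(duration_seconds: float) -> int:
    """Estimate T_latent for a clip of the given duration (hypothetical helper)."""
    # Assumption: the clip is resampled to 48 kHz before counting samples.
    audio_samples = round(duration_seconds * DACVAE_SAMPLE_RATE)
    return math.ceil(audio_samples / HOP_SAMPLES)


print(latent_frames(1.0))  # 25
print(latent_frames(2.5))  # 63 (120,000 samples / 1,920 = 62.5, rounded up)
```

An estimate like this is mainly useful for budgeting sequence lengths before downloading any shards.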
### Shard Naming
`{LANG}-{split}-{index:05d}.tar` (e.g., `EN-train-00000.tar`, `DE-train-00001.tar`)
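The naming pattern can be turned into shard URLs programmatically. The sketch below is illustrative only: `shard_name` and `shard_urls` are hypothetical helpers, the base URL is the `resolve/main` prefix used in the loading example of this card, and the number of shards per language is not stated here, so `count` has to be supplied by the caller.

```python
REPO_URL = "https://huggingface.co/datasets/TTS-AGI/commonvoice22-sidon-dacvae/resolve/main"


def shard_name(lang: str, split: str, index: int) -> str:
    """Format one shard file name as {LANG}-{split}-{index:05d}.tar."""
    return f"{lang.upper()}-{split}-{index:05d}.tar"


def shard_urls(lang: str, split: str, count: int) -> list[str]:
    """Build URLs for the first `count` shards (count is an assumption of the caller)."""
    return [f"{REPO_URL}/{shard_name(lang, split, i)}" for i in range(count)]


print(shard_name("en", "train", 0))        # EN-train-00000.tar
print(shard_urls("DE", "train", 2)[1])     # ends with DE-train-00001.tar
```

The resulting list can be passed wherever a single shard URL is accepted by a loader that takes multiple shards.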
## Loading
### With WebDataset
```python
import webdataset as wds
import numpy as np
import json
import soundfile as sf
import io

# First English training shard.
url = "https://huggingface.co/datasets/TTS-AGI/commonvoice22-sidon-dacvae/resolve/main/EN-train-00000.tar"
# Samples are left undecoded (raw bytes) so the manual parsing below applies.
dataset = wds.WebDataset(url)
for sample in dataset:
    audio_bytes = sample["audio.flac"]
    audio, sample_rate = sf.read(io.BytesIO(audio_bytes))  # waveform at the original sample rate
    latent = np.load(io.BytesIO(sample["dacvae.npy"]))  # [T, 128]
    meta = json.loads(sample["metadata.json"])
    print(f"Text: {meta['text']}, Duration: {meta['duration_seconds']}s, CPS: {meta['chars_per_second']}")
```
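Because `T` differs from sample to sample, latents usually need padding before they can be batched. The following is a minimal sketch of one way to do that with NumPy; `pad_latents` is a hypothetical helper, not part of this dataset's tooling, and zero-padding with a boolean mask is just one common choice.

```python
import numpy as np


def pad_latents(latents):
    """Stack variable-length [T, 128] latents into [B, T_max, 128] plus a validity mask."""
    lengths = np.array([z.shape[0] for z in latents], dtype=np.int64)
    t_max = int(lengths.max())
    batch = np.zeros((len(latents), t_max, latents[0].shape[1]), dtype=np.float32)
    for i, z in enumerate(latents):
        batch[i, : z.shape[0]] = z  # copy real frames; the tail stays zero
    # mask[b, t] is True where frame t of item b holds real data
    mask = np.arange(t_max)[None, :] < lengths[:, None]
    return batch, mask


# Two dummy latents of 3 and 5 frames
batch, mask = pad_latents([np.ones((3, 128), np.float32), np.ones((5, 128), np.float32)])
print(batch.shape, mask.sum())  # (2, 5, 128) 8
```

The mask lets a training loss ignore padded frames rather than treating zeros as real latents.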
### Decoding Latents Back to Audio
```python
from dacvae import DACVAE
from huggingface_hub import hf_hub_download
import numpy as np
import torch

# Download the checkpoint and load it on the GPU in eval mode.
model = DACVAE.load(hf_hub_download("mrfakename/dacvae-watermarked", "weights.pth")).cuda().eval()
latent = np.load("sample.dacvae.npy")  # [T_latent, 128]
z = torch.from_numpy(latent.T).unsqueeze(0).cuda()  # [1, 128, T_latent]
with torch.no_grad():  # inference only, no autograd graph needed
    audio_48k = model.decode(z).squeeze(0).cpu()  # [1, T_audio] at 48 kHz
```
## Current Status
**Shards uploaded**: 935
### Progress by Language
| Language and split | Samples |
|----------|---------|
| AB_train | 21,037 |
| AF_train | 139 |
| AM_train | 523 |
| AR_train | 28,531 |
| AS_train | 1,386 |
| AZ_train | 157 |
| BA_train | 121,197 |
| BE_train | 347,672 |
| BG_train | 4,952 |
| BN_train | 21,514 |
| BR_train | 3,510 |
| CA_train | 1,158,926 |
| CK_train | 7,878 |
| CN_train | 818 |
| CS_train | 21,731 |
| CV_train | 1,456 |
| CY_train | 8,014 |
| DA_train | 5,699 |
| DE_train | 607,871 |
| DY_train | 88 |
| EL_train | 1,934 |
| EN_train | 1,138,759 |
| EO_train | 128,103 |
| ES_train | 353,699 |
| ET_train | 3,402 |
| EU_train | 130,043 |
| FA_train | 29,789 |
| FI_train | 2,093 |
| FR_train | 593,066 |
| FY_train | 3,924 |
| GA_train | 546 |
| GL_train | 70,039 |
| GN_train | 1,641 |
| HA_train | 1,908 |
| HE_train | 1,011 |
| HI_train | 4,869 |
| HS_train | 809 |
| HT_train | 11 |
| HU_train | 39,270 |
| HY_train | 9,302 |
| IA_train | 4,909 |
| ID_train | 4,973 |
| IG_train | 9 |
| IS_train | 17 |
| IT_train | 172,828 |
| JA_train | 15,425 |
| KA_train | 215,015 |
| KK_train | 605 |
| KL_train | 11,064 |
| KO_train | 519 |
| KY_train | 1,790 |
| LG_train | 64,144 |
| LI_train | 2,304 |
| LO_train | 98 |
| LT_train | 12,895 |
| LU_train | 4,498 |
| LV_train | 4,410 |
| MD_train | 175 |
| MH_train | 186,565 |
| MK_train | 2,049 |
| ML_train | 1,235 |
| MN_train | 2,193 |
| MR_train | 16,514 |
| MT_train | 1,910 |
| MY_train | 1,241 |
| NA_train | 11,608 |
| NB_train | 227 |
| NE_train | 353 |
| NH_train | 23 |
| NL_train | 43,458 |
| NN_train | 464 |
| NS_train | 2 |
| OC_train | 304 |
| OR_train | 2,151 |
| OS_train | 414 |
| PA_train | 800 |
| PL_train | 24,173 |
| PS_train | 4,611 |
| PT_train | 22,923 |
| QU_train | 26 |
| RM_train | 2,148 |
| RO_train | 5,178 |
| RU_train | 26,654 |
| RW_train | 1,003,029 |
| SA_train | 2,528 |
| SC_train | 925 |
| SD_train | 271 |
| SK_train | 8,910 |
| SL_train | 1,469 |
| SQ_train | 2,658 |
| SR_train | 2,336 |
| SV_train | 8,150 |
| SW_train | 46,534 |
| TA_train | 46,390 |
| TE_train | 69 |
| TG_train | 123 |
| TH_train | 32,959 |
| TI_train | 2,010 |
| TK_train | 741 |
| TN_train | 1,078 |
| TO_train | 2,630 |
| TR_train | 40,377 |
| TT_train | 8,871 |
| TW_train | 205 |
| UG_train | 107,646 |
| UK_train | 26,773 |
| UR_train | 7,326 |
| UZ_train | 48,733 |
| VI_train | 2,104 |
| VO_train | 96 |
| XH_train | 7 |
| YI_train | 320 |
| YO_train | 1,404 |
| YU_train | 7,419 |
| ZG_train | 842 |
| ZH_train | 45,246 |
| ZU_train | 12 |
| ZZ_train | 734 |
## Metadata Fields
Each `metadata.json` contains:
- `dataset`: Source dataset name
- `language`: Language code
- `split`: Data split (train/dev/test)
- `sample_id`: Original sample identifier
- `text`: Transcript
- `duration_seconds`: Audio duration in seconds
- `chars_per_second`: Text characters per second of audio
- `original_sample_rate`: Original audio sample rate
- `dacvae_sample_rate`: 48000 (DAC VAE input rate)
- `latent_frames`: Number of latent time frames
- Plus all original dataset-specific fields
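These fields make simple sanity checks possible before training. The sketch below is an illustration under stated assumptions: `check_metadata` is a hypothetical helper, `chars_per_second` is assumed to equal `len(text) / duration_seconds`, and `latent_frames` is assumed to follow the `ceil(samples / 1920)` rule from the latent format section, with one frame of tolerance for rounding.

```python
import math

HOP_SAMPLES = 1_920  # samples per latent frame at 48 kHz, per this card


def check_metadata(meta: dict, tolerance_frames: int = 1) -> list[str]:
    """Return a list of human-readable inconsistencies found in one metadata dict."""
    problems = []
    duration = meta["duration_seconds"]

    # Assumption: latent_frames ~= ceil(duration * dacvae_sample_rate / 1920)
    expected = math.ceil(duration * meta["dacvae_sample_rate"] / HOP_SAMPLES)
    if abs(meta["latent_frames"] - expected) > tolerance_frames:
        problems.append(f"latent_frames={meta['latent_frames']} but expected about {expected}")

    # Assumption: chars_per_second = len(text) / duration_seconds
    if duration > 0:
        cps = len(meta["text"]) / duration
        if not math.isclose(cps, meta["chars_per_second"], rel_tol=0.05):
            problems.append(f"chars_per_second={meta['chars_per_second']} but text implies {cps:.2f}")
    return problems


example = {
    "text": "hello world",          # 11 characters
    "duration_seconds": 2.0,
    "chars_per_second": 5.5,
    "dacvae_sample_rate": 48000,
    "latent_frames": 50,
}
print(check_metadata(example))  # []
```

A non-empty result does not necessarily mean the sample is bad; it only flags records where these assumed relationships do not hold.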
---
Generated with [Claude Code](https://claude.com/claude-code)
Provided by:
TTS-AGI



