TTS-AGI/mls-enhanced-dacvae
Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/TTS-AGI/mls-enhanced-dacvae
Resource description:
---
dataset_info:
source: facebook/multilingual_librispeech
format: WebDataset tar shards with DAC VAE latents
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
- text-to-speech
---
# Multilingual LibriSpeech converted to DAC VAE latents
## Source
[facebook/multilingual_librispeech](https://huggingface.co/datasets/facebook/multilingual_librispeech)
## Format
Each tar shard (~2GB) contains samples with three files per sample:
```
{sample_key}.audio.flac # Original audio (FLAC, original sample rate)
{sample_key}.dacvae.npy # DAC VAE latent [T_latent, 128] numpy float32
{sample_key}.metadata.json # All metadata + duration_seconds + chars_per_second
```
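WebDataset groups tar members into samples by splitting each file name at the first dot of its base name, which is why the three files above appear under the keys `audio.flac`, `dacvae.npy`, and `metadata.json`. A minimal sketch of that grouping follows; the helper names and the sample key `abc_001` are hypothetical and used only for illustration.

```python
from collections import defaultdict

def split_member_name(name: str) -> tuple[str, str]:
    """Split a tar member name into (sample_key, extension) at the first dot of the base name."""
    directory, slash, base = name.rpartition("/")
    stem, _, ext = base.partition(".")
    return directory + slash + stem, ext

def group_by_sample(names: list[str]) -> dict[str, list[str]]:
    """Collect the extensions found for each sample key, preserving order."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        key, ext = split_member_name(name)
        groups[key].append(ext)
    return dict(groups)

# Hypothetical member names following the layout described above
members = ["abc_001.audio.flac", "abc_001.dacvae.npy", "abc_001.metadata.json"]
groups = group_by_sample(members)
```

A sample is complete when its group contains all three extensions, which makes this a cheap check to run over a shard's member list before training.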
### DAC VAE Latent Format
- **Model**: [mrfakename/dacvae-watermarked](https://huggingface.co/mrfakename/dacvae-watermarked) (Facebook DACVAE)
- **Input sample rate**: 48,000 Hz (audio resampled before encoding)
- **Latent shape**: `[T_latent, 128]` where `T_latent = ceil(audio_samples / 1920)` (1,920 samples per latent frame, i.e. 48,000 / 25)
- **Latent rate**: 25 frames/second
- **Storage**: numpy float32
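The frame-count formula above can be turned into a quick sanity check. The sketch below assumes that `audio_samples` is counted at the 48,000 Hz encoder input rate (after resampling) and approximates it from the duration in seconds; the function name is hypothetical, and the actual resampler may differ by a sample at the boundary.

```python
import math

DACVAE_SAMPLE_RATE = 48_000  # encoder input rate stated in this card
SAMPLES_PER_FRAME = 1920     # 48,000 Hz / 25 frames per second

def expected_latent_frames(duration_seconds: float) -> int:
    """Estimate T_latent = ceil(audio_samples / 1920) from a duration.

    Assumption: audio_samples is the sample count at 48 kHz.
    """
    audio_samples = round(duration_seconds * DACVAE_SAMPLE_RATE)
    return math.ceil(audio_samples / SAMPLES_PER_FRAME)

frames_for_ten_seconds = expected_latent_frames(10.0)  # 480,000 samples / 1920 = 250 frames
```

Comparing this estimate against the first dimension of a loaded latent is one way to spot truncated or mismatched samples.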
### Shard Naming
`{LANG}-{split}-{index:05d}.tar` (e.g., `EN-train-00000.tar`, `DE-train-00001.tar`)
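Shard URLs can be generated from this naming pattern. The sketch below is an illustration under assumptions: the `resolve/main` base URL is taken from the loading example in this card, the helper names are hypothetical, and it does not check that a given shard has actually been uploaded.

```python
import re

BASE_URL = "https://huggingface.co/datasets/TTS-AGI/mls-enhanced-dacvae/resolve/main"
SHARD_PATTERN = re.compile(r"^(?P<lang>[A-Z]+)-(?P<split>[a-z]+)-(?P<index>\d{5})\.tar$")

def shard_name(lang: str, split: str, index: int) -> str:
    """Format a shard file name as {LANG}-{split}-{index:05d}.tar."""
    return f"{lang}-{split}-{index:05d}.tar"

def shard_url(lang: str, split: str, index: int) -> str:
    """Build the download URL for one shard (existence is not verified)."""
    return f"{BASE_URL}/{shard_name(lang, split, index)}"

def parse_shard_name(name: str) -> tuple[str, str, int]:
    """Recover (lang, split, index) from a shard file name."""
    match = SHARD_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a shard name: {name!r}")
    return match["lang"], match["split"], int(match["index"])

# First three German training shards, following the pattern above
de_urls = [shard_url("DE", "train", i) for i in range(3)]
```

A list like `de_urls` can be passed to `wds.WebDataset` in place of a single URL.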
## Loading
### With WebDataset
```python
import webdataset as wds
import numpy as np
import json
import soundfile as sf
import io
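url = "https://huggingface.co/datasets/TTS-AGI/mls-enhanced-dacvae/resolve/main/EN-train-00000.tar"
# No .decode() here: WebDataset then yields raw bytes, which are parsed explicitly below.
# (With .decode(), the .npy and .json entries would already be decoded objects.)
dataset = wds.WebDataset(url)
for sample in dataset:
    audio_bytes = sample["audio.flac"]
    audio, sample_rate = sf.read(io.BytesIO(audio_bytes))  # waveform at the original sample rate
    latent = np.load(io.BytesIO(sample["dacvae.npy"]))  # [T, 128]
    meta = json.loads(sample["metadata.json"])
    print(f"Text: {meta['text']}, Duration: {meta['duration_seconds']}s, CPS: {meta['chars_per_second']}")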
```
### Decoding Latents Back to Audio
```python
from dacvae import DACVAE
from huggingface_hub import hf_hub_download
import numpy as np
import torch
# Load the DAC VAE weights, move the model to the GPU, and switch to inference mode
model = DACVAE.load(hf_hub_download("mrfakename/dacvae-watermarked", "weights.pth")).cuda().eval()
latent = np.load("sample.dacvae.npy")  # [T_latent, 128]
z = torch.from_numpy(latent.T).unsqueeze(0).cuda()  # [1, 128, T_latent]
with torch.no_grad():  # inference only, so skip gradient tracking
    audio_48k = model.decode(z).squeeze(0).cpu()  # [1, T_audio] at 48kHz
```
## Current Status
**Shards uploaded**: 125
### Progress by Language
| Language and split | Samples |
|----------|---------|
| DE_train | 77,768 |
| ES_train | 74,808 |
| FR_train | 89,640 |
| IT_train | 59,623 |
| NL_train | 90,656 |
| PL_train | 25,043 |
| PT_train | 37,533 |
## Metadata Fields
Each `metadata.json` contains:
- `dataset`: Source dataset name
- `language`: Language code
- `split`: Data split (train/dev/test)
- `sample_id`: Original sample identifier
- `text`: Transcript
- `duration_seconds`: Audio duration in seconds
- `chars_per_second`: Text characters per second of audio
- `original_sample_rate`: Original audio sample rate
- `dacvae_sample_rate`: 48000 (DAC VAE input rate)
- `latent_frames`: Number of latent time frames
- Plus all original dataset-specific fields
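The fields listed above can be checked before a sample is used. The sketch below is a minimal illustration: the helper names and the example values are hypothetical, and it assumes `chars_per_second` is the transcript length in characters divided by `duration_seconds`, which the card implies but does not state exactly.

```python
REQUIRED_FIELDS = (
    "dataset", "language", "split", "sample_id", "text",
    "duration_seconds", "chars_per_second",
    "original_sample_rate", "dacvae_sample_rate", "latent_frames",
)

def missing_fields(meta: dict) -> list[str]:
    """Return the documented fields that are absent from a metadata dict."""
    return [field for field in REQUIRED_FIELDS if field not in meta]

def recomputed_chars_per_second(meta: dict) -> float:
    """Recompute characters per second (assumed: len(text) / duration_seconds)."""
    return len(meta["text"]) / meta["duration_seconds"]

# Hypothetical, deliberately incomplete metadata for illustration only
example_meta = {"text": "hello world", "duration_seconds": 2.0, "language": "en"}
absent = missing_fields(example_meta)
cps = recomputed_chars_per_second(example_meta)  # 11 characters / 2.0 s
```

Filtering on a recomputed rate like `cps` is a common way to drop samples whose transcript and audio are badly misaligned; any thresholds would need to be chosen per language.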
---
Generated with [Claude Code](https://claude.com/claude-code)
Provider:
TTS-AGI



