
TTS-AGI/mls-enhanced-dacvae

Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/TTS-AGI/mls-enhanced-dacvae
Resource description:
---
dataset_info:
  source: facebook/multilingual_librispeech
  format: WebDataset tar shards with DAC VAE latents
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
- text-to-speech
---

# Multilingual LibriSpeech converted to DAC VAE latents

## Source

[facebook/multilingual_librispeech](https://huggingface.co/datasets/facebook/multilingual_librispeech)

## Format

Each tar shard (~2GB) contains samples with three files per sample:

```
{sample_key}.audio.flac      # Original audio (FLAC, original sample rate)
{sample_key}.dacvae.npy      # DAC VAE latent [T_latent, 128] numpy float32
{sample_key}.metadata.json   # All metadata + duration_seconds + chars_per_second
```

### DAC VAE Latent Format

- **Model**: [mrfakename/dacvae-watermarked](https://huggingface.co/mrfakename/dacvae-watermarked) (Facebook DACVAE)
- **Input sample rate**: 48,000 Hz (audio resampled before encoding)
- **Latent shape**: `[T_latent, 128]` where `T_latent = ceil(audio_samples / 1920)`, with `audio_samples` counted after resampling to 48 kHz
- **Latent rate**: 25 frames/second (48,000 / 1,920)
- **Storage**: numpy float32

### Shard Naming

`{LANG}-{split}-{index:05d}.tar` (e.g., `EN-train-00000.tar`, `DE-train-00001.tar`)

## Loading

### With WebDataset

```python
import io
import json

import numpy as np
import soundfile as sf
import webdataset as wds

url = "https://huggingface.co/datasets/TTS-AGI/mls-enhanced-dacvae/resolve/main/EN-train-00000.tar"
# No .decode() here: webdataset's default decoders would already turn the
# .npy and .json entries into arrays and dicts, so np.load / json.loads
# below would fail. Without it, every field arrives as raw bytes.
dataset = wds.WebDataset(url)

for sample in dataset:
    audio_bytes = sample["audio.flac"]
    audio, sample_rate = sf.read(io.BytesIO(audio_bytes))  # original sample rate
    latent = np.load(io.BytesIO(sample["dacvae.npy"]))  # [T, 128]
    meta = json.loads(sample["metadata.json"])
    print(f"Text: {meta['text']}, Duration: {meta['duration_seconds']}s, CPS: {meta['chars_per_second']}")
```

### Decoding Latents Back to Audio

```python
import numpy as np
import torch
from dacvae import DACVAE
from huggingface_hub import hf_hub_download

model = DACVAE.load(hf_hub_download("mrfakename/dacvae-watermarked", "weights.pth")).cuda().eval()

latent = np.load("sample.dacvae.npy")  # [T_latent, 128]
z = torch.from_numpy(latent.T).unsqueeze(0).cuda()  # [1, 128, T_latent]
audio_48k = model.decode(z).squeeze(0).cpu()  # [1, T_audio] at 48kHz
```

## Current Status

**Shards uploaded**: 125

### Progress by Language

| Language | Samples |
|----------|---------|
| DE_train | 77,768 |
| ES_train | 74,808 |
| FR_train | 89,640 |
| IT_train | 59,623 |
| NL_train | 90,656 |
| PL_train | 25,043 |
| PT_train | 37,533 |

## Metadata Fields

Each `metadata.json` contains:

- `dataset`: Source dataset name
- `language`: Language code
- `split`: Data split (train/dev/test)
- `sample_id`: Original sample identifier
- `text`: Transcript
- `duration_seconds`: Audio duration in seconds
- `chars_per_second`: Text characters per second of audio
- `original_sample_rate`: Original audio sample rate
- `dacvae_sample_rate`: 48000 (DAC VAE input rate)
- `latent_frames`: Number of latent time frames
- Plus all original dataset-specific fields

---

Generated with [Claude Code](https://claude.com/claude-code)
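The latent-length formula in the card (`T_latent = ceil(audio_samples / 1920)` at 48 kHz) can be used as a quick sanity check on a sample's `latent_frames` and `duration_seconds` fields. The sketch below is illustrative only: the function names and the one-frame tolerance are assumptions introduced here, not part of the dataset tooling, and it assumes `duration_seconds` reflects the audio that was actually encoded.

```python
DACVAE_SAMPLE_RATE = 48_000  # input sample rate stated in the card
HOP_SAMPLES = 1_920          # samples per latent frame stated in the card


def expected_latent_frames(num_samples_48k: int) -> int:
    """Return ceil(num_samples_48k / 1920) using integer arithmetic."""
    return -(-num_samples_48k // HOP_SAMPLES)


def latent_frame_rate() -> float:
    """Latent frames per second implied by the sample rate and hop size."""
    return DACVAE_SAMPLE_RATE / HOP_SAMPLES


def frames_match_duration(latent_frames: int, duration_seconds: float,
                          tolerance: int = 1) -> bool:
    """Check a latent length against a duration (hypothetical helper).

    The tolerance absorbs rounding from resampling; its value is an assumption.
    """
    num_samples = round(duration_seconds * DACVAE_SAMPLE_RATE)
    return abs(latent_frames - expected_latent_frames(num_samples)) <= tolerance
```

With a decoded `metadata.json` dict `meta`, a call such as `frames_match_duration(meta["latent_frames"], meta["duration_seconds"])` would flag samples whose latent length disagrees with the stated duration.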
Provider:
TTS-AGI