
CowardDriver/ttsdata

Source: Hugging Face. Updated 2026-03-31; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/CowardDriver/ttsdata
Resource description:
---
license: other
license_name: see-source-datasets
tags:
- audio
- tts
- latents
- dac-vae
- speech
size_categories:
- 100K<n<1M
language:
- en
---

# EnvTTS Phase 1: Pre-encoded DAC-VAE Latents

Pre-encoded audio latents for Phase 1 training of **EnvAudioEdit** (Small ~195 M CFM-DiT TTS model). Audio from three English speech datasets is encoded offline with DAC-VAE (48 kHz, hop=1920), saving GPU time during training by avoiding on-the-fly encoding.

---

## Contents

```
latents.zip
└── latents/
    ├── cv/     # 180 000 files (humanify/common_voice_english, 10%)
    ├── ps/     # 216 000 files (humanify/ps, 10%)
    └── ht2/    # 313 000 files (humanify/ht2_44khz, 10%)
```

**Total: 709 000 `.pt` files, ~43 GB (zipped)**

---

## File Format

Each `.pt` file is a PyTorch tensor dict:

```python
{
    "z": Tensor[T, 128],   # DAC-VAE latent, float16
    "text": str,           # transcript
    "length": int,         # = T (number of latent frames)
}
```

| Field | Details |
|---|---|
| Audio codec | DAC-VAE (`matbee/sam-audio-small-onnx`) |
| Sample rate | 48 000 Hz |
| Hop length | 1 920 samples/frame |
| Latent dim | 128 |
| Max frames | 500 (≈ 20 s); longer clips truncated |
| dtype | float16 |

Time ↔ frame conversion: `seconds = frames × 1920 / 48000`

---

## Source Datasets

| Subdir | Source | Size used |
|---|---|---|
| `cv` | [humanify/common_voice_english](https://huggingface.co/datasets/humanify/common_voice_english) | 10 % ≈ 180 K |
| `ps` | [humanify/ps](https://huggingface.co/datasets/humanify/ps) | 10 % ≈ 216 K |
| `ht2` | [humanify/ht2_44khz](https://huggingface.co/datasets/humanify/ht2_44khz) | 10 % ≈ 313 K |

> Original audio is licensed under the respective source dataset licenses.
> This dataset distributes only derived latent representations.
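
The time ↔ frame conversion given under File Format can be sketched as a pair of helpers. This is a minimal illustration, not code from the dataset's repository: the function names are hypothetical, and flooring partial frames is an assumption, since the card does not say how the encoder handles a trailing partial hop.

```python
SAMPLE_RATE = 48_000  # Hz, from the File Format table
HOP_LENGTH = 1_920    # samples per latent frame, from the File Format table
MAX_FRAMES = 500      # clips longer than this are truncated


def frames_to_seconds(frames: int) -> float:
    """Duration in seconds covered by a number of latent frames."""
    return frames * HOP_LENGTH / SAMPLE_RATE


def seconds_to_frames(seconds: float) -> int:
    """Latent frames for a clip of the given duration, capped at MAX_FRAMES.

    Assumption: a trailing partial hop is dropped (floor); the real
    encoder may pad instead.
    """
    frames = int(seconds * SAMPLE_RATE // HOP_LENGTH)
    return min(frames, MAX_FRAMES)
```

Each frame therefore covers 40 ms, and the 500-frame cap corresponds to the 20 s limit stated in the table.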
---

## Usage

### Extract

```bash
unzip latents.zip -d data/
# → data/latents/cv/, data/latents/ps/, data/latents/ht2/
```

### Load a single sample

```python
import torch

sample = torch.load("data/latents/cv/000000049.pt", weights_only=False)
z = sample["z"]            # Tensor[T, 128], float16
text = sample["text"]      # str
length = sample["length"]  # int == z.shape[0]
```

### Use in EnvAudioEdit training (local latents mode)

Set `use_local_latents: true` in your training config and point `latent_dir` at the extracted directories:

```yaml
# configs/train_small_phase1.yaml
use_local_latents: true
latent_dir:
  - "data/latents/cv"
  - "data/latents/ps"
  - "data/latents/ht2"
```

Then launch training:

```bash
accelerate launch scripts/train_phase1.py --config configs/train_small_phase1.yaml
```

---

## Encoding Environment

| Package | Version |
|---|---|
| onnxruntime-gpu | 1.23.2 |
| nvidia-cudnn-cu12 | 9.5.1.17 |
| torch | 2.11.0+cu130 |

> cuDNN 9.20 has a Conv1D bug and 8.x breaks PyTorch, so pin to 9.5.1.17.

---

## Related

- Model checkpoints: [`<your-username>/envtts-small-phase1`](https://huggingface.co/<your-username>/envtts-small-phase1)
- Architecture & training plan: see the model repo README.
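
For use outside the EnvAudioEdit training script, the variable-length latents need to be padded before batching. The following is a minimal sketch of one way to do that. `LatentDataset` and `collate_latents` are hypothetical names, and this is not the loader EnvAudioEdit itself uses; it only assumes the per-file dict layout (`z`, `text`, `length`) described under File Format.

```python
from pathlib import Path

import torch
from torch.utils.data import Dataset


class LatentDataset(Dataset):
    """Hypothetical helper: indexes every .pt file under the given directories."""

    def __init__(self, latent_dirs):
        self.files = sorted(p for d in latent_dirs for p in Path(d).glob("*.pt"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Same torch.load call as the README's single-sample example.
        return torch.load(self.files[idx], weights_only=False)


def collate_latents(batch):
    """Zero-pad latents to the longest clip in the batch and build a frame mask."""
    lengths = torch.tensor([s["z"].shape[0] for s in batch])
    max_len = int(lengths.max())
    first = batch[0]["z"]
    z = torch.zeros(len(batch), max_len, first.shape[1], dtype=first.dtype)
    for i, s in enumerate(batch):
        z[i, : s["z"].shape[0]] = s["z"]
    # mask[i, t] is True where frame t of sample i is real data, False on padding.
    mask = torch.arange(max_len)[None, :] < lengths[:, None]
    return {"z": z, "mask": mask, "length": lengths, "text": [s["text"] for s in batch]}
```

A `torch.utils.data.DataLoader` can then be built with `collate_fn=collate_latents` over `LatentDataset(["data/latents/cv", "data/latents/ps", "data/latents/ht2"])`. Returning an explicit mask lets downstream code exclude padded frames from the loss.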
Provider:
CowardDriver