CowardDriver/ttsdata
Hugging Face · Updated 2026-03-31 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/CowardDriver/ttsdata
Description:
---
license: other
license_name: see-source-datasets
tags:
- audio
- tts
- latents
- dac-vae
- speech
size_categories:
- 100K<n<1M
language:
- en
---
# EnvTTS Phase 1 — Pre-encoded DAC-VAE Latents
Pre-encoded audio latents for Phase 1 training of **EnvAudioEdit** (Small variant, a CFM-DiT TTS model with ~195 M parameters).
Audio from three English speech datasets is encoded offline with DAC-VAE (48 kHz, hop=1920),
which saves GPU time during training by avoiding on-the-fly encoding.

---
## Contents
```
latents.zip
└── latents/
    ├── cv/   # 180 000 files (humanify/common_voice_english, 10%)
    ├── ps/   # 216 000 files (humanify/ps, 10%)
    └── ht2/  # 313 000 files (humanify/ht2_44khz, 10%)
```
**Total: 709 000 `.pt` files, ~43 GB (zipped)**

---
## File Format
Each `.pt` file holds a Python dict with one tensor and two metadata fields:
```python
{
    "z": Tensor[T, 128],  # DAC-VAE latent, float16
    "text": str,          # transcript
    "length": int,        # = T (number of latent frames)
}
```
| Field | Details |
|---|---|
| Audio codec | DAC-VAE (`matbee/sam-audio-small-onnx`) |
| Sample rate | 48 000 Hz |
| Hop length | 1 920 samples/frame |
| Latent dim | 128 |
| Max frames | 500 (= 20 s); longer clips are truncated |
| dtype | float16 |

Time ↔ frame conversion: `seconds = frames × 1920 / 48000`
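
The conversion can be sketched in a few lines of Python. The constants come from the table above; the function names are illustrative, and rounding down to whole frames in `seconds_to_frames` is an assumption, since this card does not say how the encoder handles a partial final frame.

```python
SAMPLE_RATE = 48_000  # Hz, from the table above
HOP_LENGTH = 1_920    # samples per latent frame, from the table above
MAX_FRAMES = 500      # truncation limit, from the table above


def frames_to_seconds(frames: int) -> float:
    """Duration in seconds covered by `frames` latent frames."""
    return frames * HOP_LENGTH / SAMPLE_RATE


def seconds_to_frames(seconds: float) -> int:
    """Whole latent frames in `seconds` of audio, capped at MAX_FRAMES.

    Rounding down is an assumption, not documented encoder behaviour.
    """
    return min(int(seconds * SAMPLE_RATE // HOP_LENGTH), MAX_FRAMES)


print(SAMPLE_RATE / HOP_LENGTH)       # 25.0 latent frames per second
print(frames_to_seconds(MAX_FRAMES))  # 20.0
print(seconds_to_frames(3.0))         # 75
```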

---
## Source Datasets
| Subdir | Source | Size used |
|---|---|---|
| `cv` | [humanify/common_voice_english](https://huggingface.co/datasets/humanify/common_voice_english) | 10 % ≈ 180 K |
| `ps` | [humanify/ps](https://huggingface.co/datasets/humanify/ps) | 10 % ≈ 216 K |
| `ht2` | [humanify/ht2_44khz](https://huggingface.co/datasets/humanify/ht2_44khz) | 10 % ≈ 313 K |
> Original audio is licensed under the respective source dataset licenses.
> This dataset distributes only derived latent representations.
---
## Usage
### Extract
```bash
unzip latents.zip -d data/
# → data/latents/cv/, data/latents/ps/, data/latents/ht2/
```
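After extraction, a quick file count can confirm that the archive unpacked completely. The sketch below assumes the `.pt` files sit directly inside each subdirectory (as the sample path in the loading example suggests) and compares against the approximate counts from the Contents listing; `count_latents` and `report` are illustrative names, not part of any released tooling.

```python
from pathlib import Path

# Approximate per-directory counts taken from the Contents listing.
EXPECTED = {"cv": 180_000, "ps": 216_000, "ht2": 313_000}


def count_latents(root: str) -> dict[str, int]:
    """Count `.pt` files directly inside each expected subdirectory of `root`."""
    counts = {}
    for name in EXPECTED:
        subdir = Path(root) / name
        # A missing directory counts as zero files instead of raising.
        counts[name] = sum(1 for _ in subdir.glob("*.pt")) if subdir.is_dir() else 0
    return counts


def report(root: str = "data/latents") -> None:
    """Print found versus listed counts for each subdirectory."""
    for name, found in count_latents(root).items():
        print(f"{name}: {found} files (listing says about {EXPECTED[name]})")


if __name__ == "__main__":
    report()
```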
### Load a single sample
```python
import torch
# weights_only=False runs the full unpickler, so only load files you trust.
sample = torch.load("data/latents/cv/000000049.pt", weights_only=False)
z = sample["z"] # Tensor[T, 128], float16
text = sample["text"] # str
length = sample["length"] # int == z.shape[0]
```
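A loaded sample can be checked against the layout documented in File Format before it reaches a training loop. This is a minimal sketch: `check_sample` is a hypothetical helper, and it reads only `z.shape`, so the demo at the bottom runs with a stand-in object instead of a real tensor.

```python
from types import SimpleNamespace

LATENT_DIM = 128  # latent dimension, from the File Format table
MAX_FRAMES = 500  # truncation limit, from the File Format table


def check_sample(sample: dict) -> None:
    """Raise ValueError if `sample` does not match the documented layout."""
    missing = {"z", "text", "length"} - sample.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    shape = tuple(sample["z"].shape)
    if len(shape) != 2 or shape[1] != LATENT_DIM:
        raise ValueError(f"expected z of shape [T, {LATENT_DIM}], got {shape}")
    if shape[0] > MAX_FRAMES:
        raise ValueError(f"{shape[0]} frames exceeds the {MAX_FRAMES}-frame limit")
    if sample["length"] != shape[0]:
        raise ValueError(f"length={sample['length']} but z has {shape[0]} frames")
    if not isinstance(sample["text"], str):
        raise ValueError("text must be a str")


# Stand-in for a real tensor; with PyTorch installed you would instead pass
# the dict returned by torch.load(...).
demo = {"z": SimpleNamespace(shape=(75, 128)), "text": "hello", "length": 75}
check_sample(demo)  # passes silently
```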
### Use in EnvAudioEdit training (local latents mode)
Set `use_local_latents: true` in your training config and point `latent_dir` at the extracted directories:
```yaml
# configs/train_small_phase1.yaml
use_local_latents: true
latent_dir:
- "data/latents/cv"
- "data/latents/ps"
- "data/latents/ht2"
```
Then launch training:
```bash
accelerate launch scripts/train_phase1.py --config configs/train_small_phase1.yaml
```
---
## Encoding Environment
| Package | Version |
|---|---|
| onnxruntime-gpu | 1.23.2 |
| nvidia-cudnn-cu12 | 9.5.1.17 |
| torch | 2.11.0+cu130 |
> cuDNN 9.20 has a Conv1D bug; 8.x breaks PyTorch — pin to 9.5.1.17.
---
## Related
- Model checkpoints: [`<your-username>/envtts-small-phase1`](https://huggingface.co/<your-username>/envtts-small-phase1)
- Architecture & training plan: see the model repo README.
Provided by:
CowardDriver



