TTS-AGI/voice-emo-cloning-dataset
Source: Hugging Face. Updated 2026-03-20; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/TTS-AGI/voice-emo-cloning-dataset
Description:
# Emotion-Cloning TTS Training Dataset
## Location
```
/home/deployer/laion/echo-tts-training-main/emotion_eval/dataset_output/
```
## Overview
This dataset contains **~22,518 training triplets** for fine-tuning a zero-shot voice+emotion cloning TTS model. Each sample provides everything needed to train a model that can clone both a speaker's voice identity AND their emotional delivery from separate reference audio clips.
The data is stored as **WebDataset `.tar` shards**, partitioned across 8 GPUs. Shards are written incrementally — the dataset is usable at any point during generation (balanced across all 40 emotions via round-robin ordering).
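
As an illustration of what round-robin ordering means here, the sketch below interleaves jobs from per-emotion queues so that any prefix of the job list covers the emotions as evenly as the bucket sizes allow. The function name and the toy buckets are hypothetical; the actual scheduler used by the pipeline is not shown in this README and may differ.

```python
from itertools import zip_longest

def round_robin_order(buckets):
    """Interleave jobs bucket by bucket; exhausted buckets simply drop out."""
    missing = object()  # sentinel that cannot collide with a real job ID
    ordered = []
    for row in zip_longest(*buckets.values(), fillvalue=missing):
        ordered.extend(job for job in row if job is not missing)
    return ordered

# Toy buckets with made-up job IDs in the same style as the sample keys
buckets = {
    "Anger": ["Anger_0000", "Anger_0001", "Anger_0002"],
    "Awe": ["Awe_0000"],
    "Relief": ["Relief_0000", "Relief_0001"],
}
jobs = round_robin_order(buckets)
print(jobs)
```

Small buckets run out early, so later portions of the job list are dominated by the larger emotion categories.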
**Generation is ongoing.** Check progress (run from the dataset directory, where the `checkpoint_gpu*.json` files are written):
```bash
total=0
for i in 0 1 2 3 4 5 6 7; do
  n=$(python3 -c "import json; print(len(json.load(open('checkpoint_gpu${i}.json'))))")
  total=$((total + n))
done
echo "$total / 22518 completed"
```
## Shard Format
Each shard is a standard WebDataset tar file: `shard-gpuXX-YYYYY.tar`
Each sample inside a shard has a unique key (e.g., `Anger_0612`) and contains these files:
| File | Format | Sample Rate | Description |
|------|--------|-------------|-------------|
| `{key}.target.wav` | WAV int16 | 44,100 Hz | Original emotional speech from the source dataset |
| `{key}.speaker_ref.wav` | WAV int16 | 44,100 Hz | **Sample A** — neutral speech voice-converted to the target speaker's identity |
| `{key}.emotion_ref.wav` | WAV int16 | 44,100 Hz | **Sample B** — LLM-paraphrased emotional speech, voice-converted to a neutral speaker's identity |
| `{key}.concat.wav` | WAV int16 | 44,100 Hz | Sample A + 10kHz sine separator (1s) + Sample B |
| `{key}.target.dacvae.npy` | NumPy float32 | — | DACVAE latent of target (encoded at 48kHz) |
| `{key}.speaker_ref.dacvae.npy` | NumPy float32 | — | DACVAE latent of Sample A |
| `{key}.emotion_ref.dacvae.npy` | NumPy float32 | — | DACVAE latent of Sample B |
| `{key}.concat.dacvae.npy` | NumPy float32 | — | DACVAE latent of concatenated audio |
| `{key}.metadata.json` | JSON | — | Full metadata (see below) |
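
The naming scheme above can be handled with a few lines of standard-library Python. This is a minimal sketch: the helper names are hypothetical, and reading `XX` as the GPU index and `YYYYY` as the batch index is an assumption based on the file listing in this README. Splitting at the first dot mirrors how WebDataset groups files into samples.

```python
import re

SHARD_RE = re.compile(r"shard-gpu(\d+)-(\d+)\.tar$")

def parse_shard_name(path):
    """Return (gpu_index, batch_index) parsed from a shard path."""
    match = SHARD_RE.search(path)
    if match is None:
        raise ValueError(f"not a shard file name: {path!r}")
    return int(match.group(1)), int(match.group(2))

def split_member_name(name):
    """Split a tar member name into (sample key, field) at the first dot."""
    key, _, field = name.partition(".")
    return key, field

print(parse_shard_name("dataset_output/shard-gpu03-00001.tar"))
print(split_member_name("Anger_0612.target.dacvae.npy"))
```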
### DACVAE Latent Format
- Shape: `(T, 128)` where T = number of time frames
- Model: `mrfakename/dacvae-watermarked` (encoder_rates=[2,8,10,12], codebook_dim=128, sample_rate=48000, hop=1920)
- To decode: `z = torch.from_numpy(latent.T).unsqueeze(0).to(device)` then `audio = dacvae.decode(z)`
- Output sample rate after decoding: **48,000 Hz**
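
With a hop of 1920 samples at 48,000 Hz, the latent runs at 48000 / 1920 = 25 frames per second, so T can be estimated from a clip's duration. The sketch below assumes the encoder pads the final partial hop; the exact padding behaviour of DACVAE is not documented here, so treat the result as an estimate rather than a guarantee.

```python
SAMPLE_RATE = 48_000
HOP = 2 * 8 * 10 * 12  # product of encoder_rates = 1920 samples per frame

def estimate_latent_frames(duration_s):
    """Estimate T for a clip of the given duration (seconds) at 48 kHz."""
    num_samples = round(duration_s * SAMPLE_RATE)
    return -(-num_samples // HOP)  # ceiling division

print(SAMPLE_RATE / HOP)              # 25.0 frames per second
print(estimate_latent_frames(13.36))  # 334 frames for the example target_duration
```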
### Metadata JSON Fields
```json
{
"emotion_bucket_label": "Anger",
"target_transcription": "original speech transcript",
"target_caption": "descriptive caption of the audio",
"generated_emotional_text": "LLM-paraphrased version (different words, same emotion)",
"cosine_similarity_score": 0.8853,
"best_seed": 123,
"target_emotion_magnitude_score": 2.504,
"target_duration": 13.36,
"neutral_emotion": "Sexual_Lust",
"neutral_text": "transcript of the neutral reference",
"length_mode": "longer|shorter|same",
"target_pitch": 1.791,
"target_gender": -0.5195,
"neutral_pitch": 1.839,
"neutral_gender": -1.224,
"target_empathic_scores": { "55 emotion + 4 quality scores": "..." },
"generated_empathic_scores": { "55 emotion + 4 quality scores": "..." },
"speaker_ref_duration": 5.8,
"emotion_ref_duration": 17.28,
"concat_duration": 24.08
}
```
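
The three duration fields are related: in the example above, 5.8 s (speaker reference) + 1.0 s (sine separator) + 17.28 s (emotion reference) = 24.08 s, which matches `concat_duration`. A small sanity check along these lines can catch truncated or mismatched samples. The function name and the 0.05 s tolerance are arbitrary choices for illustration.

```python
import math

SEPARATOR_S = 1.0  # length of the sine separator in seconds

def concat_duration_ok(meta, tol=0.05):
    """Check that concat_duration equals speaker_ref + separator + emotion_ref."""
    expected = meta["speaker_ref_duration"] + SEPARATOR_S + meta["emotion_ref_duration"]
    return math.isclose(meta["concat_duration"], expected, abs_tol=tol)

meta = {
    "speaker_ref_duration": 5.8,
    "emotion_ref_duration": 17.28,
    "concat_duration": 24.08,
}
print(concat_duration_ok(meta))  # True
```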
## How the Triplets Were Built
Each training sample was constructed through a 9-step pipeline:
1. **Target selection**: Top emotional samples from `TTS-AGI/emotion-attribute-conditioning-dacvae` (40 emotion buckets, min 5s duration, ranked by emotion magnitude)
2. **Neutral selection**: A sample from a *different* emotion bucket with pitch and gender score difference >= 2.0 from target (ensures clearly different speaker characteristics)
3. **Voice conversion A**: Neutral audio → target speaker identity using Chatterbox VC (creates **Sample A / Speaker Ref** — same voice as target, neutral emotion)
4. **LLM paraphrase**: Gemini rewrites the target transcript with entirely different words but same emotion+meaning. Length distribution: 25% shorter, 25% same, 50% longer
5. **TTS generation**: Echo TTS generates the paraphrase using the target audio as style reference (3 seeds: 42, 123, 456)
6. **Emotion scoring**: Empathic Insight Voice+ (BUD-E-Whisper + 55 emotion MLPs) scores both target and each TTS generation
7. **Best selection**: TTS generation with highest cosine similarity to target's emotion vector is selected
8. **Voice conversion B**: Best TTS → neutral speaker identity using Chatterbox VC (creates **Sample B / Emotion Ref** — different voice from target, same emotion)
9. **DACVAE encoding**: All audio encoded to latent space for efficient training
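
Steps 6 and 7 amount to an argmax over cosine similarities. The sketch below shows that selection in plain Python. The function names and the toy three-dimensional vectors are hypothetical stand-ins for the real emotion score vectors, and the pipeline's own implementation may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pick_best_seed(target_vec, candidates):
    """Return (seed, score) of the candidate closest to the target emotion vector."""
    scores = {seed: cosine_similarity(target_vec, vec) for seed, vec in candidates.items()}
    best_seed = max(scores, key=scores.get)
    return best_seed, scores[best_seed]

# Toy emotion vectors, one per TTS seed (made-up values)
target = [2.0, 0.5, 0.1]
candidates = {42: [0.1, 2.0, 0.3], 123: [1.8, 0.6, 0.0], 456: [1.0, 1.0, 1.0]}
best_seed, score = pick_best_seed(target, candidates)
print(best_seed, round(score, 4))
```

The returned pair corresponds to the `best_seed` and `cosine_similarity_score` fields stored in the metadata.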
### Training Concept
The model should learn to:
- **From Sample A (speaker_ref)**: Clone the speaker's voice/identity
- **From Sample B (emotion_ref)**: Clone the emotional delivery style
- **Generate**: Speech that sounds like Sample A's voice with Sample B's emotion
The `concat.wav` / `concat.dacvae.npy` provides a single-file input format: `[speaker_ref] [sine_separator] [emotion_ref]`
## 40 Emotion Categories
| Emotion | Samples | | Emotion | Samples |
|---------|--------:|-|---------|--------:|
| Affection | 1,000 | | Interest | 1,000 |
| Amusement | 1,000 | | Intoxication/Altered States | 1,000 |
| Anger | 1,000 | | Jealousy & Envy | 46 |
| Astonishment/Surprise | 1,000 | | Longing | 183 |
| Awe | 134 | | Malevolence/Malice | 374 |
| Bitterness | 41 | | Pain | 251 |
| Concentration | 1,000 | | Pleasure/Ecstasy | 5 |
| Confusion | 1,000 | | Pride | 280 |
| Contemplation | 1,000 | | Relief | 1,000 |
| Contempt | 143 | | Sadness | 496 |
| Contentment | 256 | | Sexual Lust | 927 |
| Disappointment | 666 | | Shame | 512 |
| Disgust | 124 | | Sourness | 15 |
| Distress | 975 | | Teasing | 151 |
| Doubt | 199 | | Thankfulness/Gratitude | 1,000 |
| Elation | 1,000 | | Triumph | 774 |
| Embarrassment | 75 | | Fatigue/Exhaustion | 1,000 |
| Emotional Numbness | 68 | | Hope/Enthusiasm/Optimism | 1,000 |
| Fear | 384 | | Impatience/Irritability | 1,000 |
| Infatuation | 407 | | **Total** | **22,518** |
## Loading the Data
### With WebDataset (recommended for training)
```python
import webdataset as wds
import numpy as np
import json
import glob

# Find all shards written so far
shards = sorted(glob.glob("/home/deployer/laion/echo-tts-training-main/emotion_eval/dataset_output/shard-gpu*.tar"))

dataset = (
    wds.WebDataset(shards)
    # Default handlers decode npy and json; wav stays raw bytes unless an
    # audio handler (e.g. wds.torch_audio) is passed to decode()
    .decode()
    .to_tuple("concat.dacvae.npy", "target.dacvae.npy", "metadata.json")
)

for concat_latent, target_latent, metadata in dataset:
    emotion = metadata["emotion_bucket_label"]
    cosine = metadata["cosine_similarity_score"]
    # concat_latent shape: (T_concat, 128), speaker_ref + sine + emotion_ref
    # target_latent shape: (T_target, 128), ground-truth emotional speech
    ...
```
### With WebDataset (individual components)
```python
dataset = (
    wds.WebDataset(shards)
    .decode()
    .to_tuple(
        "speaker_ref.dacvae.npy",  # Sample A latent (voice identity)
        "emotion_ref.dacvae.npy",  # Sample B latent (emotional delivery)
        "target.dacvae.npy",       # Ground truth target latent
        "metadata.json",
    )
)

for speaker_latent, emotion_latent, target_latent, metadata in dataset:
    # speaker_latent: neutral content, target voice identity
    # emotion_latent: emotional content, neutral voice identity
    # target_latent: ground truth (target voice + target emotion)
    ...
```
### Manual tar extraction
```python
import tarfile
import numpy as np
import json

with tarfile.open("shard-gpu00-00000.tar") as tf:
    for member in tf:
        if member.name.endswith(".metadata.json"):
            data = json.loads(tf.extractfile(member).read())
            key = member.name.replace(".metadata.json", "")
            print(f"{key}: {data['emotion_bucket_label']} cosine={data['cosine_similarity_score']:.3f}")
```
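
The same approach is enough to tally samples per emotion in a shard, which is handy while generation is still running. This is a sketch: `count_emotions` and `make_demo_shard` are hypothetical helpers, and the demo builds a tiny in-memory tar with made-up entries rather than reading a real shard.

```python
import io
import json
import tarfile
from collections import Counter

def count_emotions(fileobj):
    """Count samples per emotion_bucket_label in an open tar file object."""
    counts = Counter()
    with tarfile.open(fileobj=fileobj) as tf:
        for member in tf:
            if member.name.endswith(".metadata.json"):
                meta = json.loads(tf.extractfile(member).read())
                counts[meta["emotion_bucket_label"]] += 1
    return counts

def make_demo_shard(labels):
    """Build an in-memory tar holding one metadata.json per label (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for i, label in enumerate(labels):
            payload = json.dumps({"emotion_bucket_label": label}).encode()
            info = tarfile.TarInfo(name=f"{label}_{i:04d}.metadata.json")
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf

counts = count_emotions(make_demo_shard(["Anger", "Awe", "Anger"]))
print(counts.most_common())
```

For a real shard, open it in binary mode and pass the file object, for example `count_emotions(open("shard-gpu00-00000.tar", "rb"))`.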
### Decoding DACVAE latents back to audio
```python
from dacvae import DACVAE
from huggingface_hub import hf_hub_download
import torch
import numpy as np

weights = hf_hub_download("mrfakename/dacvae-watermarked", "weights.pth")
dacvae = DACVAE.load(weights).to("cuda").eval()

latent = np.load("sample.dacvae.npy")  # shape (T, 128)
# Transpose to (128, T) and add a batch dimension: (1, 128, T)
z = torch.from_numpy(latent.astype(np.float32)).T.unsqueeze(0).to("cuda")
with torch.no_grad():
    audio = dacvae.decode(z).squeeze(0).cpu()
# audio shape: (1, num_samples), sample_rate = 48000
```
### Generating the 10kHz sine separator
The separator between Sample A and Sample B in `concat.wav` is a 1-second 10kHz sine tone at 0.5 amplitude. This acts as a clear delimiter the model can learn to recognize.
```python
import torch
import math

def generate_sine_separator(sample_rate=44100, freq=10000, duration=1.0, amplitude=0.5):
    """Generate the 10kHz sine tone separator used between speaker_ref and emotion_ref."""
    t = torch.linspace(0, duration, int(sample_rate * duration))
    sine = (amplitude * torch.sin(2 * math.pi * freq * t)).unsqueeze(0)  # shape: (1, num_samples)
    return sine

separator = generate_sine_separator()
# separator shape: (1, 44100), i.e. 1 channel, 1 second at 44.1kHz
```
### Concatenating speaker_ref + separator + emotion_ref
To build the concatenated input from individual components (e.g., at inference time or if you want to reconstruct `concat.wav` from the separate files):
```python
import torch
import torchaudio
import math

def generate_sine_separator(sr=44100, freq=10000, dur=1.0):
    t = torch.linspace(0, dur, int(sr * dur))
    return (0.5 * torch.sin(2 * math.pi * freq * t)).unsqueeze(0)

# From wav files
speaker_ref, sr = torchaudio.load("speaker_ref.wav")  # (1, T1) at 44100Hz
emotion_ref, sr = torchaudio.load("emotion_ref.wav")  # (1, T2) at 44100Hz
separator = generate_sine_separator(sr=sr)  # (1, 44100)
concat = torch.cat([speaker_ref, separator, emotion_ref], dim=1)
torchaudio.save("concat.wav", concat, sr)
```
From DACVAE latents (for latent-space training):
```python
import math
import numpy as np
from dacvae import DACVAE
from huggingface_hub import hf_hub_download
import torch
import torchaudio

# Load DACVAE
weights = hf_hub_download("mrfakename/dacvae-watermarked", "weights.pth")
dacvae = DACVAE.load(weights).to("cuda").eval()
DACVAE_SR = 48000
ECHO_SR = 44100

def generate_sine_separator(sr=44100, freq=10000, dur=1.0):
    # Same helper as in the wav example, repeated so this block runs on its own
    t = torch.linspace(0, dur, int(sr * dur))
    return (0.5 * torch.sin(2 * math.pi * freq * t)).unsqueeze(0)

def decode_latent(dacvae, npy_path, device="cuda"):
    latent = np.load(npy_path)
    z = torch.from_numpy(latent.astype(np.float32)).T.unsqueeze(0).to(device)
    with torch.no_grad():
        return dacvae.decode(z).squeeze(0).cpu()  # (1, T) at 48kHz

def encode_audio(dacvae, audio, device="cuda"):
    with torch.no_grad():
        z = dacvae.encode(audio.unsqueeze(0).to(device))
    return z.squeeze(0).T.cpu().numpy()  # (T, 128)

# Decode individual latents → 48kHz audio
speaker_48k = decode_latent(dacvae, "speaker_ref.dacvae.npy")
emotion_48k = decode_latent(dacvae, "emotion_ref.dacvae.npy")

# Resample to 44.1kHz for concatenation
resample = torchaudio.transforms.Resample(DACVAE_SR, ECHO_SR)
speaker_44k = resample(speaker_48k)
emotion_44k = resample(emotion_48k)

# Generate separator at 44.1kHz, then concatenate
separator = generate_sine_separator(sr=ECHO_SR)
concat_44k = torch.cat([speaker_44k, separator, emotion_44k], dim=1)

# Resample back to 48kHz and encode to DACVAE latent
concat_48k = torchaudio.transforms.Resample(ECHO_SR, DACVAE_SR)(concat_44k)
concat_latent = encode_audio(dacvae, concat_48k)
np.save("concat.dacvae.npy", concat_latent)
```
Note: The pre-built `concat.dacvae.npy` in the shards is the recommended way to use the concatenated input. Only rebuild from components if you need to modify the separator or combine different speaker/emotion refs at inference time.
## Quality Filtering
The `cosine_similarity_score` in metadata measures how well the generated emotional speech matches the target's emotion profile (40-dim emotion vector cosine similarity, excluding quality scores). Use this to filter:
```python
# High-quality subset (cosine > 0.85)
# After .decode(), "metadata.json" is already a dict, so index it directly
dataset = (
    wds.WebDataset(shards)
    .decode()
    .select(lambda sample: sample["metadata.json"]["cosine_similarity_score"] > 0.85)
)
```
## Models Used
| Component | Model | Source |
|-----------|-------|--------|
| Audio autoencoder | DACVAE | `mrfakename/dacvae-watermarked` |
| Voice conversion | Chatterbox VC | `chatterbox-tts` (Resemble AI) |
| TTS generation | Open Echo TTS | `jordand/echo-tts-base` |
| Emotion scoring | Empathic Insight Voice+ | `laion/BUD-E-Whisper` + `laion/Empathic-Insight-Voice-Plus` |
| Text paraphrase | Gemini 2.5 Flash | Google Gemini API |
## Source Dataset
`TTS-AGI/emotion-attribute-conditioning-dacvae` on Hugging Face — 88,171 annotated audio samples across 40 emotion categories, stored as DACVAE latents with metadata (transcription, caption, emotion scores, pitch, gender).
## File Structure
```
dataset_output/
  shard-gpu00-00000.tar    # WebDataset shard from GPU 0, batch 0
  shard-gpu00-00001.tar    # ... batch 1 (created when batch 0 reaches 2000 samples)
  shard-gpu01-00000.tar    # WebDataset shard from GPU 1
  ...
  checkpoint_gpu0.json     # List of completed job IDs for GPU 0
  checkpoint_gpu1.json     # ...
  ...
  README.md                # This file
```
## Resuming / Monitoring
The pipeline is fully resumable. If workers crash, just relaunch:
```bash
cd /home/deployer/laion/echo-tts-training-main/emotion_eval
LD_LIBRARY_PATH="" nohup /home/deployer/laion/spiritvenv/bin/python pipeline_launch.py > jobs_full/launcher.log 2>&1 &
```
Monitor progress:
```bash
# Quick count
for i in 0 1 2 3 4 5 6 7; do
  echo -n "GPU $i: "
  python3 -c "import json; print(len(json.load(open('dataset_output/checkpoint_gpu${i}.json'))))"
done
# Live worker logs
tail -f jobs_full/gpu_0.log
```
Provider:
TTS-AGI



