TTS-AGI/voice-emo-cloning-dataset

Name: TTS-AGI/voice-emo-cloning-dataset
Creator: TTS-AGI
Published: 2026-03-20 19:36:04
License: No description available

Hugging Face · Updated 2026-03-20 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/TTS-AGI/voice-emo-cloning-dataset

Resource description:

# Emotion-Cloning TTS Training Dataset

## Location

```
/home/deployer/laion/echo-tts-training-main/emotion_eval/dataset_output/
```

## Overview

This dataset contains **~22,518 training triplets** for fine-tuning a zero-shot voice+emotion cloning TTS model. Each sample provides everything needed to train a model that can clone both a speaker's voice identity AND their emotional delivery from separate reference audio clips.

The data is stored as **WebDataset `.tar` shards**, partitioned across 8 GPUs. Shards are written incrementally, so the dataset is usable at any point during generation; round-robin ordering keeps it balanced across all 40 emotions.

**Generation is ongoing.** Check progress:

```bash
total=0
for i in 0 1 2 3 4 5 6 7; do
  n=$(python3 -c "import json; print(len(json.load(open('checkpoint_gpu${i}.json'))))")
  total=$((total + n))
done
echo "$total / 22518 completed"
```

## Shard Format

Each shard is a standard WebDataset tar file: `shard-gpuXX-YYYYY.tar`

Each sample inside a shard has a unique key (e.g., `Anger_0612`) and contains these files:

| File | Format | Sample Rate | Description |
|------|--------|-------------|-------------|
| `{key}.target.wav` | WAV int16 | 44,100 Hz | Original emotional speech from the source dataset |
| `{key}.speaker_ref.wav` | WAV int16 | 44,100 Hz | **Sample A**: neutral speech voice-converted to the target speaker's identity |
| `{key}.emotion_ref.wav` | WAV int16 | 44,100 Hz | **Sample B**: LLM-paraphrased emotional speech, voice-converted to a neutral speaker's identity |
| `{key}.concat.wav` | WAV int16 | 44,100 Hz | Sample A + 10 kHz sine separator (1 s) + Sample B |
| `{key}.target.dacvae.npy` | NumPy float32 | n/a | DACVAE latent of target (encoded at 48 kHz) |
| `{key}.speaker_ref.dacvae.npy` | NumPy float32 | n/a | DACVAE latent of Sample A |
| `{key}.emotion_ref.dacvae.npy` | NumPy float32 | n/a | DACVAE latent of Sample B |
| `{key}.concat.dacvae.npy` | NumPy float32 | n/a | DACVAE latent of concatenated audio |
| `{key}.metadata.json` | JSON | n/a | Full metadata (see below) |

### DACVAE Latent Format

- Shape: `(T, 128)` where T = number of time frames
- Model: `mrfakename/dacvae-watermarked` (encoder_rates=[2,8,10,12], codebook_dim=128, sample_rate=48000, hop=1920)
- To decode: `z = torch.from_numpy(latent.T).unsqueeze(0).to(device)` then `audio = dacvae.decode(z)`
- Output sample rate after decoding: **48,000 Hz**

### Metadata JSON Fields

```json
{
  "emotion_bucket_label": "Anger",
  "target_transcription": "original speech transcript",
  "target_caption": "descriptive caption of the audio",
  "generated_emotional_text": "LLM-paraphrased version (different words, same emotion)",
  "cosine_similarity_score": 0.8853,
  "best_seed": 123,
  "target_emotion_magnitude_score": 2.504,
  "target_duration": 13.36,
  "neutral_emotion": "Sexual_Lust",
  "neutral_text": "transcript of the neutral reference",
  "length_mode": "longer|shorter|same",
  "target_pitch": 1.791,
  "target_gender": -0.5195,
  "neutral_pitch": 1.839,
  "neutral_gender": -1.224,
  "target_empathic_scores": { "55 emotion + 4 quality scores": "..." },
  "generated_empathic_scores": { "55 emotion + 4 quality scores": "..." },
  "speaker_ref_duration": 5.8,
  "emotion_ref_duration": 17.28,
  "concat_duration": 24.08
}
```

## How the Triplets Were Built

Each training sample was constructed through a 9-step pipeline:

1. **Target selection**: Top emotional samples from `TTS-AGI/emotion-attribute-conditioning-dacvae` (40 emotion buckets, min 5s duration, ranked by emotion magnitude)
2. **Neutral selection**: A sample from a *different* emotion bucket with pitch and gender score difference >= 2.0 from target (ensures clearly different speaker characteristics)
3. **Voice conversion A**: Neutral audio → target speaker identity using Chatterbox VC (creates **Sample A / Speaker Ref**: same voice as target, neutral emotion)
4. **LLM paraphrase**: Gemini rewrites the target transcript with entirely different words but the same emotion and meaning. Length distribution: 25% shorter, 25% same, 50% longer
5. **TTS generation**: Echo TTS generates the paraphrase using the target audio as style reference (3 seeds: 42, 123, 456)
6. **Emotion scoring**: Empathic Insight Voice+ (BUD-E-Whisper + 55 emotion MLPs) scores both target and each TTS generation
7. **Best selection**: The TTS generation with the highest cosine similarity to the target's emotion vector is selected
8. **Voice conversion B**: Best TTS → neutral speaker identity using Chatterbox VC (creates **Sample B / Emotion Ref**: different voice from target, same emotion)
9. **DACVAE encoding**: All audio encoded to latent space for efficient training

### Training Concept

The model should learn to:

- **From Sample A (speaker_ref)**: Clone the speaker's voice/identity
- **From Sample B (emotion_ref)**: Clone the emotional delivery style
- **Generate**: Speech that sounds like Sample A's voice with Sample B's emotion

The `concat.wav` / `concat.dacvae.npy` provides a single-file input format: `[speaker_ref] [sine_separator] [emotion_ref]`

## 40 Emotion Categories

| Emotion | Samples | | Emotion | Samples |
|---------|--------:|-|---------|--------:|
| Affection | 1,000 | | Interest | 1,000 |
| Amusement | 1,000 | | Intoxication/Altered States | 1,000 |
| Anger | 1,000 | | Jealousy & Envy | 46 |
| Astonishment/Surprise | 1,000 | | Longing | 183 |
| Awe | 134 | | Malevolence/Malice | 374 |
| Bitterness | 41 | | Pain | 251 |
| Concentration | 1,000 | | Pleasure/Ecstasy | 5 |
| Confusion | 1,000 | | Pride | 280 |
| Contemplation | 1,000 | | Relief | 1,000 |
| Contempt | 143 | | Sadness | 496 |
| Contentment | 256 | | Sexual Lust | 927 |
| Disappointment | 666 | | Shame | 512 |
| Disgust | 124 | | Sourness | 15 |
| Distress | 975 | | Teasing | 151 |
| Doubt | 199 | | Thankfulness/Gratitude | 1,000 |
| Elation | 1,000 | | Triumph | 774 |
| Embarrassment | 75 | | Fatigue/Exhaustion | 1,000 |
| Emotional Numbness | 68 | | Hope/Enthusiasm/Optimism | 1,000 |
| Fear | 384 | | Impatience/Irritability | 1,000 |
| Infatuation | 407 | | **Total** | **22,518** |

## Loading the Data

### With WebDataset (recommended for training)

```python
import webdataset as wds
import numpy as np
import json
import glob

# Find all completed shards
shards = sorted(glob.glob("/home/deployer/laion/echo-tts-training-main/emotion_eval/dataset_output/shard-gpu*.tar"))

dataset = (
    wds.WebDataset(shards)
    .decode()  # auto-decodes npy and json entries by extension
    .to_tuple("concat.dacvae.npy", "target.dacvae.npy", "metadata.json")
)

for concat_latent, target_latent, metadata in dataset:
    emotion = metadata["emotion_bucket_label"]
    cosine = metadata["cosine_similarity_score"]
    # concat_latent shape: (T, 128), speaker_ref + sine + emotion_ref
    # target_latent shape: (T, 128), ground truth emotional speech
    ...
```

### With WebDataset (individual components)

```python
# Reuses `wds` and `shards` from the previous example
dataset = (
    wds.WebDataset(shards)
    .decode()
    .to_tuple(
        "speaker_ref.dacvae.npy",  # Sample A latent (voice identity)
        "emotion_ref.dacvae.npy",  # Sample B latent (emotional delivery)
        "target.dacvae.npy",       # Ground truth target latent
        "metadata.json",
    )
)

for speaker_latent, emotion_latent, target_latent, metadata in dataset:
    # speaker_latent: neutral content, target voice identity
    # emotion_latent: emotional content, neutral voice identity
    # target_latent: ground truth (target voice + target emotion)
    ...
```

### Manual tar extraction

```python
import tarfile
import numpy as np
import json

with tarfile.open("shard-gpu00-00000.tar") as tf:
    for member in tf:
        if member.name.endswith(".metadata.json"):
            data = json.loads(tf.extractfile(member).read())
            key = member.name.replace(".metadata.json", "")
            print(f"{key}: {data['emotion_bucket_label']} cosine={data['cosine_similarity_score']:.3f}")
```

### Decoding DACVAE latents back to audio

```python
from dacvae import DACVAE
from huggingface_hub import hf_hub_download
import torch
import numpy as np

weights = hf_hub_download("mrfakename/dacvae-watermarked", "weights.pth")
dacvae = DACVAE.load(weights).to("cuda").eval()

latent = np.load("sample.dacvae.npy")  # shape (T, 128)
z = torch.from_numpy(latent.astype(np.float32)).T.unsqueeze(0).to("cuda")
with torch.no_grad():
    audio = dacvae.decode(z).squeeze(0).cpu()
# audio shape: (1, num_samples), sample_rate = 48000
```

### Generating the 10kHz sine separator

The separator between Sample A and Sample B in `concat.wav` is a 1-second 10kHz sine tone at 0.5 amplitude. This acts as a clear delimiter the model can learn to recognize.

```python
import torch
import math

def generate_sine_separator(sample_rate=44100, freq=10000, duration=1.0, amplitude=0.5):
    """Generate the 10kHz sine tone separator used between speaker_ref and emotion_ref."""
    t = torch.linspace(0, duration, int(sample_rate * duration))
    sine = (amplitude * torch.sin(2 * math.pi * freq * t)).unsqueeze(0)  # shape: (1, num_samples)
    return sine

separator = generate_sine_separator()
# separator shape: (1, 44100), i.e. 1 channel, 1 second at 44.1kHz
```

### Concatenating speaker_ref + separator + emotion_ref

To build the concatenated input from individual components (e.g., at inference time, or to reconstruct `concat.wav` from the separate files):

```python
import torch
import torchaudio
import math

def generate_sine_separator(sr=44100, freq=10000, dur=1.0):
    t = torch.linspace(0, dur, int(sr * dur))
    return (0.5 * torch.sin(2 * math.pi * freq * t)).unsqueeze(0)

# From wav files
speaker_ref, sr = torchaudio.load("speaker_ref.wav")  # (1, T1) at 44100Hz
emotion_ref, sr = torchaudio.load("emotion_ref.wav")  # (1, T2) at 44100Hz
separator = generate_sine_separator(sr=sr)            # (1, 44100)

concat = torch.cat([speaker_ref, separator, emotion_ref], dim=1)
torchaudio.save("concat.wav", concat, sr)
```

From DACVAE latents (for latent-space training):

```python
import math
import numpy as np
from dacvae import DACVAE
from huggingface_hub import hf_hub_download
import torch
import torchaudio

# Load DACVAE
weights = hf_hub_download("mrfakename/dacvae-watermarked", "weights.pth")
dacvae = DACVAE.load(weights).to("cuda").eval()

DACVAE_SR = 48000
ECHO_SR = 44100

def generate_sine_separator(sr=44100, freq=10000, dur=1.0):
    # Same helper as in the wav example, repeated so this block runs on its own
    t = torch.linspace(0, dur, int(sr * dur))
    return (0.5 * torch.sin(2 * math.pi * freq * t)).unsqueeze(0)

def decode_latent(dacvae, npy_path, device="cuda"):
    latent = np.load(npy_path)
    z = torch.from_numpy(latent.astype(np.float32)).T.unsqueeze(0).to(device)
    with torch.no_grad():
        return dacvae.decode(z).squeeze(0).cpu()  # (1, T) at 48kHz

def encode_audio(dacvae, audio, device="cuda"):
    with torch.no_grad():
        z = dacvae.encode(audio.unsqueeze(0).to(device))
    return z.squeeze(0).T.cpu().numpy()  # (T, 128)

# Decode individual latents → 48kHz audio
speaker_48k = decode_latent(dacvae, "speaker_ref.dacvae.npy")
emotion_48k = decode_latent(dacvae, "emotion_ref.dacvae.npy")

# Resample to 44.1kHz for concatenation
resample = torchaudio.transforms.Resample(DACVAE_SR, ECHO_SR)
speaker_44k = resample(speaker_48k)
emotion_44k = resample(emotion_48k)

# Generate separator at 44.1kHz, then concatenate
separator = generate_sine_separator(sr=ECHO_SR)
concat_44k = torch.cat([speaker_44k, separator, emotion_44k], dim=1)

# Resample back to 48kHz and encode to DACVAE latent
concat_48k = torchaudio.transforms.Resample(ECHO_SR, DACVAE_SR)(concat_44k)
concat_latent = encode_audio(dacvae, concat_48k)
np.save("concat.dacvae.npy", concat_latent)
```

Note: The pre-built `concat.dacvae.npy` in the shards is the recommended way to use the concatenated input. Only rebuild from components if you need to modify the separator or combine different speaker/emotion refs at inference time.

## Quality Filtering

The `cosine_similarity_score` in metadata measures how well the generated emotional speech matches the target's emotion profile (40-dim emotion vector cosine similarity, excluding quality scores). Use this to filter:

```python
# High-quality subset (cosine > 0.85)
dataset = (
    wds.WebDataset(shards)
    .decode()
    # After .decode(), metadata.json is already a dict, so index it directly
    .select(lambda sample: sample["metadata.json"]["cosine_similarity_score"] > 0.85)
)
```

## Models Used

| Component | Model | Source |
|-----------|-------|--------|
| Audio autoencoder | DACVAE | `mrfakename/dacvae-watermarked` |
| Voice conversion | Chatterbox VC | `chatterbox-tts` (Resemble AI) |
| TTS generation | Open Echo TTS | `jordand/echo-tts-base` |
| Emotion scoring | Empathic Insight Voice+ | `laion/BUD-E-Whisper` + `laion/Empathic-Insight-Voice-Plus` |
| Text paraphrase | Gemini 2.5 Flash | Google Gemini API |

## Source Dataset

`TTS-AGI/emotion-attribute-conditioning-dacvae` on Hugging Face: 88,171 annotated audio samples across 40 emotion categories, stored as DACVAE latents with metadata (transcription, caption, emotion scores, pitch, gender).

## File Structure

```
dataset_output/
  shard-gpu00-00000.tar    # WebDataset shard from GPU 0, batch 0
  shard-gpu00-00001.tar    # ... batch 1 (created when batch 0 reaches 2000 samples)
  shard-gpu01-00000.tar    # WebDataset shard from GPU 1
  ...
  checkpoint_gpu0.json     # List of completed job IDs for GPU 0
  checkpoint_gpu1.json     # ...
  ...
  README.md                # This file
```

## Resuming / Monitoring

The pipeline is fully resumable. If workers crash, just relaunch:

```bash
cd /home/deployer/laion/echo-tts-training-main/emotion_eval
LD_LIBRARY_PATH="" nohup /home/deployer/laion/spiritvenv/bin/python pipeline_launch.py > jobs_full/launcher.log 2>&1 &
```

Monitor progress:

```bash
# Quick count
for i in 0 1 2 3 4 5 6 7; do
  echo -n "GPU $i: "
  python3 -c "import json; print(len(json.load(open('dataset_output/checkpoint_gpu${i}.json'))))"
done

# Live worker logs
tail -f jobs_full/gpu_0.log
```
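
The "best selection" step of the pipeline (pick the TTS generation whose emotion vector has the highest cosine similarity to the target's) can be illustrated with a small sketch. This is not the pipeline's actual code: the function names `cosine_similarity` and `select_best_seed`, the `emotion_keys` argument, and the toy scores are all hypothetical, and which score keys make up the emotion vector is left to the caller.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two 1-D vectors; returns 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def select_best_seed(target_scores, candidate_scores, emotion_keys):
    """Pick the seed whose emotion vector is closest to the target's.

    target_scores:    dict mapping score name -> value for the target audio
    candidate_scores: dict mapping seed -> score dict for each TTS generation
    emotion_keys:     names to include in the vector (assumption: the caller
                      passes only emotion scores, not quality scores)
    """
    target_vec = [target_scores[k] for k in emotion_keys]
    sims = {
        seed: cosine_similarity(target_vec, [scores[k] for k in emotion_keys])
        for seed, scores in candidate_scores.items()
    }
    best = max(sims, key=sims.get)
    return best, sims[best]

# Toy example with invented scores for the three seeds named in the pipeline
target = {"Anger": 2.0, "Sadness": 0.0}
candidates = {
    42: {"Anger": 0.0, "Sadness": 1.0},
    123: {"Anger": 4.0, "Sadness": 0.0},
    456: {"Anger": 1.0, "Sadness": 1.0},
}
best_seed, best_score = select_best_seed(target, candidates, ["Anger", "Sadness"])
print(best_seed, round(best_score, 4))
```

Cosine similarity ignores vector magnitude, so a generation with the same emotion mix at a different overall intensity still scores as a close match.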

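The DACVAE settings listed in the README (sample_rate=48000, hop=1920) imply 25 latent frames per second of audio. As a rough sanity check on latent shapes, the sketch below estimates T from a duration. The helper name `estimate_latent_frames` is hypothetical, and rounding up to a whole frame is an assumption: the encoder's actual padding may give a slightly different T.

```python
DACVAE_SAMPLE_RATE = 48_000  # from the README's latent format description
DACVAE_HOP = 1_920           # product of encoder_rates [2, 8, 10, 12]

def estimate_latent_frames(duration_s, sample_rate=DACVAE_SAMPLE_RATE, hop=DACVAE_HOP):
    """Estimate the number of latent frames T for audio of the given duration.

    Assumption: one frame per `hop` samples, rounded up to a whole frame.
    """
    n_samples = round(duration_s * sample_rate)
    return -(-n_samples // hop)  # integer ceiling division

frames_per_second = DACVAE_SAMPLE_RATE / DACVAE_HOP

# Durations taken from the example metadata record
print(frames_per_second)
print(estimate_latent_frames(13.36))  # target_duration
print(estimate_latent_frames(24.08))  # concat_duration
```
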
Provided by:

TTS-AGI
