TTS-AGI/voice-acting-pipeline-output
Source: Hugging Face. Updated 2026-03-31; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/TTS-AGI/voice-acting-pipeline-output
Resource description:
---
license: apache-2.0
task_categories:
- text-to-speech
- audio-classification
language:
- en
tags:
- emotion
- tts
- voice-conversion
- webdataset
- synthetic
- empathic-insight
- echo-tts
- chatterbox-vc
pretty_name: Voice Acting Pipeline Output
size_categories:
- 10K<n<100K
---
# Voice Acting Pipeline Output
A synthetic emotional speech dataset generated by the **Voice Acting Pipeline** -- an automated, multi-GPU data generation system that produces controlled emotional TTS training data with disentangled speaker identity and emotion prosody.
Each sample consists of **6 audio generations** (3 emotional + 3 neutral sentences) spoken by a consistent speaker, scored by [Empathic Insight Voice+](https://github.com/LAION-AI/emotion-annotations) across **59 perceptual dimensions** (55 emotion/attribute + 4 quality).
---
## Table of Contents
1. [Dataset Overview](#dataset-overview)
2. [How the Pipeline Works](#how-the-pipeline-works)
3. [Data Format](#data-format)
4. [Score Interpretation](#score-interpretation)
5. [Dimensions and Buckets](#dimensions-and-buckets)
6. [Installation and Replication](#installation-and-replication)
7. [GPU Requirements](#gpu-requirements)
8. [Pipeline Architecture](#pipeline-architecture)
9. [Configuration Reference](#configuration-reference)
10. [Troubleshooting](#troubleshooting)
---
## Dataset Overview
### What is this?
This dataset contains synthetic speech samples designed for training zero-shot voice and emotion cloning models. For each "sample", the pipeline:
1. Selects an **emotion reference** audio clip from a specific emotion dimension and intensity bucket
2. Optionally **voice-converts** it to a random LAION reference speaker (90% of the time)
3. Generates an **emotional sentence** and a **neutral/boring sentence** via LLM
4. Synthesizes **3 versions** of each sentence (different random seeds) using **Echo TTS** with the reference speaker's voice
5. Scores all 6 audio outputs with **Empathic Insight Voice+** (59 dimensions + caption)
### Data Sources
| Source | Description | Link |
|--------|-------------|------|
| Emotion References | DACVAE-encoded emotional audio snippets with EI scores, bucketed by dimension and intensity | [TTS-AGI/Emotion-Voice-Attribute-Reference-Snippets-DACVAE-Wave](https://huggingface.co/datasets/TTS-AGI/Emotion-Voice-Attribute-Reference-Snippets-DACVAE-Wave) |
| Reference Speakers | 3,000 clustered reference voice clips at 48kHz | [laion/clustered-reference-voices](https://huggingface.co/datasets/laion/clustered-reference-voices) |
| Emotion Scoring | Whisper encoder + 59 MLP experts | [laion/Empathic-Insight-Voice-Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus) |
| TTS Model | Echo TTS (diffusion-based, 44.1kHz output) | [Open Echo TTS](https://github.com/LAION-AI/open-echo-tts) |
| Voice Conversion | ChatterboxVC (24kHz output) | [Chatterbox](https://github.com/ResembleAI/chatterbox) |
| LLM (Sentence Gen) | Gemini 2.5 Flash Lite (or LFM 2.5 1.2B via VLLM) | Google Generative AI API |
| Topics | 471 diverse conversation topics | Included in `code/topics.json` |
### Scale
- **38 emotion/attribute dimensions** with variable bucket ranges
- **144 dimension-bucket combinations** in the source dataset
- **10 samples per bucket**, each with 6 audio generations: 144 buckets x 10 samples x 6 WAVs = **8,640 total WAVs**
- Each WAV scored across **59 perceptual dimensions** + captioned
---
## How the Pipeline Works
### High-Level Flow
```
For each of 38 dimensions (Anger, Amusement, Affection, Age, ...):
  For each intensity bucket (e.g., 0to1, 1to2, 2to3, ...):
    Download emotion reference audio clips from HuggingFace
    Repeat 10 times (10 "samples"):
      1. Pick a random emotion reference clip from this bucket
      2. Roll d10:
         - 10% chance: keep original speaker identity
         - 90% chance: voice-convert to a random LAION reference speaker
      3. Prepare speaker reference audio (resample to 44.1kHz, trim to 6-15s)
      4. Sample a random topic from 471 topics
      5. Generate EMOTIONAL sentence via LLM:
         - Random starting letter (A-Z)
         - Random word count (10-70 words)
         - Random punctuation profile (!, ?, ...)
         - Emotion/intensity matching the current bucket
      6. Generate NEUTRAL sentence via LLM:
         - Same topic, different starting letter
         - Boring, flat, no emotion
      7. Synthesize with Echo TTS (40 diffusion steps):
         - 3x emotional sentence with speaker reference = 3 WAVs
         - 3x neutral sentence with speaker reference = 3 WAVs
         (Each with a different random seed)
      8. Score all 6 WAVs with Empathic Insight Voice+:
         - 55 emotion/attribute scores per WAV
         - 4 quality scores per WAV
         - 1 caption per WAV (Whisper decoder)
      9. Save everything: audio files, metadata, scores, captions
    Package 10 samples (60 WAVs + metadata) into a WebDataset .tar
    Upload to HuggingFace
    Delete local files to save disk space
```
### Detailed Step-by-Step
#### Step 1: Emotion Reference Selection
The pipeline streams tar files from [TTS-AGI/Emotion-Voice-Attribute-Reference-Snippets-DACVAE-Wave](https://huggingface.co/datasets/TTS-AGI/Emotion-Voice-Attribute-Reference-Snippets-DACVAE-Wave). Each tar contains up to 100 audio samples ranked by speech quality, with pre-computed EI scores and DACVAE latent representations.
Each sample in the source tar has:
- `.json` -- metadata with all 55 EI scores
- `.target.npy` -- DACVAE latent (shape `(frames, 128)`, fp16)
- `.target.wav` -- decoded audio (48kHz)
The pipeline uses the `.target.wav` file when it is present; otherwise it decodes the `.target.npy` latent via DACVAE on CPU.
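The preference rule can be sketched as follows. This is a minimal illustration, assuming that the files of one source sample share a common key before the `.json` / `.target.npy` / `.target.wav` suffix; the helper names `group_members` and `pick_audio_member` are hypothetical and are not taken from `dataset_loader.py`. The demo tar is built in memory with placeholder payloads.

```python
import io
import tarfile
from collections import defaultdict

# Suffixes described above for each sample in a source tar.
SUFFIXES = (".target.wav", ".target.npy", ".json")


def group_members(names):
    """Group tar member names by sample key (the part before a known suffix)."""
    groups = defaultdict(dict)
    for name in names:
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                groups[name[: -len(suffix)]][suffix] = name
                break
    return dict(groups)


def pick_audio_member(files):
    """Prefer the decoded WAV; fall back to the DACVAE latent, which needs decoding."""
    if ".target.wav" in files:
        return files[".target.wav"], False  # (member name, needs_dacvae_decode)
    if ".target.npy" in files:
        return files[".target.npy"], True
    return None, False


def build_demo_tar():
    """Build a tiny in-memory tar with placeholder payloads (illustration only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name in ("a.json", "a.target.npy", "a.target.wav", "b.json", "b.target.npy"):
            info = tarfile.TarInfo(name)
            info.size = 1
            tf.addfile(info, io.BytesIO(b"x"))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=build_demo_tar()) as tf:
    groups = group_members(tf.getnames())

choices = {key: pick_audio_member(files) for key, files in groups.items()}
print(choices)
```

Returning a `needs_dacvae_decode` flag keeps the slower CPU decode path explicit, so a caller can log or time it separately.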
#### Step 2: Voice Conversion (90% of samples)
To disentangle speaker identity from emotional prosody, 90% of samples undergo voice conversion:
1. A random speaker is selected from 3,000 LAION clustered reference voices
2. The emotion reference audio is voice-converted to this speaker's identity using **ChatterboxVC**
3. The resulting audio preserves the emotional prosody but has a different speaker identity
This creates training pairs where the same emotion can appear with many different speakers.
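A minimal sketch of the 90/10 decision, assuming a plain uniform draw; the function name `choose_speaker` and the placeholder file names are illustrative, not the pipeline's actual code.

```python
import random


def choose_speaker(rng, laion_voices, vc_probability=0.9):
    """Return a LAION voice to convert to, or None to keep the original speaker."""
    if rng.random() < vc_probability:
        return rng.choice(laion_voices)
    return None


rng = random.Random(0)  # fixed seed only so the demo is repeatable
voices = [f"speaker_{i:04d}.mp3" for i in range(3000)]  # placeholder file names
picks = [choose_speaker(rng, voices) for _ in range(10_000)]
vc_rate = sum(p is not None for p in picks) / len(picks)
print(f"voice-converted fraction: {vc_rate:.3f}")
```

A return value of `None` would correspond to `"used_vc": false` in the sample metadata.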
#### Step 3: Sentence Generation
Two sentences are generated per sample using **Gemini 2.5 Flash Lite** (or alternatively LFM 2.5 1.2B via VLLM):
**Emotional sentence:**
- Must start with a randomly chosen capital letter
- Target word count between 10-70 (randomly sampled)
- Punctuation profile randomly sampled:
- Exclamation marks: 33% chance of 0, 33% of 1-2, 34% of 3+
- Question marks: same distribution
- Ellipsis ("..."): 50% yes, 50% no
- Must express the target emotion at the intensity corresponding to the bucket
**Neutral sentence:**
- Same topic, different starting letter
- Must be boring, emotionally flat, factual
- No exclamation marks, question marks, or ellipsis
Each sentence is validated (correct starting letter, word count within +/-40% of the target) and retried up to 3 times if validation fails.
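The constraint sampling and validation described above might look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, the upper bound of 5 for the "3+" case is an assumption (the text gives no cap), and words are counted by splitting on whitespace.

```python
import random
import string


def sample_count(rng):
    """Sample a punctuation count: 33% -> 0, 33% -> 1-2, 34% -> 3 or more."""
    roll = rng.random()
    if roll < 0.33:
        return 0
    if roll < 0.66:
        return rng.randint(1, 2)
    return rng.randint(3, 5)  # cap of 5 is an assumption; the text only says "3+"


def sample_constraints(rng):
    """Draw the random constraints for one emotional sentence."""
    return {
        "letter": rng.choice(string.ascii_uppercase),
        "word_count_target": rng.randint(10, 70),
        "exclamation_count": sample_count(rng),
        "question_count": sample_count(rng),
        "use_ellipsis": rng.random() < 0.5,
    }


def is_valid(text, letter, word_count_target, tolerance=0.40):
    """Check the starting letter and that the word count is within +/-40% of target."""
    words = text.split()
    if not words or not text.lstrip().startswith(letter):
        return False
    low = word_count_target * (1 - tolerance)
    high = word_count_target * (1 + tolerance)
    return low <= len(words) <= high


constraints = sample_constraints(random.Random(7))
print(constraints)
sentence = "Absolutely furious about the utter disregard for basic safety protocols!!!"
print(is_valid(sentence, "A", 12))
```

A retry loop would call the LLM again with the same constraints whenever `is_valid` returns `False`, stopping after the third retry.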
#### Step 4: Echo TTS Generation
Each sentence is synthesized 3 times with different random seeds using **Echo TTS**:
- 40 diffusion steps
- Speaker reference audio (the voice-converted or original emotion reference)
- Reference audio resampled to 44.1kHz and trimmed to 6-15 seconds
- Output: 44.1kHz WAV files
This produces 6 WAVs per sample: 3 emotional + 3 neutral.
#### Step 5: Empathic Insight Scoring
Each of the 6 WAVs is scored by **Empathic Insight Voice+**:
1. Audio loaded and resampled to 16kHz
2. Capped at 30 seconds
3. Processed through Whisper encoder (BUD-E-Whisper) to get embeddings `[1, 1500, 768]`
4. **55 emotion/attribute experts** score the full embeddings via FullEmbeddingMLP
5. **4 quality experts** score pooled embeddings (mean+min+max+std = 3072-dim) via PooledEmbeddingMLP
6. Whisper decoder generates a **caption** describing the audio content
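The 3072-dim input of the quality experts follows from concatenating four statistics over the 768 encoder features (4 x 768 = 3072). The pure-Python sketch below illustrates that pooling on a tiny stand-in array; it assumes pooling over the time axis and a population standard deviation, neither of which is confirmed by the text, and `pool_embeddings` is a hypothetical name.

```python
import math


def pool_embeddings(frames):
    """Concatenate per-feature mean, min, max and std over the time axis.

    `frames` is a list of T frames, each a list of D floats.
    Returns a flat list of length 4 * D.
    """
    n_frames = len(frames)
    columns = list(zip(*frames))  # D tuples, each holding T values
    means = [sum(col) / n_frames for col in columns]
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    # Population standard deviation; the real model may use the sample estimate.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n_frames)
        for col, m in zip(columns, means)
    ]
    return means + mins + maxs + stds


# Tiny stand-in for a [1500, 768] encoder output: 3 frames, 2 features.
demo = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
pooled = pool_embeddings(demo)
print(pooled)
print("pooled size for 768 features:", 4 * 768)
```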
#### Step 6: Packaging and Upload
Every 10 samples are packaged into a WebDataset `.tar` file and uploaded to this HuggingFace repository.
---
## Data Format
### Tar File Naming
```
{Dimension}_{BucketMin}to{BucketMax}_{RandomID}.tar
```
Examples:
- `Anger_3to4_7835859664.tar`
- `Affection_0to1_4829103756.tar`
- `Pain_5to6_1923847561.tar`
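Because dimension names can themselves contain underscores and dots (for example `Confident_vs._Hesitant`), a parser should anchor on the bucket and ID at the end of the name. The sketch below assumes the random ID is all digits, as in the examples above; `parse_tar_name` is a hypothetical helper and the second file name is made up for illustration.

```python
import re

# Dimension names may contain underscores and dots, so the bucket and the
# random ID are matched from the right-hand end of the file name.
TAR_NAME = re.compile(
    r"^(?P<dimension>.+)_(?P<lo>\d+)to(?P<hi>\d+)_(?P<rand_id>\d+)\.tar$"
)


def parse_tar_name(filename):
    """Split a tar file name into (dimension, bucket_min, bucket_max, random_id)."""
    match = TAR_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected tar name: {filename!r}")
    return (
        match["dimension"],
        int(match["lo"]),
        int(match["hi"]),
        match["rand_id"],
    )


for name in ("Anger_3to4_7835859664.tar", "Confident_vs._Hesitant_0to1_1234567890.tar"):
    print(parse_tar_name(name))
```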
### Contents of Each Tar
Each tar contains 10 samples. Per sample, the files are:
| File Pattern | Description |
|---|---|
| `{sample_id}.emotional_seed{SEED}.wav` | Emotional sentence audio (3 files, one per seed) |
| `{sample_id}.neutral_seed{SEED}.wav` | Neutral sentence audio (3 files, one per seed) |
| `{sample_id}.ref_audio.wav` | Speaker reference audio used for TTS |
| `{sample_id}.json` | Full metadata (see below) |
**Sample ID format:** `{Dimension}_{Bucket}_{Index}` (e.g., `Anger_3to4_005`)
### Metadata JSON Schema
Each `.json` file contains:
```json
{
  "sample_id": "Anger_3to4_005",
  "dimension": "Anger",
  "bucket": [3, 4],
  "bucket_str": "3to4",
  "voice_conversion": {
    "used_vc": true,
    "laion_voice": "speaker_0427.mp3",
    "vc_elapsed": 8.234
  },
  "source_ref": {
    "sample_id": "original_source_id",
    "metadata_keys": ["key1", "key2", "..."]
  },
  "emotional_sentence": {
    "text": "Absolutely furious about the utter disregard for basic safety protocols!!!",
    "topic": "workplace safety regulations",
    "letter": "A",
    "word_count_target": 12,
    "word_count_actual": 10,
    "punctuation_params": {
      "exclamation_count": 3,
      "question_count": 0,
      "use_ellipsis": false
    },
    "valid": true,
    "attempts": 1
  },
  "neutral_sentence": {
    "text": "Regulations exist to maintain consistent standards across different workplace environments.",
    "topic": "workplace safety regulations",
    "letter": "R",
    "word_count_target": 11,
    "word_count_actual": 10,
    "valid": true,
    "attempts": 1
  },
  "emotional_generations": [
    {
      "seed": 482910,
      "path": "/tmp/...",
      "duration": 4.82,
      "elapsed": 5.1,
      "ei_scores": {
        "Anger": 3.2841,
        "Amusement": 0.0123,
        "...": "... (59 dimensions total)"
      },
      "caption": "A woman speaks angrily about workplace issues",
      "ei_elapsed": 3.2,
      "chars_per_sec": 14.5
    },
    { "...": "seed 2" },
    { "...": "seed 3" }
  ],
  "neutral_generations": [
    { "...": "same structure as emotional, 3 entries" }
  ]
}
```
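One way to use this metadata is to compare how strongly the target dimension shows up in the emotional takes versus the neutral takes. The sketch below assumes that the `dimension` field matches a key in `ei_scores` (true for `Anger` in the example, but not verified for every dimension name); the record is a trimmed, made-up instance of the schema and `target_score_gap` is a hypothetical helper.

```python
import json
from statistics import mean


def target_score_gap(meta):
    """Mean target-dimension score of emotional takes minus that of neutral takes."""
    dim = meta["dimension"]
    emotional = [g["ei_scores"][dim] for g in meta["emotional_generations"]]
    neutral = [g["ei_scores"][dim] for g in meta["neutral_generations"]]
    return mean(emotional) - mean(neutral)


# Trimmed, made-up record that follows the schema above (values are illustrative).
record = json.loads("""
{
  "sample_id": "Anger_3to4_005",
  "dimension": "Anger",
  "bucket": [3, 4],
  "emotional_generations": [
    {"seed": 1, "ei_scores": {"Anger": 3.2}},
    {"seed": 2, "ei_scores": {"Anger": 3.0}},
    {"seed": 3, "ei_scores": {"Anger": 3.4}}
  ],
  "neutral_generations": [
    {"seed": 4, "ei_scores": {"Anger": 0.4}},
    {"seed": 5, "ei_scores": {"Anger": 0.2}},
    {"seed": 6, "ei_scores": {"Anger": 0.3}}
  ]
}
""")

gap = target_score_gap(record)
print(f"{record['sample_id']}: emotional - neutral = {gap:.2f}")
```

A large positive gap suggests the emotional sentence carried more of the target emotion than its neutral counterpart spoken with the same reference voice.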
### Audio Specifications
| Audio Type | Sample Rate | Format | Channels |
|---|---|---|---|
| Emotional/Neutral WAVs | 44,100 Hz | PCM 16-bit WAV | Mono |
| Speaker Reference | 44,100 Hz | PCM 16-bit WAV | Mono |

Duration range: ~2-15 seconds.
---
## Score Interpretation
### Empathic Insight Voice+ Dimensions
The model outputs scores across **59 dimensions**: 55 emotion/attribute scores + 4 quality scores.
### Emotional Categories (40 dimensions)
Scores are continuous **intensity estimates** on the original annotation scale: 0 (not present) to 4 (extremely present). They are not probabilities bounded by 1, and model outputs can exceed 4 for very intense expressions.
| Score Range | Interpretation |
|---|---|
| 0.0 - 1.0 | Not present or barely detectable |
| 1.0 - 2.0 | Slightly present |
| 2.0 - 3.0 | Moderately present |
| 3.0 - 4.0 | Strongly to extremely present |
| > 4.0 | Extremely/intensely present (beyond training scale) |
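The table can be turned into a small lookup. The boundary handling here is an assumption: the table's ranges share their endpoints, so this sketch treats each band as half-open and assigns exactly 4.0 to the "3.0 - 4.0" band. `interpret_emotion_score` is a hypothetical helper.

```python
def interpret_emotion_score(score):
    """Map an emotion score to the interpretation bands in the table above."""
    if score < 1.0:
        return "Not present or barely detectable"
    if score < 2.0:
        return "Slightly present"
    if score < 3.0:
        return "Moderately present"
    if score <= 4.0:
        return "Strongly to extremely present"
    return "Extremely/intensely present (beyond training scale)"


for value in (0.0123, 1.5, 3.2841, 4.6):
    print(f"{value:>7.4f} -> {interpret_emotion_score(value)}")
```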
**Full list of 40 emotional dimensions:**
| # | Dimension | # | Dimension |
|---|---|---|---|
| 1 | Amusement | 21 | Doubt |
| 2 | Elation | 22 | Fear |
| 3 | Pleasure/Ecstasy | 23 | Distress |
| 4 | Contentment | 24 | Confusion |
| 5 | Thankfulness/Gratitude | 25 | Embarrassment |
| 6 | Affection | 26 | Shame |
| 7 | Infatuation | 27 | Disappointment |
| 8 | Hope/Enthusiasm/Optimism | 28 | Sadness |
| 9 | Triumph | 29 | Bitterness |
| 10 | Pride | 30 | Contempt |
| 11 | Interest | 31 | Disgust |
| 12 | Awe | 32 | Anger |
| 13 | Astonishment/Surprise | 33 | Malevolence/Malice |
| 14 | Concentration | 34 | Sourness |
| 15 | Contemplation | 35 | Pain |
| 16 | Relief | 36 | Helplessness |
| 17 | Longing | 37 | Fatigue/Exhaustion |
| 18 | Teasing | 38 | Emotional Numbness |
| 19 | Impatience & Irritability | 39 | Intoxication/Altered States |
| 20 | Sexual Lust | 40 | Jealousy & Envy |
### Attribute Dimensions (15 dimensions)
These have **varying scales** depending on the dimension:
| Dimension | Scale | Neutral | Interpretation |
|---|---|---|---|
| **Valence** | -3 to +3 | 0 | -3 = extremely negative, +3 = extremely positive |
| **Arousal** | 0 to 4 | 2 | 0 = very calm, 4 = very excited |
| **Submissive vs. Dominant** | -3 to +3 | 0 | -3 = very submissive, +3 = very dominant |
| **Age** | 0 to 6 | -- | 0 = infant, 2 = teenager, 4 = adult, 6 = very old |
| **Gender** | -2 to +2 | 0 | -2 = very masculine, +2 = very feminine |
| **Serious vs. Humorous** | 0 to 4 | 2 | 0 = very serious, 4 = very humorous |
| **Vulnerable vs. Emotionally Detached** | 0 to 4 | 2 | 0 = very vulnerable, 4 = very detached |
| **Confident vs. Hesitant** | 0 to 4 | 2 | 0 = very confident, 4 = very hesitant |
| **Warm vs. Cold** | -2 to +2 | 0 | -2 = very cold, +2 = very warm |
| **Monotone vs. Expressive** | 0 to 4 | 2 | 0 = very monotone, 4 = very expressive |
| **High-Pitched vs. Low-Pitched** | 0 to 4 | 2 | 0 = very high, 4 = very low |
| **Soft vs. Harsh** | -2 to +2 | 0 | -2 = very harsh, +2 = very soft |
| **Authenticity** | 0 to 4 | 2 | 0 = very artificial, 4 = very genuine |
| **Recording Quality** | 0 to 4 | 2 | 0 = very low quality, 4 = excellent |
| **Background Noise** | 0 to 3 | 0 | 0 = clean, 3 = intense noise |
### Quality Scores (4 dimensions)
These are separate from the 55 emotion/attribute scores:
| Dimension | Description |
|---|---|
| `score_overall_quality` | Overall audio quality rating |
| `score_speech_quality` | Speech clarity and naturalness |
| `score_background_quality` | Background audio quality |
| `score_content_enjoyment` | How enjoyable the content is to listen to |
---
## Dimensions and Buckets
### Available Dimension-Bucket Combinations
The dataset covers **38 dimensions** with **144 total buckets**. Each bucket represents an intensity range `[min, max)`.
<details>
<summary>Click to expand full list of 144 buckets</summary>
| Dimension | Available Buckets | Total |
|---|---|---|
| Affection | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Age | 0to1, 1to2, 2to3, 3to4, 4to5, 5to6 | 6 |
| Amusement | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Anger | 0to1, 1to2, 2to3, 3to4, 4to5, 5to6 | 6 |
| Arousal | 0to1, 1to2 | 2 |
| Astonishment_Surprise | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Authenticity | 1to2, 2to3, 3to4, 4to5 | 4 |
| Awe | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Background_Noise | 0to1, 1to2, 2to3 | 3 |
| Bitterness | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Concentration | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Confident_vs._Hesitant | 0to1 | 1 |
| Contemplation | 0to1, 1to2, 2to3, 3to4 | 4 |
| Contempt | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Contentment | 0to1, 1to2, 2to3, 3to4 | 4 |
| Disappointment | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Disgust | 0to1, 1to2, 2to3, 3to4 | 4 |
| Distress | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Embarrassment | 0to1, 1to2, 2to3 | 3 |
| Emotional_Numbness | 0to1, 1to2, 2to3, 3to4 | 4 |
| Fatigue_Exhaustion | 1to2, 2to3, 3to4, 4to5 | 4 |
| Fear | 0to1, 1to2, 2to3, 3to4 | 4 |
| Helplessness | 0to1, 1to2, 2to3, 3to4 | 4 |
| Impatience_and_Irritability | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Infatuation | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Interest | 0to1, 1to2, 2to3, 3to4 | 4 |
| Intoxication_Altered_States | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Jealousy_and_Envy | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Monotone_vs._Expressive | 0to1 | 1 |
| Pain | 0to1, 1to2, 2to3, 3to4, 4to5, 5to6 | 6 |
| Pleasure_Ecstasy | 0to1, 1to2, 2to3, 3to4 | 4 |
| Pride | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| Relief | 0to1, 1to2, 2to3, 3to4, 4to5, 5to6 | 6 |
| Shame | 0to1, 1to2, 2to3, 3to4, 4to5, 5to6 | 6 |
| Soft_vs._Harsh | 0to1, 1to2 | 2 |
| Sourness | 0to1, 1to2, 2to3, 3to4 | 4 |
| Teasing | 0to1, 1to2, 2to3, 3to4 | 4 |
| Vulnerable_vs._Emotionally_Detached | 0to1, 1to2, 2to3, 3to4, 4to5 | 5 |
| **Total** | | **144** |
</details>
### Bucket Interpretation
A bucket like `Anger_3to4` means:
- **Dimension:** Anger (one of 40 emotion categories)
- **Intensity range:** Score between 3.0 and 4.0
- **Meaning:** "Strongly to extremely present anger" (on the 0-4 annotation scale)
The emotion reference audio clips used as TTS conditioning come from this bucket, so the generated speech should exhibit this level of the target emotion.
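A short sketch of the half-open `[min, max)` bucket test, useful for checking whether a generated WAV's score actually lands in the bucket it was conditioned on. The helper names are hypothetical.

```python
def parse_bucket(bucket_str):
    """Turn a bucket string such as '3to4' into integer bounds (3, 4)."""
    low, high = bucket_str.split("to")
    return int(low), int(high)


def in_bucket(score, bucket_str):
    """Half-open membership test: min <= score < max."""
    low, high = parse_bucket(bucket_str)
    return low <= score < high


print(parse_bucket("3to4"))
print(in_bucket(3.2841, "3to4"), in_bucket(4.0, "3to4"))
```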
---
## Installation and Replication
### Prerequisites
- **Hardware:** 4+ NVIDIA GPUs with 20+ GB VRAM each (tested on A100-80GB)
- **OS:** Linux (tested on Debian 13)
- **Python:** 3.10+ (tested on 3.13)
- **Additional:** `spiritvenv` virtualenv with ChatterboxVC (Python 3.13)
### Step 1: Clone and Install Dependencies
```bash
# Clone the pipeline code (or download the code/ folder from this repo)
git clone https://huggingface.co/datasets/TTS-AGI/voice-acting-pipeline-output
cd voice-acting-pipeline-output/code
# Install Python dependencies
pip install -r requirements.txt
# Install fast-dacvae (for decoding emotion reference latents)
pip install git+https://github.com/kadirnar/fast-dacvae.git
# Install VLLM (optional, only if using local LLM instead of Gemini API)
pip install vllm openai
```
### Step 2: Install ChatterboxVC
ChatterboxVC requires a separate Python environment (spiritvenv):
```bash
# Create virtualenv
python3 -m venv /path/to/spiritvenv
source /path/to/spiritvenv/bin/activate
# Install chatterbox
pip install chatterbox-tts
# Update SPIRITVENV_PYTHON path in config.py
deactivate
```
### Step 3: Install Echo TTS
```bash
# Clone Open Echo TTS
git clone https://github.com/LAION-AI/open-echo-tts.git
cd open-echo-tts
pip install -e .
# Update ECHO_TTS_SRC path in config.py
```
### Step 4: Download DACVAE Weights
```bash
python -c "
from dacvae import DACVAE
model = DACVAE.from_pretrained() # downloads to ~/.cache/huggingface/hub/
print('DACVAE ready')
"
# Update DACVAE_WEIGHTS path in config.py if needed
```
### Step 5: Configure API Keys
```bash
# Set your Gemini API key (for sentence generation)
export GEMINI_API_KEY="your-api-key-here"
# Or edit sentence_generator.py to use your key
# Or switch to VLLM backend by setting LLM_BACKEND = "vllm" in sentence_generator.py
```
### Step 6: Configure GPU Allocation
Edit `config.py`:
```python
GPUS = [0, 1, 2] # Your available GPU indices
```
Or use `run_pipeline.py` which has a pre-configured split-GPU setup. Edit the `WORKER_CONFIGS` at the top of `run_pipeline.py`:
```python
WORKER_CONFIGS = [
    {"name": "A", "echo_gpu": 0, "echo_port": 9200, "ei_gpu": 1, "ei_port": 9401},
    {"name": "B", "echo_gpu": 2, "echo_port": 9202, "ei_gpu": 3, "ei_port": 9403},
]
VC_GPU = 4
VC_PORT = 9304
```
### Step 7: Download Reference Voices
```bash
cd code/
python dataset_loader.py --download-refs
# Downloads 3,000 LAION reference voices (~1GB)
```
### Step 8: Run Smoke Test
```bash
# Start servers manually first
python servers/echo_tts_server.py --gpu 0 --port 9200 &
python servers/vc_server.py --gpu 1 --port 9301 &
python servers/ei_server.py --gpu 2 --port 9402 &
# Wait for servers to be healthy
sleep 10
# Run smoke test (2 samples, generates HTML report)
python test_pipeline.py --gpu 0 --samples 2 --dimension Anger --bucket 3to4 \
--echo-port 9200 --vc-port 9301 --ei-port 9402
```
This generates `test_report_Anger_3to4.html` with embedded audio players for quality inspection.
### Step 9: Run Full Pipeline
```bash
# Option A: Use the optimized launcher (recommended)
python run_pipeline.py
# Option B: Use the master orchestrator
python master.py --gpus 0,1,2,3
# Option C: Run a single worker manually
python worker.py --gpu 0 --echo-port 9200 --vc-port 9301 --ei-port 9402
# Useful flags:
# --no-upload Skip HuggingFace upload
# --dimension Anger Only process "Anger" buckets
```
### Step 10: Monitor Progress
```bash
# Watch worker logs
tail -f logs/worker_A.log
tail -f logs/worker_B.log
# Count completed buckets
ls progress/*.done | wc -l
# Check HuggingFace uploads
python -c "
from huggingface_hub import HfApi
api = HfApi()
files = list(api.list_repo_tree('TTS-AGI/voice-acting-pipeline-output', repo_type='dataset', path_in_repo='data'))
print(f'{len(files)} tars uploaded')
total_mb = sum(f.size for f in files if hasattr(f, 'size')) / 1024 / 1024
print(f'{total_mb:.0f} MB total')
"
```
---
## GPU Requirements
### Memory Estimates Per Service
| Service | VRAM Required | Notes |
|---|---|---|
| Echo TTS | ~14-17 GB | Diffusion model + autoencoder + PCA |
| ChatterboxVC | ~13 GB | Loaded lazily on first VC request |
| Empathic Insight | ~10-13 GB | Whisper encoder + 59 MLP experts |
| VLLM (LFM 2.5 1.2B) | ~5 GB | Optional, only if not using Gemini API |
### Recommended GPU Allocation
**Minimum (2 GPUs, ~40GB each):**
```
GPU 0: Echo TTS + VLLM
GPU 1: EI + VC (lazy)
```
**Optimal (5 GPUs, 20GB+ each):**
```
GPU 0: Echo TTS (Worker A)
GPU 1: EI (Worker A)
GPU 2: Echo TTS (Worker B)
GPU 3: EI (Worker B)
GPU 4: VC (shared between workers)
```
This dual-worker setup achieves **~2x throughput** compared to a single worker.
### Performance
| Metric | Value |
|---|---|
| Time per sample (warm) | ~30-35 seconds |
| Samples per bucket | 10 |
| Time per bucket | ~5-6 minutes |
| Total buckets | 144 |
| Total time (1 worker) | ~14 hours |
| Total time (2 workers) | ~7 hours |
| WAVs generated per hour (1 worker) | ~600-700 |
---
## Pipeline Architecture
### Code Structure
```
code/
├── config.py # Central configuration, dimensions, ports, paths
├── dataset_loader.py # HF dataset streaming, DACVAE decode, audio I/O
├── sentence_generator.py # LLM prompt engineering, validation, Gemini/VLLM
├── worker.py # Per-worker generation loop (optimized, pipelined)
├── uploader.py # WebDataset tar packaging + HF upload
├── run_pipeline.py # Optimized multi-worker launcher
├── master.py # Full orchestrator (starts servers + workers)
├── worker_runner.py # Queue file processor for master.py
├── test_pipeline.py # Smoke test with HTML report generation
├── topics.json # 471 conversation topics
├── requirements.txt # Python dependencies
├── install.sh # Setup script
└── servers/
    ├── echo_tts_server.py # FastAPI: Echo TTS generation
    ├── vc_server.py # FastAPI: ChatterboxVC voice conversion
    ├── ei_server.py # FastAPI: Empathic Insight scoring + captioning
    └── vllm_server.py # VLLM launch wrapper for LFM 2.5
```
### Server API Reference
#### Echo TTS Server (`POST /generate`)
| Parameter | Type | Description |
|---|---|---|
| `text` | string | Text to synthesize |
| `ref_audio_path` | string | Path to reference speaker audio (44.1kHz WAV) |
| `seed` | int | Random seed for reproducibility |
| `num_steps` | int | Diffusion steps (default: 40) |
**Response:** `{ "status": "ok", "output_path": "...", "duration": 4.82, "elapsed": 5.1 }`
#### VC Server (`POST /convert`)
| Parameter | Type | Description |
|---|---|---|
| `source_path` | string | Path to source audio to convert |
| `target_path` | string | Path to target speaker audio (identity to clone) |
**Response:** `{ "status": "ok", "output_path": "...", "sample_rate": 24000, "elapsed": 8.2 }`
#### EI Server (`POST /score`)
| Parameter | Type | Description |
|---|---|---|
| `audio_path` | string | Path to audio file to score |
**Response:** `{ "status": "ok", "scores": { "Anger": 3.28, ... }, "caption": "...", "elapsed": 3.1 }`
#### Health Check (all servers: `GET /health`)
**Response:** `{ "status": "ok", "model_loaded": true, "device": "cuda:0" }`
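Since models load lazily, a fixed `sleep` may not be enough before sending work to the servers. The sketch below polls `/health` until every server reports `"status": "ok"`. It is a minimal illustration using only the standard library: `fetch_health` and `wait_until_healthy` are hypothetical helpers, and the demo uses a stand-in fetch function so it runs without live servers.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(url, timeout=2.0):
    """GET a /health endpoint and return the decoded JSON body, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        return None


def wait_until_healthy(urls, fetch=fetch_health, attempts=30, delay=2.0):
    """Poll every URL until each reports status 'ok'; return the URLs still failing."""
    pending = list(urls)
    for _ in range(attempts):
        pending = [u for u in pending if (fetch(u) or {}).get("status") != "ok"]
        if not pending:
            break
        time.sleep(delay)
    return pending


# Demo with a stand-in fetch: the fake server becomes healthy on the third poll.
calls = {"n": 0}


def fake_fetch(url):
    calls["n"] += 1
    if calls["n"] > 2:
        return {"status": "ok", "model_loaded": True, "device": "cuda:0"}
    return None


still_down = wait_until_healthy(
    ["http://localhost:9200/health"], fetch=fake_fetch, delay=0.0
)
print("still down:", still_down)
```

Whether to also require `model_loaded` to be `true` before starting work is a design choice; with lazy loading a server may report `"ok"` before its model is in memory.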
### Port Scheme
```
VLLM: port 9100 (shared, optional)
Echo TTS GPU N: port 9200 + N
VC GPU N: port 9300 + N
EI GPU N: port 9400 + N
```
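The scheme is simple enough to compute rather than hard-code. This sketch re-derives the split-GPU layout from the base ports; the dictionary keys and the `service_port` helper are illustrative names, not identifiers from `config.py`.

```python
# Base ports taken from the scheme above; the service keys are illustrative names.
BASE_PORTS = {"echo": 9200, "vc": 9300, "ei": 9400}
VLLM_PORT = 9100  # shared and optional


def service_port(service, gpu_index):
    """Return the port for a service running on the given GPU index."""
    return BASE_PORTS[service] + gpu_index


# The split-GPU layout (workers A and B plus a shared VC server), re-derived.
worker_a = {"echo_port": service_port("echo", 0), "ei_port": service_port("ei", 1)}
worker_b = {"echo_port": service_port("echo", 2), "ei_port": service_port("ei", 3)}
vc_port = service_port("vc", 4)
print(worker_a, worker_b, vc_port)
```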
### Speed Optimizations
The pipeline includes several optimizations for throughput:
1. **Dual workers on separate GPU pairs** -- 2x throughput
2. **Pipelined TTS + EI** -- EI scoring (GPU X) starts immediately as each TTS output (GPU Y) completes, overlapping computation on different GPUs
3. **Concurrent sentence generation** -- Both emotional and neutral sentences generated in parallel via ThreadPoolExecutor
4. **Lazy model loading** -- Servers start instantly, models load on first request
5. **Progress tracking** -- `.done` files prevent re-processing on restart
6. **Streaming dataset access** -- Emotion references streamed from HF, not bulk-downloaded
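The pipelining idea in item 2 can be illustrated with a thread pool: TTS calls run one after another, and each finished WAV is handed to a scoring thread immediately so that scoring overlaps the next synthesis. This is a sketch with stand-in functions, not the code in `worker.py`; `synthesize`, `score`, and the output paths are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def synthesize(text, seed):
    """Stand-in for a call to the Echo TTS server; returns a fake output path."""
    time.sleep(0.01)  # pretend the TTS GPU is busy
    return f"/tmp/demo_{seed}.wav"  # hypothetical path


def score(path):
    """Stand-in for a call to the EI server; returns a fake result dict."""
    time.sleep(0.01)  # pretend the EI GPU is busy
    return {"path": path, "scores": {}}


def generate_and_score(jobs):
    """Run TTS sequentially and hand each finished WAV to a scoring thread at once."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = []
        for text, seed in jobs:
            wav_path = synthesize(text, seed)  # TTS on one GPU
            futures.append(pool.submit(score, wav_path))  # overlaps the next TTS call
        return [f.result() for f in futures]


jobs = [("emotional text", s) for s in (1, 2, 3)] + [("neutral text", s) for s in (4, 5, 6)]
results = generate_and_score(jobs)
print(len(results), "scored outputs")
```

Collecting the futures in submission order keeps the results aligned with the seeds, which simplifies writing them back into the metadata.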
---
## Configuration Reference
### Key Constants (`config.py`)
| Constant | Value | Description |
|---|---|---|
| `SAMPLES_PER_BUCKET` | 10 | Number of samples to generate per bucket |
| `SEEDS_PER_SAMPLE` | 3 | Number of random seeds per sentence |
| `ECHO_TTS_STEPS` | 40 | Diffusion steps for TTS generation |
| `WORD_COUNT_MIN` | 10 | Minimum target word count |
| `WORD_COUNT_MAX` | 70 | Maximum target word count |
| `SPEAKER_REF_MIN_DURATION` | 6.0s | Minimum reference audio length |
| `SPEAKER_REF_MAX_DURATION` | 15.0s | Maximum reference audio length |
### Sample Rates
| Component | Sample Rate |
|---|---|
| Echo TTS output | 44,100 Hz |
| ChatterboxVC output | 24,000 Hz |
| DACVAE decode | 48,000 Hz |
| Empathic Insight input | 16,000 Hz (resampled internally) |
| Final output WAVs | 44,100 Hz |
### Environment Fixes
These fixes are applied automatically in all server scripts:
```python
import os

# Fix cuDNN library path conflict (wrong conda env)
if "ml-general" in os.environ.get("LD_LIBRARY_PATH", ""):
    os.environ["LD_LIBRARY_PATH"] = ""
# Disable torch dynamo (tensordict import assertion errors); set before importing torch
os.environ["TORCHDYNAMO_DISABLE"] = "1"

import torch

# Disable cuDNN (version mismatch)
torch.backends.cudnn.enabled = False
```
If you encounter `OSError` or `AssertionError` related to cuDNN or torch._dynamo, ensure these fixes are applied before any torch imports.
---
## Troubleshooting
### Common Issues
**1. "No LAION reference voices found"**
```bash
python dataset_loader.py --download-refs
```
**2. Echo TTS server crashes with `AssertionError`**
Ensure `TORCHDYNAMO_DISABLE=1` is set before torch import. This is already handled in the server scripts.
**3. "Address already in use" when starting servers**
```bash
# Find and kill existing server on the port
lsof -i :9201 | grep LISTEN
kill <PID>
```
**4. HuggingFace upload 404**
Create the repo first:
```python
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id="TTS-AGI/voice-acting-pipeline-output", repo_type="dataset", exist_ok=True)
```
**5. DACVAE decode is slow**
DACVAE runs on CPU due to cuDNN version mismatch. This is expected (~1-2s per decode). GPU acceleration requires matching cuDNN versions.
**6. torchcodec / FFmpeg errors**
Use `soundfile` (sf.read/sf.write) instead of `torchaudio.save/load`. All pipeline code already uses soundfile.
**7. ChatterboxVC subprocess dies**
The VC server automatically restarts its subprocess. Check the logs:
```bash
tail -f logs/vc_gpu2.log
```
### Verifying a Generated Tar
```python
import tarfile, json
with tarfile.open("data/Anger_3to4_1234567890.tar") as tf:
    for m in tf.getmembers():
        print(f" {m.name} ({m.size/1024:.1f} KB)")
    # Read a metadata JSON
    for m in tf.getmembers():
        if m.name.endswith(".json"):
            data = json.loads(tf.extractfile(m).read())
            print(f"\nSample: {data['sample_id']}")
            print(f"Emotion: {data['emotional_sentence']['text']}")
            print(f"Neutral: {data['neutral_sentence']['text']}")
            print(f"Scores: {list(data['emotional_generations'][0]['ei_scores'].keys())[:5]}...")
            break
```
---
## License
Apache 2.0
## Citation
If you use this dataset or pipeline, please cite:
```bibtex
@misc{voice-acting-pipeline-2025,
title={Voice Acting Pipeline: Automated Emotional Speech Dataset Generation},
author={LAION and TTS-AGI},
year={2025},
url={https://huggingface.co/datasets/TTS-AGI/voice-acting-pipeline-output}
}
```
## Acknowledgments
- **Echo TTS** by LAION / Jordan Meyer
- **ChatterboxVC** by Resemble AI
- **Empathic Insight Voice+** by LAION
- **BUD-E-Whisper** by LAION
- **Gemini API** by Google
- **DACVAE** by Meta (Facebook Research)
- **Clustered Reference Voices** by LAION
Provider:
TTS-AGI
Collected summary
Dataset introduction

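Construction method
In speech synthesis and affective computing, building high-quality emotional speech datasets is essential for advancing research on zero-shot voice cloning and emotion modeling. The Voice Acting Pipeline Output dataset is built by an automated, multi-GPU data generation system. The system first selects samples from a pre-annotated library of emotion reference audio according to 38 emotion dimensions and 144 intensity buckets. For 90% of the samples, voice conversion then transfers the clip to a randomly selected speaker, decoupling emotional prosody from speaker identity to increase data diversity. For each sample, a large language model generates one emotional sentence and one neutral sentence on a random topic, and the Echo TTS diffusion model synthesizes each sentence three times with different random seeds, yielding six audio clips per sample. All generated audio is scored by the Empathic Insight Voice+ model across 59 perceptual dimensions and accompanied by a text caption, then packaged in WebDataset format and uploaded.
Usage
The dataset is intended mainly for training advanced zero-shot voice and emotion cloning models. Researchers can download the WebDataset packages, organized as .tar files, directly from the HuggingFace platform. Each package corresponds to a specific emotion dimension and intensity bucket and contains audio files plus structured JSON metadata. The metadata records the voice conversion details, sentence generation parameters, diffusion model seeds, and the full set of emotion scores. In use, developers can filter for the emotion categories they need by dimension label, use the paired emotional and neutral audio for contrastive learning, or train emotion-aware models on the fine-grained scores. The construction code and configuration are open source, so researchers with a multi-GPU environment can reproduce or extend the generation pipeline to suit specific research needs.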
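Background and challenges
Background overview
In speech synthesis and affective computing, high-quality, controllable emotional speech data has long been scarce, which has held back zero-shot voice cloning and emotion modeling. To address this bottleneck, LAION-AI and other research groups recently built the Voice Acting Pipeline Output dataset. Using an automated multi-GPU generation system, it systematically produces synthetic speech samples in which speaker identity and emotional prosody are disentangled. Each sample contains six audio generations (three emotional-sentence clips and three neutral-sentence clips) scored by the Empathic Insight Voice+ model across 59 perceptual dimensions. Its core research goal is to provide large-scale, finely annotated training data for controllable emotional text-to-speech and zero-shot voice cloning models, pushing the limits of expressiveness and naturalness in synthetic speech.
Current challenges
The dataset targets two core challenges in emotional speech synthesis: first, how to model and control multiple discrete emotion dimensions and their continuous intensity levels precisely; second, how to disentangle speaker identity features from emotional prosody representations during generation so as to support zero-shot, cross-speaker emotional voice cloning. Construction poses challenges as well. First, the automated pipeline must coordinate several heterogeneous models (such as Echo TTS, ChatterboxVC, a large language model, and the emotion scoring model), which places heavy demands on system stability and compute scheduling. Second, it is difficult to ensure that generated sentences are diverse in content, grammatically correct, and strictly aligned with the target emotion intensity. Finally, perceptual quality evaluation and a standardized annotation process for large-scale synthetic data are key problems for ensuring the dataset's credibility and usefulness.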
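Common scenarios
Classic use cases
In emotional speech synthesis, the Voice Acting Pipeline Output dataset provides a standardized experimental platform for training zero-shot voice and emotion cloning models. Its automated process generates synthetic speech samples covering 38 emotion dimensions, each with graded emotion intensity and multi-dimensional perceptual scores, so researchers can systematically explore the mapping between speech features and emotional expression. Its classic application is building controllable emotional speech generation systems: by disentangling speaker identity from emotional prosody, it provides high-quality data for research on cross-speaker emotion transfer.
Academic problems addressed
The dataset addresses the core difficulties of data scarcity and high annotation cost in emotional speech synthesis research. Traditional emotional speech datasets are often limited in the number of speakers and emotion categories, whereas this dataset uses an automated generation system to build a large annotated corpus covering 144 emotion-intensity combinations. It offers a standardized benchmark for emotion perception modeling, cross-modal emotion transfer, and evaluating the controllability of speech synthesis systems, moving speech synthesis toward greater expressiveness and generalization.
Practical applications
In practice, the dataset supports intelligent voice assistants, voice acting for virtual characters, and emotional human-computer interaction systems. Models trained on it can generate synthetic speech with rich emotional color, markedly improving the naturalness and appeal of voice interaction. In digital entertainment, the technology can automate voice acting for game characters; in mental health support, it can give companion robots more human-like vocal expression, broadening where speech synthesis can be deployed.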
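Recent research on the dataset
Latest research directions
In speech synthesis and affective computing, the Voice Acting Pipeline Output dataset is driving frontier work on zero-shot voice cloning and emotion modeling. It generates synthetic speech with speaker identity and emotional prosody disentangled through an automated process, providing a key training resource for controllable, highly expressive emotional speech synthesis systems. Current research focuses on using its finely annotated 59-dimensional perceptual features to develop generative models that accurately capture and transfer complex emotional states, and on exploring how emotional attributes generalize across speakers. This direction is closely tied to applications such as interactive AI, virtual digital humans, and emotional human-computer interaction; it aims to improve the naturalness and emotional expressiveness of synthetic speech and is valuable for personalized voice assistants and emotional companion robots.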
The content above was collected and summarized by 遇见数据集.



