TTS-AGI/emotion-attribute-conditioning-dacvae
Source: Hugging Face. Updated 2026-03-15; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/TTS-AGI/emotion-attribute-conditioning-dacvae
# Echo TTS - Emotion & Attribute Conditioning Dataset (DAC-VAE Latents)
Pre-bucketed speech dataset with **DAC-VAE latent representations** organized by 40 emotion categories and 13 vocal/audio attributes. Built for conditioning fine-tuning of [Echo TTS](https://huggingface.co/TTS-AGI/echo-800m-v1-ckpts) and similar DiT-based TTS models.
## Overview
- **Total emotion samples**: 163,271 (across 40 emotions, 10K cap per emotion)
- **Total attribute samples**: ~785K (across 13 attributes x 7 buckets, 10K cap per bucket)
- **Format**: WebDataset `.tar` files (312 total)
- **Latent format**: DAC-VAE 128-dim at 25 fps (variable length)
- **Source**: Filtered from 71.2M samples across 3 datasets
### Source Datasets
1. [TTS-AGI/podcast-tokenized-bg3.5-enj5](https://huggingface.co/datasets/TTS-AGI/podcast-tokenized-bg3.5-enj5) (481 tars)
2. [TTS-AGI/podcast-tokenized-bg2.5-enj4.5](https://huggingface.co/datasets/TTS-AGI/podcast-tokenized-bg2.5-enj4.5) (6,496 tars)
3. [TTS-AGI/emolia-hq-tokenized](https://huggingface.co/datasets/TTS-AGI/emolia-hq-tokenized) (10,491 tars)
### Annotation Source
All emotion and attribute scores come from [LAION Empathic-Insight-Voice-Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus), obtained via [emotion-annotations](https://github.com/LAION-AI/emotion-annotations).
## Data Format
Each sample in the tar files contains 3 files:
| File | Shape | Description |
|------|-------|-------------|
| `{key}.npy` | `(frames, 128)` float32 | DAC-VAE latent (target speech, 25 fps) |
| `{key}.ref.npy` | `(ref_frames, 128)` float32 | DAC-VAE latent (speaker reference) |
| `{key}.json` | - | Metadata with text, scores, speaker info |
### JSON Metadata Fields
| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Transcript text |
| `annotation_scores` | dict (59 keys) | All emotion + attribute scores |
| `caption` | string | Natural language description of the audio |
| `target_duration` | float | Duration in seconds |
| `context_duration` | float | Speaker reference duration |
| `latent_frames` | int | Number of latent frames in .npy |
| `ref_latent_frames` | int | Number of frames in ref.npy |
| `latent_dim` | int | Always 128 |
| `speaker` | string | Speaker ID |
| `language` | string | Language code |
| `episode_id` | int | Source episode identifier |
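The frame counts and durations in the metadata can be cross-checked against the 25 fps latent rate. The following is a minimal sketch, assuming `target_duration` is approximately `latent_frames / 25` (the card does not state this relationship explicitly, and rounding or padding may differ). The helper names `frames_to_seconds` and `duration_matches` and the 0.1 s tolerance are hypothetical.

```python
LATENT_FPS = 25  # latent frame rate stated in this card


def frames_to_seconds(n_frames: int, fps: int = LATENT_FPS) -> float:
    """Convert a latent frame count to seconds."""
    return n_frames / fps


def duration_matches(metadata: dict, tolerance_s: float = 0.1) -> bool:
    """Check that latent_frames agrees with target_duration within a tolerance.

    Assumption: target_duration ~= latent_frames / 25; the tolerance is arbitrary.
    """
    expected = frames_to_seconds(metadata["latent_frames"])
    return abs(expected - metadata["target_duration"]) <= tolerance_s


# Made-up metadata for illustration only (not taken from the dataset).
example = {"latent_frames": 250, "target_duration": 10.0}
print(frames_to_seconds(example["latent_frames"]))  # 10.0
print(duration_matches(example))                    # True
```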
### Loading Example
```python
import json

import numpy as np
import webdataset as wds

# Stream samples from a single emotion shard.
dataset = wds.WebDataset("emotions/Amusement/000000.tar")

for sample in dataset:
    key = sample["__key__"]
    # Each payload arrives as raw bytes; decode to float32 and restore the (frames, 128) shape.
    # If the files carry a standard .npy header, use np.load(io.BytesIO(...)) instead.
    latent = np.frombuffer(sample["npy"], dtype=np.float32).reshape(-1, 128)
    ref_latent = np.frombuffer(sample["ref.npy"], dtype=np.float32).reshape(-1, 128)
    metadata = json.loads(sample["json"])
    text = metadata["text"]
    scores = metadata["annotation_scores"]
    print(f"{key}: {latent.shape}, text='{text[:60]}...', amusement={scores.get('Amusement', 0):.2f}")
```
## Directory Structure
```
emotions/
  {emotion_name}/                 # 40 emotion directories
    000000.tar                    # WebDataset tar (up to 4096 samples each)
    000001.tar
    ...
attributes/
  {attribute_name}/               # 13 attribute directories
    bucket_{i}_{lo}_to_{hi}/      # 7 range buckets per attribute
      000000.tar
      ...
```
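Given this layout, shard paths can be collected with a glob. The sketch below assumes the dataset has been downloaded to a local directory (`root`) and that attribute directories are named after the attribute (for example `Arousal`); both are assumptions. The helper names `emotion_tars` and `attribute_tars` are hypothetical. Bucket directories are matched by their index prefix because the exact formatting of `{lo}` and `{hi}` in the directory name is not specified here.

```python
from pathlib import Path


def emotion_tars(root: str, emotion: str) -> list[str]:
    """Return the sorted tar paths for one emotion directory."""
    return sorted(str(p) for p in Path(root, "emotions", emotion).glob("*.tar"))


def attribute_tars(root: str, attribute: str, bucket: int) -> list[str]:
    """Return the sorted tar paths for one attribute bucket (index 0..6).

    Matches bucket_{i}_* so the {lo}_to_{hi} suffix does not need to be known.
    """
    pattern = f"bucket_{bucket}_*/*.tar"
    return sorted(str(p) for p in Path(root, "attributes", attribute).glob(pattern))
```

The resulting list can be passed to `wds.WebDataset`, which accepts a list of shard paths as well as a single path.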
## Emotion Buckets (40 emotions)
Each sample is routed to its highest-scoring emotion, provided that score is >= 2.5, so a sample appears in at most one emotion bucket. Each emotion is capped at 10,000 samples.
| Emotion | Samples | | Emotion | Samples |
|---------|--------:|-|---------|--------:|
| Interest | 10,000 | | Pain | 1,220 |
| Hope_Enthusiasm_Optimism | 10,000 | | Shame | 1,215 |
| Thankfulness_Gratitude | 10,000 | | Infatuation | 934 |
| Amusement | 10,000 | | Malevolence_Malice | 874 |
| Concentration | 10,000 | | Doubt | 507 |
| Impatience_and_Irritability | 10,000 | | Disgust | 452 |
| Contemplation | 10,000 | | Awe | 421 |
| Anger | 10,000 | | Contentment | 376 |
| Astonishment_Surprise | 10,000 | | Contempt | 348 |
| Affection | 10,000 | | Teasing | 282 |
| Confusion | 9,756 | | Jealousy_&_Envy | 203 |
| Intoxication_Altered_States | 9,375 | | Helplessness | 194 |
| Distress | 6,182 | | Embarrassment | 96 |
| Fatigue_Exhaustion | 5,672 | | Bitterness | 74 |
| Longing | 4,452 | | Pleasure_Ecstasy | 45 |
| Relief | 3,822 | | Sourness | 27 |
| Sadness | 3,634 | | | |
| Elation | 3,042 | | | |
| Triumph | 2,001 | | | |
| Sexual_Lust | 1,967 | | | |
| Emotional_Numbness | 1,721 | | | |
| Fear | 1,676 | | | |
| Pride | 1,374 | | | |
| Disappointment | 1,329 | | | |
**Total: 163,271 samples across 40 emotions**
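The routing rule can be sketched as an argmax followed by a threshold check. This is an illustration under stated assumptions, not the pipeline's actual code: the function name `route_emotion` is hypothetical, ties are broken by list order (the card does not say how ties are handled), and the 10,000-sample cap is not modeled.

```python
from typing import Optional

EMOTION_THRESHOLD = 2.5  # routing threshold stated in this card


def route_emotion(scores: dict, emotion_names: list[str],
                  threshold: float = EMOTION_THRESHOLD) -> Optional[str]:
    """Return the top-scoring emotion if it reaches the threshold, else None."""
    present = {name: scores[name] for name in emotion_names if name in scores}
    if not present:
        return None
    best = max(present, key=present.get)  # first name wins on ties (assumption)
    return best if present[best] >= threshold else None


# Made-up scores for illustration; only three of the 40 emotions are shown.
names = ["Amusement", "Anger", "Interest"]
print(route_emotion({"Amusement": 3.1, "Anger": 0.2, "Interest": 2.7}, names))  # Amusement
print(route_emotion({"Amusement": 1.0, "Anger": 0.2, "Interest": 2.4}, names))  # None
```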
## Attribute Buckets (13 attributes x 7 levels)
Each attribute's score range is split into 7 equal-width (linearly spaced) buckets. A sample can appear in several attribute buckets, at most one per attribute.
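The bucket ranges in the tables below follow from dividing each attribute's range into 7 equal parts and rounding to one decimal. A minimal sketch, assuming the lower edge of each bucket is inclusive and out-of-range values are clipped (neither is specified in this card); the function names are hypothetical:

```python
import numpy as np


def bucket_edges(lo: float, hi: float, n_buckets: int = 7) -> np.ndarray:
    """Return the n_buckets + 1 edges of equal-width buckets over [lo, hi]."""
    return np.linspace(lo, hi, n_buckets + 1)


def bucket_index(value: float, lo: float, hi: float, n_buckets: int = 7) -> int:
    """Map a score to a bucket index in 0..n_buckets-1, clipping out-of-range values."""
    width = (hi - lo) / n_buckets
    idx = int((value - lo) // width)
    return min(max(idx, 0), n_buckets - 1)


# Age edges, rounded to one decimal: 0.0, 0.9, 1.7, 2.6, 3.4, 4.3, 5.1, 6.0
print(np.round(bucket_edges(0.0, 6.0), 1))
# An Arousal score of 2.0 on the 0.0 - 4.0 scale falls in bucket 3 (1.7 - 2.3).
print(bucket_index(2.0, 0.0, 4.0))
```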
### Age (0.0 - 6.0)
| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | 0.0 - 0.9 | 10,000 |
| 1 | 0.9 - 1.7 | 10,000 |
| 2 | 1.7 - 2.6 | 10,000 |
| 3 | 2.6 - 3.4 | 10,000 |
| 4 | 3.4 - 4.3 | 10,000 |
| 5 | 4.3 - 5.1 | 1,068 |
| 6 | 5.1 - 6.0 | 3 |
### Arousal (0.0 - 4.0)
| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | 0.0 - 0.6 | 10,000 |
| 1 | 0.6 - 1.1 | 10,000 |
| 2 | 1.1 - 1.7 | 10,000 |
| 3 | 1.7 - 2.3 | 10,000 |
| 4 | 2.3 - 2.9 | 10,000 |
| 5 | 2.9 - 3.4 | 10,000 |
| 6 | 3.4 - 4.0 | 5,402 |
### Confident vs. Hesitant (0.0 - 4.0)
| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | 0.0 - 0.6 | 10,000 |
| 1 | 0.6 - 1.1 | 10,000 |
| 2 | 1.1 - 1.7 | 10,000 |
| 3 | 1.7 - 2.3 | 10,000 |
| 4 | 2.3 - 2.9 | 10,000 |
| 5 | 2.9 - 3.4 | 10,000 |
| 6 | 3.4 - 4.0 | 1,376 |
### Gender (-2.0 - 2.0)
| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | -2.0 - -1.4 | 10,000 |
| 1 | -1.4 - -0.9 | 10,000 |
| 2 | -0.9 - -0.3 | 10,000 |
| 3 | -0.3 - 0.3 | 10,000 |
| 4 | 0.3 - 0.9 | 10,000 |
| 5 | 0.9 - 1.4 | 10,000 |
| 6 | 1.4 - 2.0 | 10,000 |
### Monotone vs. Expressive (0.0 - 4.0)
| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | 0.0 - 0.6 | 4,928 |
| 1 | 0.6 - 1.1 | 10,000 |
| 2 | 1.1 - 1.7 | 10,000 |
| 3 | 1.7 - 2.3 | 10,000 |
| 4 | 2.3 - 2.9 | 10,000 |
| 5 | 2.9 - 3.4 | 10,000 |
| 6 | 3.4 - 4.0 | 10,000 |
### Serious vs. Humorous (0.0 - 4.0)
All 7 buckets contain 10,000 samples.
### Soft vs. Harsh (-2.0 - 2.0)
| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | -2.0 - -1.4 | 349 |
| 1-5 | ... | 10,000 each |
| 6 | 1.4 - 2.0 | 1,176 |
### Valence (-3.0 - 3.0)
All 7 buckets contain 10,000 samples.
### Warm vs. Cold (-2.0 - 2.0)
| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | -2.0 - -1.4 | 420 |
| 1-6 | ... | 10,000 each |
### Duration (1.0 - 30.0 seconds)
All 7 buckets contain 10,000 samples.
### Talking Speed (5.0 - 25.0 CPS)
All 7 buckets contain 10,000 samples.
### High-Pitched vs. Low-Pitched (0.0 - 4.0)
| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | 0.0 - 0.6 | 6 |
| 1-5 | ... | 10,000 each |
| 6 | 3.4 - 4.0 | 10 |
### Submissive vs. Dominant (-3.0 - 3.0)
| Bucket | Range | Samples |
|--------|-------|--------:|
| 2 | -1.3 - -0.4 | 162 |
| 3 | -0.4 - 0.4 | 10,000 |
| 4 | 0.4 - 1.3 | 10,000 |
| 5 | 1.3 - 2.1 | 10,000 |
| 6 | 2.1 - 3.0 | 86 |
## 59 Annotation Score Dimensions
The `annotation_scores` field in each JSON contains all 59 dimensions from [Empathic-Insight-Voice-Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus):
**Emotions (40):** Affection, Amusement, Anger, Astonishment_Surprise, Awe, Bitterness, Concentration, Confusion, Contemplation, Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation, Embarrassment, Emotional_Numbness, Fatigue_Exhaustion, Fear, Helplessness, Hope_Enthusiasm_Optimism, Impatience_and_Irritability, Infatuation, Interest, Intoxication_Altered_States_of_Consciousness, Jealousy_&_Envy, Longing, Malevolence_Malice, Pain, Pleasure_Ecstasy, Pride, Relief, Sadness, Sexual_Lust, Shame, Sourness, Teasing, Thankfulness_Gratitude, Triumph
**Vocal Attributes (15):** Age, Arousal, Authenticity, Background_Noise, Confident_vs._Hesitant, Gender, High-Pitched_vs._Low-Pitched, Monotone_vs._Expressive, Recording_Quality, Serious_vs._Humorous, Soft_vs._Harsh, Submissive_vs._Dominant, Valence, Vulnerable_vs._Emotionally_Detached, Warm_vs._Cold
**Audio Quality (4):** score_background_quality, score_content_enjoyment, score_overall_quality, score_speech_quality
## Bucketing Strategy
**Emotion routing:** Each sample is assigned to its highest-scoring emotion, provided that score is >= 2.5. Samples whose top emotion score falls below the threshold are excluded, so each sample appears in at most one emotion bucket.
**Attribute routing:** Each attribute has 7 linearly spaced buckets. Each sample is placed into every matching attribute bucket (one per attribute dimension), and each bucket is capped at 10,000 samples.
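Attribute routing with a per-bucket cap can be sketched as follows. The card does not say how samples were chosen once a bucket exceeded the cap, so this sketch assumes a first-come, first-served policy; it also assumes out-of-range scores are clipped into the end buckets. The function name `route_attributes` and the example ranges dictionary are hypothetical.

```python
from collections import Counter

BUCKET_CAP = 10_000  # per-bucket cap stated in this card
N_BUCKETS = 7        # buckets per attribute stated in this card


def route_attributes(scores: dict, ranges: dict, counts: Counter,
                     cap: int = BUCKET_CAP) -> dict:
    """Return {attribute: bucket_index} for every bucket that still has room.

    `ranges` maps attribute name -> (lo, hi). `counts` is keyed by
    (attribute, bucket) and is updated in place.
    Assumption: first-come, first-served filling; the real policy is unknown.
    """
    placed = {}
    for name, (lo, hi) in ranges.items():
        if name not in scores:
            continue
        width = (hi - lo) / N_BUCKETS
        idx = min(max(int((scores[name] - lo) // width), 0), N_BUCKETS - 1)
        if counts[(name, idx)] < cap:
            counts[(name, idx)] += 1
            placed[name] = idx
    return placed


# Made-up scores; ranges are taken from the attribute tables in this card.
ranges = {"Arousal": (0.0, 4.0), "Valence": (-3.0, 3.0)}
counts = Counter()
print(route_attributes({"Arousal": 3.6, "Valence": 0.0}, ranges, counts))
# {'Arousal': 6, 'Valence': 3}
```

One sample lands in one bucket per attribute, which is why the attribute total (~785K) can exceed the number of distinct samples.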
## Pipeline Details
- **Samples scanned:** 71,242,881
- **Processing time:** ~15 hours
- **Datasets processed:** 17,468 tar files across 3 Hugging Face datasets
## Citation
```bibtex
@misc{echo-conditioning-data,
  title={Echo TTS Emotion \& Attribute Conditioning Dataset},
  url={https://huggingface.co/datasets/TTS-AGI/emotion-attribute-conditioning-dacvae},
  note={Bucketed DAC-VAE latent speech data with 59-dimensional emotion/attribute annotations},
}
```
Scores come from [LAION Empathic-Insight-Voice-Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus) and [emotion-annotations](https://github.com/LAION-AI/emotion-annotations).
Provided by:
TTS-AGI



