TTS-AGI/emotion-attribute-conditioning-dacvae

Name: TTS-AGI/emotion-attribute-conditioning-dacvae
Creator: TTS-AGI
Published: 2026-03-15 11:28:12
License: not specified

Source: Hugging Face, updated 2026-03-15, indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/TTS-AGI/emotion-attribute-conditioning-dacvae

Resource description:

# Echo TTS - Emotion & Attribute Conditioning Dataset (DAC-VAE Latents)

Pre-bucketed speech dataset with **DAC-VAE latent representations** organized by 40 emotion categories and 13 vocal/audio attributes. Built for conditioning fine-tuning of [Echo TTS](https://huggingface.co/TTS-AGI/echo-800m-v1-ckpts) and similar DiT-based TTS models.

## Overview

- **Total emotion samples**: 163,271 (across 40 emotions, 10K cap per emotion)
- **Total attribute samples**: ~785K (across 13 attributes x 7 buckets, 10K cap per bucket)
- **Format**: WebDataset `.tar` files (312 total)
- **Latent format**: DAC-VAE 128-dim at 25 fps (variable length)
- **Source**: Filtered from 71.2M samples across 3 datasets

### Source Datasets

1. [TTS-AGI/podcast-tokenized-bg3.5-enj5](https://huggingface.co/datasets/TTS-AGI/podcast-tokenized-bg3.5-enj5) (481 tars)
2. [TTS-AGI/podcast-tokenized-bg2.5-enj4.5](https://huggingface.co/datasets/TTS-AGI/podcast-tokenized-bg2.5-enj4.5) (6,496 tars)
3. [TTS-AGI/emolia-hq-tokenized](https://huggingface.co/datasets/TTS-AGI/emolia-hq-tokenized) (10,491 tars)

### Annotation Source

All emotion and attribute scores come from [LAION Empathic-Insight-Voice-Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus) via [emotion-annotations](https://github.com/LAION-AI/emotion-annotations).
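The latent format stated in the Overview (128 dimensions at 25 fps, stored as float32) implies a simple relationship between frame counts, durations, and payload sizes. The sketch below illustrates that arithmetic; the helper names are hypothetical, and the rounding in `seconds_to_frames` is an assumption, since the dataset card does not say how frame counts were derived from durations.

```python
FPS = 25               # latent frame rate stated in the Overview
LATENT_DIM = 128       # latent width stated in the Overview
BYTES_PER_FLOAT32 = 4  # size of one float32 value


def frames_to_seconds(frames: int) -> float:
    """Approximate audio duration covered by a number of latent frames."""
    return frames / FPS


def seconds_to_frames(seconds: float) -> int:
    """Approximate frame count for a duration (rounding is an assumption)."""
    return round(seconds * FPS)


def latent_nbytes(frames: int) -> int:
    """Size in bytes of the raw float32 values for a (frames, 128) latent."""
    return frames * LATENT_DIM * BYTES_PER_FLOAT32
```

For example, a 30-second clip (the upper end of the Duration attribute range) corresponds to roughly 750 latent frames.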
## Data Format

Each sample in the tar files contains 3 files:

| File | Shape | Description |
|------|-------|-------------|
| `{key}.npy` | `(frames, 128)` float32 | DAC-VAE latent (target speech, 25 fps) |
| `{key}.ref.npy` | `(ref_frames, 128)` float32 | DAC-VAE latent (speaker reference) |
| `{key}.json` | - | Metadata with text, scores, speaker info |

### JSON Metadata Fields

| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Transcript text |
| `annotation_scores` | dict (59 keys) | All emotion + attribute scores |
| `caption` | string | Natural language description of the audio |
| `target_duration` | float | Duration in seconds |
| `context_duration` | float | Speaker reference duration |
| `latent_frames` | int | Number of latent frames in .npy |
| `ref_latent_frames` | int | Number of frames in ref.npy |
| `latent_dim` | int | Always 128 |
| `speaker` | string | Speaker ID |
| `language` | string | Language code |
| `episode_id` | int | Source episode identifier |

### Loading Example

```python
import webdataset as wds
import numpy as np
import json

dataset = wds.WebDataset("emotions/Amusement/000000.tar")

for sample in dataset:
    key = sample["__key__"]
    # Treats each .npy payload as a raw float32 buffer, as in the original example.
    # If the files carry a standard NumPy header, np.load(io.BytesIO(...)) would be needed instead.
    latent = np.frombuffer(sample["npy"], dtype=np.float32).reshape(-1, 128)
    ref_latent = np.frombuffer(sample["ref.npy"], dtype=np.float32).reshape(-1, 128)
    metadata = json.loads(sample["json"])
    text = metadata["text"]
    scores = metadata["annotation_scores"]
    print(f"{key}: {latent.shape}, text='{text[:60]}...', amusement={scores.get('Amusement', 0):.2f}")
```

## Directory Structure

```
emotions/
  {emotion_name}/              # 40 emotion directories
    000000.tar                 # WebDataset tar (up to 4096 samples each)
    000001.tar
    ...
attributes/
  {attribute_name}/            # 13 attribute directories
    bucket_{i}_{lo}_to_{hi}/   # 7 range buckets per attribute
      000000.tar
      ...
```

## Emotion Buckets (40 emotions)

Samples are routed to the emotion with the highest score, if that score >= 2.5.
Each sample appears in at most one emotion bucket.

| Emotion | Samples | | Emotion | Samples |
|---------|--------:|-|---------|--------:|
| Interest | 10,000 | | Pain | 1,220 |
| Hope_Enthusiasm_Optimism | 10,000 | | Shame | 1,215 |
| Thankfulness_Gratitude | 10,000 | | Infatuation | 934 |
| Amusement | 10,000 | | Malevolence_Malice | 874 |
| Concentration | 10,000 | | Doubt | 507 |
| Impatience_and_Irritability | 10,000 | | Disgust | 452 |
| Contemplation | 10,000 | | Awe | 421 |
| Anger | 10,000 | | Contentment | 376 |
| Astonishment_Surprise | 10,000 | | Contempt | 348 |
| Affection | 10,000 | | Teasing | 282 |
| Confusion | 9,756 | | Jealousy_&_Envy | 203 |
| Intoxication_Altered_States | 9,375 | | Helplessness | 194 |
| Distress | 6,182 | | Embarrassment | 96 |
| Fatigue_Exhaustion | 5,672 | | Bitterness | 74 |
| Longing | 4,452 | | Pleasure_Ecstasy | 45 |
| Relief | 3,822 | | Sourness | 27 |
| Sadness | 3,634 | | | |
| Elation | 3,042 | | | |
| Triumph | 2,001 | | | |
| Sexual_Lust | 1,967 | | | |
| Emotional_Numbness | 1,721 | | | |
| Fear | 1,676 | | | |
| Pride | 1,374 | | | |
| Disappointment | 1,329 | | | |

**Total: 163,271 samples across 40 emotions**

## Attribute Buckets (13 attributes x 7 levels)

Each attribute is divided into 7 linear buckets. Samples can appear in multiple attribute buckets (one per attribute).

### Age (0.0 - 6.0)

| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | 0.0 - 0.9 | 10,000 |
| 1 | 0.9 - 1.7 | 10,000 |
| 2 | 1.7 - 2.6 | 10,000 |
| 3 | 2.6 - 3.4 | 10,000 |
| 4 | 3.4 - 4.3 | 10,000 |
| 5 | 4.3 - 5.1 | 1,068 |
| 6 | 5.1 - 6.0 | 3 |

### Arousal (0.0 - 4.0)

| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | 0.0 - 0.6 | 10,000 |
| 1 | 0.6 - 1.1 | 10,000 |
| 2 | 1.1 - 1.7 | 10,000 |
| 3 | 1.7 - 2.3 | 10,000 |
| 4 | 2.3 - 2.9 | 10,000 |
| 5 | 2.9 - 3.4 | 10,000 |
| 6 | 3.4 - 4.0 | 5,402 |

### Confident vs. Hesitant (0.0 - 4.0)

| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | 0.0 - 0.6 | 10,000 |
| 1 | 0.6 - 1.1 | 10,000 |
| 2 | 1.1 - 1.7 | 10,000 |
| 3 | 1.7 - 2.3 | 10,000 |
| 4 | 2.3 - 2.9 | 10,000 |
| 5 | 2.9 - 3.4 | 10,000 |
| 6 | 3.4 - 4.0 | 1,376 |

### Gender (-2.0 - 2.0)

| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | -2.0 - -1.4 | 10,000 |
| 1 | -1.4 - -0.9 | 10,000 |
| 2 | -0.9 - -0.3 | 10,000 |
| 3 | -0.3 - 0.3 | 10,000 |
| 4 | 0.3 - 0.9 | 10,000 |
| 5 | 0.9 - 1.4 | 10,000 |
| 6 | 1.4 - 2.0 | 10,000 |

### Monotone vs. Expressive (0.0 - 4.0)

| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | 0.0 - 0.6 | 4,928 |
| 1 | 0.6 - 1.1 | 10,000 |
| 2 | 1.1 - 1.7 | 10,000 |
| 3 | 1.7 - 2.3 | 10,000 |
| 4 | 2.3 - 2.9 | 10,000 |
| 5 | 2.9 - 3.4 | 10,000 |
| 6 | 3.4 - 4.0 | 10,000 |

### Serious vs. Humorous (0.0 - 4.0)

All 7 buckets at 10,000 samples.

### Soft vs. Harsh (-2.0 - 2.0)

| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | -2.0 - -1.4 | 349 |
| 1-5 | ... | 10,000 each |
| 6 | 1.4 - 2.0 | 1,176 |

### Valence (-3.0 - 3.0)

All 7 buckets at 10,000 samples.

### Warm vs. Cold (-2.0 - 2.0)

| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | -2.0 - -1.4 | 420 |
| 1-6 | ... | 10,000 each |

### Duration (1.0 - 30.0 seconds)

All 7 buckets at 10,000 samples.

### Talking Speed (5.0 - 25.0 CPS)

All 7 buckets at 10,000 samples.

### High-Pitched vs. Low-Pitched (0.0 - 4.0)

| Bucket | Range | Samples |
|--------|-------|--------:|
| 0 | 0.0 - 0.6 | 6 |
| 1-5 | ... | 10,000 each |
| 6 | 3.4 - 4.0 | 10 |

### Submissive vs. Dominant (-3.0 - 3.0)

| Bucket | Range | Samples |
|--------|-------|--------:|
| 2 | -1.3 - -0.4 | 162 |
| 3 | -0.4 - 0.4 | 10,000 |
| 4 | 0.4 - 1.3 | 10,000 |
| 5 | 1.3 - 2.1 | 10,000 |
| 6 | 2.1 - 3.0 | 86 |

## 59 Annotation Score Dimensions

The `annotation_scores` field in each JSON contains all 59 dimensions from [Empathic-Insight-Voice-Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus):

**Emotions (40):** Affection, Amusement, Anger, Astonishment_Surprise, Awe, Bitterness, Concentration, Confusion, Contemplation, Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation, Embarrassment, Emotional_Numbness, Fatigue_Exhaustion, Fear, Helplessness, Hope_Enthusiasm_Optimism, Impatience_and_Irritability, Infatuation, Interest, Intoxication_Altered_States_of_Consciousness, Jealousy_&_Envy, Longing, Malevolence_Malice, Pain, Pleasure_Ecstasy, Pride, Relief, Sadness, Sexual_Lust, Shame, Sourness, Teasing, Thankfulness_Gratitude, Triumph

**Vocal Attributes (15):** Age, Arousal, Authenticity, Background_Noise, Confident_vs._Hesitant, Gender, High-Pitched_vs._Low-Pitched, Monotone_vs._Expressive, Recording_Quality, Serious_vs._Humorous, Soft_vs._Harsh, Submissive_vs._Dominant, Valence, Vulnerable_vs._Emotionally_Detached, Warm_vs._Cold

**Audio Quality (4):** score_background_quality, score_content_enjoyment, score_overall_quality, score_speech_quality

## Bucketing Strategy

**Emotion routing:** Each sample is assigned to the emotion with the highest score, if that score >= 2.5. Samples below threshold or with no strong emotion are excluded. Each sample appears in at most 1 emotion bucket.

**Attribute routing:** 7 linearly-spaced buckets per attribute. Each sample is placed into all matching attribute buckets (one per attribute dimension). Bucket size capped at 10,000 samples.
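The routing rules above can be sketched as follows. This is an illustration of the described strategy, not the pipeline's actual code: the function names are hypothetical, and the tie-breaking, the handling of out-of-range values, and the inclusive upper edge are assumptions. The 10,000-sample cap is not modeled.

```python
from typing import Mapping, Optional

EMOTION_THRESHOLD = 2.5  # minimum top score for emotion routing (from the README)
NUM_BUCKETS = 7          # linearly spaced buckets per attribute (from the README)


def route_emotion(emotion_scores: Mapping[str, float],
                  threshold: float = EMOTION_THRESHOLD) -> Optional[str]:
    """Return the highest-scoring emotion, or None if it is below the threshold."""
    if not emotion_scores:
        return None
    # Assumption: ties are broken by iteration order of the mapping.
    name, score = max(emotion_scores.items(), key=lambda kv: kv[1])
    return name if score >= threshold else None


def attribute_bucket(value: float, lo: float, hi: float,
                     num_buckets: int = NUM_BUCKETS) -> Optional[int]:
    """Return the index of the linear bucket containing value, or None if out of range."""
    # Assumption: values outside [lo, hi] are excluded rather than clamped.
    if not lo <= value <= hi:
        return None
    width = (hi - lo) / num_buckets
    # Assumption: the upper edge hi belongs to the last bucket.
    return min(int((value - lo) / width), num_buckets - 1)
```

With the Age range of 0.0 to 6.0, `attribute_bucket(1.0, 0.0, 6.0)` falls in bucket 1, which agrees with the 0.9 - 1.7 row of the Age table.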
## Pipeline Details

- **Samples scanned:** 71,242,881
- **Processing time:** ~15 hours
- **Datasets processed:** 17,468 tar files across 3 HuggingFace datasets

## Citation

```bibtex
@misc{echo-conditioning-data,
  title={Echo TTS Emotion & Attribute Conditioning Dataset},
  url={https://huggingface.co/datasets/TTS-AGI/emotion-attribute-conditioning-dacvae},
  note={Bucketed DAC-VAE latent speech data with 59-dimensional emotion/attribute annotations},
}
```

Scores from [LAION Empathic-Insight-Voice-Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus) and [emotion-annotations](https://github.com/LAION-AI/emotion-annotations).

Provider:

TTS-AGI
