TTS-AGI/Emotion-Voice-Attribute-Reference-Snippets-DACVAE-Wave

Name: TTS-AGI/Emotion-Voice-Attribute-Reference-Snippets-DACVAE-Wave
Creator: TTS-AGI
Published: 2026-03-22 21:31:10
License: cc-by-4.0 (from the dataset card metadata; the listing itself gives no description)

Hugging Face · Updated 2026-03-22 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/TTS-AGI/Emotion-Voice-Attribute-Reference-Snippets-DACVAE-Wave

Resource description:

---
license: cc-by-4.0
task_categories:
- text-to-speech
- audio-classification
tags:
- emotion
- voice-attributes
- dacvae
- speech
- tts
- audio
pretty_name: "Emotion and Voice Attribute Reference Snippets DACVAE and Wave"
size_categories:
- 100K<n<1M
---

# Emotion and Voice Attribute Reference Snippets - DACVAE and Wave

Merged dataset combining **TTS-AGI/enhanced-emo-snippets-balanced-DACVAE** and **TTS-AGI/emotion-attribute-conditioning-dacvae** with decoded WAV audio.

## Overview

- **Total samples**: 606,178
- **Filtered out**: 363,331 (samples with `score_speech_quality < 1.8`)
- **Total tar files**: 328
- **Total size**: 1.54 TB
- **Audio format**: WAV, 48kHz, PCM 16-bit mono
- **Latents**: DAC-VAE float16 `[T, 128]` at 25 frames/sec
- **Dimensions**: 57 (40 emotions + 15 voice attributes + 2 additional attributes)

## File Structure

Each tar file is named `{Dimension}_{bucket_range}.tar` and contains WebDataset-formatted samples:

```
{key}.json          # Full metadata (scores, text, captions, etc.)
{key}.target.npy    # DACVAE latent for target speech [T, 128] float16
{key}.target.wav    # Decoded target audio (48kHz WAV)
{key}.ref.npy       # DACVAE latent for speaker reference [T, 128] float16 (if available)
{key}.ref.wav       # Decoded reference audio (48kHz WAV) (if available)
```

Samples prefixed with `emo_` come from DS1 (enhanced-emo-snippets-balanced); samples prefixed with `cond_` come from DS2 (emotion-attribute-conditioning). DS2 samples include speaker reference audio (`.ref.npy` / `.ref.wav`), while DS1 samples include speaker embeddings in the JSON metadata.
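The naming conventions above can be turned into two small helpers. This is a minimal sketch, not part of the dataset tooling: the function names `split_tar_name` and `source_of` are hypothetical, and the code assumes that the bucket label never contains an underscore, as in the `Anger_4to5.tar` file used in the Usage section. Dimension names such as `Astonishment_Surprise` do contain underscores, so the split is done on the last one.

```python
from pathlib import PurePosixPath


def split_tar_name(name: str) -> tuple[str, str]:
    """Split '{Dimension}_{bucket_range}.tar' into (dimension, bucket label).

    Assumption: the bucket label has no underscore, so the last underscore
    separates it from the dimension name.
    """
    stem = PurePosixPath(name).stem  # drops the directory part and '.tar'
    dimension, bucket = stem.rsplit("_", 1)
    return dimension, bucket


def source_of(key: str) -> str:
    """Map a sample key to its source dataset using the documented prefixes."""
    if key.startswith("emo_"):
        return "DS1"  # enhanced-emo-snippets-balanced
    if key.startswith("cond_"):
        return "DS2"  # emotion-attribute-conditioning
    return "unknown"


# Example keys below are made up for illustration.
print(split_tar_name("data/Anger_4to5.tar"))
print(source_of("emo_example"), source_of("cond_example"))
```

The bucket label is returned as an opaque string because the card only shows one concrete file name; the exact spelling of float-valued bucket ranges is not documented here.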
## Dimensions

### Emotions (40)

| Dimension | Buckets | Tar Files |
|-----------|---------|-----------|
| Affection | [0,1) to [4,5) | 5 |
| Amusement | [0,1) to [4,5) | 5 |
| Anger | [0,1) to [5,6) | 6 |
| Astonishment_Surprise | [0,1) to [4,5) | 5 |
| Awe | [0,1) to [4,5) | 5 |
| Bitterness | [0,1) to [4,5) | 5 |
| Concentration | [0,1) to [4,5) | 5 |
| Confusion | [0,1) to [4,5) | 5 |
| Contemplation | [0,1) to [3,4) | 4 |
| Contempt | [0,1) to [4,5) | 5 |
| Contentment | [0,1) to [3,4) | 4 |
| Disappointment | [0,1) to [4,5) | 5 |
| Disgust | [0,1) to [3,4) | 4 |
| Distress | [0,1) to [4,5) | 5 |
| Doubt | [0,1) to [4,5) | 5 |
| Elation | [0,1) to [5,6) | 6 |
| Embarrassment | [0,1) to [2,3) | 3 |
| Emotional_Numbness | [0,1) to [3,4) | 4 |
| Fatigue_Exhaustion | [1,2) to [4,5) | 4 |
| Fear | [0,1) to [3,4) | 4 |
| Helplessness | [0,1) to [3,4) | 4 |
| Hope_Enthusiasm_Optimism | [0,1) to [6,7) | 7 |
| Impatience_and_Irritability | [0,1) to [4,5) | 5 |
| Infatuation | [0,1) to [4,5) | 5 |
| Interest | [0,1) to [3,4) | 4 |
| Intoxication_Altered_States_of_Consciousness | [0,1) to [4,5) | 5 |
| Jealousy_and_Envy | [0,1) to [4,5) | 5 |
| Longing | [0,1) to [3,4) | 4 |
| Malevolence_Malice | [0,1) to [3,4) | 4 |
| Pain | [0,1) to [5,6) | 6 |
| Pleasure_Ecstasy | [0,1) to [3,4) | 4 |
| Pride | [0,1) to [4,5) | 5 |
| Relief | [0,1) to [5,6) | 6 |
| Sadness | [0,1) to [4,5) | 5 |
| Sexual_Lust | [0,1) to [4,5) | 5 |
| Shame | [0,1) to [5,6) | 6 |
| Sourness | [0,1) to [3,4) | 4 |
| Teasing | [0,1) to [3,4) | 4 |
| Thankfulness_Gratitude | [0,1) to [4,5) | 5 |
| Triumph | [0,1) to [4,5) | 5 |

### Voice Attributes (15 from DS1 + 2 from DS2)

Attributes from DS1 use integer bucket ranges. Attributes from DS2 use float-valued bucket ranges derived from the conditioning pipeline.

| Dimension | Bucket Type | Tar Files |
|-----------|-------------|-----------|
| Age | Integer [0,6) + Float [0.00, 5.14) | 12 |
| Arousal | Integer [0,6) + Float [0.00, 4.00) | 13 |
| Authenticity | Integer [1,5) | 4 |
| Background_Noise | Integer [0,3) | 3 |
| Confident_vs._Hesitant | Integer [0,5) + Float [0.00, 4.00) | 12 |
| Gender | Integer [0,3) + Float [0.29, 2.00) | 6 |
| High-Pitched_vs._Low-Pitched | Integer [0,5) + Float [0.00, 3.43) | 11 |
| Monotone_vs._Expressive | Integer [0,5) + Float [0.00, 4.00) | 12 |
| Recording_Quality | Integer [0,5) | 5 |
| Serious_vs._Humorous | Integer [0,6) + Float [0.00, 4.00) | 13 |
| Soft_vs._Harsh | Integer [0,2) + Float [0.29, 2.00) | 5 |
| Submissive_vs._Dominant | Integer [0,3) + Float [0.43, 3.00) | 6 |
| Valence | Integer [0,4) + Float [0.43, 3.00) | 7 |
| Vulnerable_vs._Emotionally_Detached | Integer [0,5) | 5 |
| Warm_vs._Cold | Integer [0,3) + Float [0.29, 2.00) | 6 |
| duration | Float [1.00, 30.00) | 7 |
| talking_speed | Float [5.00, 25.00) | 7 |

## Metadata Fields

Each sample's `.json` contains:

**From DS1 (enhanced-emo-snippets-balanced):**

- `transcription`: Speech transcript
- `caption`, `detailed_caption`, `bude_whisper_caption`: Natural language audio descriptions
- `empathic_insight_scores`: 59 float scores (40 emotions + 15 attributes + 4 quality)
- `speaker_embedding`: 128-dim speaker embedding vector
- `emotion_vector`: Encoded emotion vector
- `enhancement_model`: Speech enhancement model used (`MossFormer2_SE_48K`)
- `duration`: Audio duration in seconds

**From DS2 (emotion-attribute-conditioning):**

- `text`: Speech transcript
- `caption`: Natural language audio description
- `annotation_scores`: 59 float scores (same dimensions as DS1)
- `target_duration`, `context_duration`: Target and reference durations
- `speaker`, `language`: Speaker ID and language code

**Added by merge pipeline:**

- `_source_dataset`: `"enhanced-emo-snippets-balanced"` or `"emotion-attribute-conditioning"`
- `_dimension`: The emotion/attribute dimension name
- `_bucket`: The bucket label
- `has_reference`: Whether reference audio is available

## Quality Scores

All samples include 59 annotation scores from [Empathic Insight Voice Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus):

- **40 emotion scores**: Amusement, Anger, Fear, Sadness, etc.
- **15 attribute scores**: Valence, Arousal, Age, Gender, etc.
- **4 quality scores**: `score_overall_quality`, `score_speech_quality`, `score_content_enjoyment`, `score_background_quality`

Only samples with `score_speech_quality >= 1.8` are included in this dataset.

## Sources

- **DS1**: [TTS-AGI/enhanced-emo-snippets-balanced-DACVAE](https://huggingface.co/datasets/TTS-AGI/enhanced-emo-snippets-balanced-DACVAE): Quality-ranked emotion/attribute snippets with speech enhancement
- **DS2**: [TTS-AGI/emotion-attribute-conditioning-dacvae](https://huggingface.co/datasets/TTS-AGI/emotion-attribute-conditioning-dacvae): Emotion/attribute conditioning pairs with speaker references
- **DACVAE**: [mrfakename/dacvae-watermarked](https://huggingface.co/mrfakename/dacvae-watermarked): DAC-VAE model for audio codec

## Usage

```python
import io
import json

import numpy as np
import soundfile as sf
import webdataset as wds

url = "https://huggingface.co/datasets/TTS-AGI/Emotion-Voice-Attribute-Reference-Snippets-DACVAE-Wave/resolve/main/data/Anger_4to5.tar"

# No .decode() here: WebDataset's default decoders would already parse the
# .json and .npy entries, so the explicit json.loads / np.load calls below
# would fail. Samples are kept as raw bytes and decoded by hand instead.
ds = wds.WebDataset(url)

for sample in ds:
    meta = json.loads(sample["json"])
    # Decoded 48kHz audio as a NumPy array, plus its sample rate
    target_wav, sample_rate = sf.read(io.BytesIO(sample["target.wav"]))
    target_latent = np.load(io.BytesIO(sample["target.npy"]))  # [T, 128] float16

    if "ref.wav" in sample:
        ref_wav, _ = sf.read(io.BytesIO(sample["ref.wav"]))  # speaker reference audio
        ref_latent = np.load(io.BytesIO(sample["ref.npy"]))  # [T, 128] float16

    # Access emotion scores (DS1 and DS2 store them under different keys)
    scores = meta.get("empathic_insight_scores") or meta.get("annotation_scores", {})
    speech_quality = scores.get("score_speech_quality", 0)
    anger_score = scores.get("Anger", 0)
```

## DACVAE Encode/Decode

Audio was decoded from DAC-VAE latents at 48kHz, 25 latent frames/sec:

```python
import torch
from dacvae import DACVAE
from huggingface_hub import hf_hub_download

model = DACVAE.load(hf_hub_download("mrfakename/dacvae-watermarked", "weights.pth")).cuda().eval()

# `latent` is a [T_latent, 128] float16 NumPy array (e.g. from `{key}.target.npy`);
# `wav` is a mono 48kHz waveform as a 1-D NumPy array.
# no_grad avoids building autograd graphs and lets .numpy() work on the outputs.
with torch.no_grad():
    # Decode: latent -> audio
    z = torch.from_numpy(latent.T).unsqueeze(0).float().cuda()  # [1, 128, T_latent]
    audio_48k = model.decode(z).squeeze().cpu()  # [T_audio] at 48kHz

    # Encode: audio -> latent
    audio = torch.from_numpy(wav).unsqueeze(0).unsqueeze(0).float().cuda()  # [1, 1, T_audio]
    z_encoded = model.encode(audio)  # [1, 128, T_latent]
    latent = z_encoded.squeeze(0).T.cpu().half().numpy()  # [T_latent, 128] float16
```
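The stated rates (48kHz audio, 25 latent frames per second) imply 1,920 audio samples per latent frame. The sketch below uses that ratio to estimate duration from a latent's frame count and to sanity-check a latent/WAV pair. The helper names are hypothetical, and the one-frame tolerance is an assumption, since the card does not say how the codec pads or trims at clip boundaries.

```python
SAMPLE_RATE_HZ = 48_000  # audio sample rate stated in the dataset card
LATENT_FPS = 25          # latent frame rate stated in the dataset card
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // LATENT_FPS  # 1920 samples per latent frame


def latent_duration_seconds(num_frames: int) -> float:
    """Nominal duration of a latent with `num_frames` rows (shape [T, 128])."""
    return num_frames / LATENT_FPS


def expected_num_samples(num_frames: int) -> int:
    """Nominal number of 48kHz audio samples for `num_frames` latent frames."""
    return num_frames * SAMPLES_PER_FRAME


def lengths_match(num_frames: int, num_samples: int, tolerance_frames: int = 1) -> bool:
    """Check that a waveform length is within a few frames of the nominal length.

    The tolerance is an assumption to absorb possible padding at clip edges.
    """
    difference = abs(num_samples - expected_num_samples(num_frames))
    return difference <= tolerance_frames * SAMPLES_PER_FRAME


# A 75-frame latent corresponds to a nominal 3-second clip.
print(latent_duration_seconds(75), expected_num_samples(75))
```

With the arrays from the Usage section, a check would look like `lengths_match(target_latent.shape[0], len(target_wav))`.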

Provider: TTS-AGI
