TTS-AGI/Emotion-Voice-Attribute-Reference-Snippets-DACVAE-Wave
Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/TTS-AGI/Emotion-Voice-Attribute-Reference-Snippets-DACVAE-Wave
Dataset description:
---
license: cc-by-4.0
task_categories:
- text-to-speech
- audio-classification
tags:
- emotion
- voice-attributes
- dacvae
- speech
- tts
- audio
pretty_name: "Emotion and Voice Attribute Reference Snippets DACVAE and Wave"
size_categories:
- 100K<n<1M
---
# Emotion and Voice Attribute Reference Snippets - DACVAE and Wave
This dataset merges **TTS-AGI/enhanced-emo-snippets-balanced-DACVAE** (DS1) and
**TTS-AGI/emotion-attribute-conditioning-dacvae** (DS2) and ships decoded WAV audio alongside the DAC-VAE latents.
## Overview
- **Total samples**: 606,178
- **Filtered out**: 363,331 (samples with `score_speech_quality < 1.8`)
- **Total tar files**: 328
- **Total size**: 1.54 TB
- **Audio format**: WAV, 48kHz, PCM 16-bit mono
- **Latents**: DAC-VAE float16 `[T, 128]` at 25 frames/sec
- **Dimensions**: 57 (40 emotions + 15 voice attributes + 2 additional attributes)
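The sample rate and latent frame rate listed above imply a fixed ratio between latent frames and audio samples. The following is a minimal sketch of that arithmetic. It assumes the decoded audio length tracks the latent length exactly, so real files may differ slightly because of padding. The function names are illustrative and not part of any dataset tooling.

```python
SAMPLE_RATE_HZ = 48_000  # from the Overview: WAV, 48 kHz
LATENT_FPS = 25          # from the Overview: 25 latent frames/sec
LATENT_DIM = 128         # from the Overview: latents are [T, 128]
FLOAT16_BYTES = 2

# Number of audio samples covered by one latent frame.
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // LATENT_FPS


def latent_duration_seconds(num_frames: int) -> float:
    """Approximate duration of a latent with `num_frames` rows."""
    return num_frames / LATENT_FPS


def expected_audio_samples(num_frames: int) -> int:
    """Approximate number of 48 kHz samples the latent decodes to."""
    return num_frames * SAMPLES_PER_FRAME


def latent_payload_bytes(num_frames: int) -> int:
    """Raw float16 payload size of a [T, 128] latent, excluding the .npy header."""
    return num_frames * LATENT_DIM * FLOAT16_BYTES
```

Under these assumptions, one second of speech corresponds to 25 latent rows, which is far smaller than the matching 16-bit PCM audio.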
## File Structure
Each tar file is named `{Dimension}_{bucket_range}.tar` and contains WebDataset-formatted samples:
```
{key}.json # Full metadata (scores, text, captions, etc.)
{key}.target.npy # DACVAE latent for target speech [T, 128] float16
{key}.target.wav # Decoded target audio (48kHz WAV)
{key}.ref.npy # DACVAE latent for speaker reference [T, 128] float16 (if available)
{key}.ref.wav # Decoded reference audio (48kHz WAV) (if available)
```
Samples prefixed with `emo_` come from DS1 (enhanced-emo-snippets-balanced); samples prefixed with `cond_` come from DS2 (emotion-attribute-conditioning). DS2 samples include speaker reference audio (`.ref.npy` / `.ref.wav`), whereas DS1 samples carry a speaker embedding in the JSON metadata instead.
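The key-prefix convention can be turned into a small routing helper. This is a sketch that follows only the naming rules stated above. `source_of` and `expected_members` are hypothetical names, and the entries marked "(if available)" are treated as optional.

```python
DS1_NAME = "enhanced-emo-snippets-balanced"
DS2_NAME = "emotion-attribute-conditioning"


def source_of(key: str) -> str:
    """Map a sample key to its source dataset using the documented prefixes."""
    if key.startswith("emo_"):
        return DS1_NAME
    if key.startswith("cond_"):
        return DS2_NAME
    raise ValueError(f"unrecognized sample key prefix: {key!r}")


def expected_members(key: str, has_reference: bool) -> list:
    """List the tar member names a sample should have."""
    members = [f"{key}.json", f"{key}.target.npy", f"{key}.target.wav"]
    if has_reference:
        # Reference latent and audio are only present when available.
        members += [f"{key}.ref.npy", f"{key}.ref.wav"]
    return members
```

A check like this can be useful when validating a downloaded shard, for example to confirm that every `cond_` sample flagged with a reference actually has both `.ref.*` members.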
## Dimensions
### Emotions (40)
| Dimension | Buckets | Tar Files |
|-----------|---------|-----------|
| Affection | [0,1) to [4,5) | 5 |
| Amusement | [0,1) to [4,5) | 5 |
| Anger | [0,1) to [5,6) | 6 |
| Astonishment_Surprise | [0,1) to [4,5) | 5 |
| Awe | [0,1) to [4,5) | 5 |
| Bitterness | [0,1) to [4,5) | 5 |
| Concentration | [0,1) to [4,5) | 5 |
| Confusion | [0,1) to [4,5) | 5 |
| Contemplation | [0,1) to [3,4) | 4 |
| Contempt | [0,1) to [4,5) | 5 |
| Contentment | [0,1) to [3,4) | 4 |
| Disappointment | [0,1) to [4,5) | 5 |
| Disgust | [0,1) to [3,4) | 4 |
| Distress | [0,1) to [4,5) | 5 |
| Doubt | [0,1) to [4,5) | 5 |
| Elation | [0,1) to [5,6) | 6 |
| Embarrassment | [0,1) to [2,3) | 3 |
| Emotional_Numbness | [0,1) to [3,4) | 4 |
| Fatigue_Exhaustion | [1,2) to [4,5) | 4 |
| Fear | [0,1) to [3,4) | 4 |
| Helplessness | [0,1) to [3,4) | 4 |
| Hope_Enthusiasm_Optimism | [0,1) to [6,7) | 7 |
| Impatience_and_Irritability | [0,1) to [4,5) | 5 |
| Infatuation | [0,1) to [4,5) | 5 |
| Interest | [0,1) to [3,4) | 4 |
| Intoxication_Altered_States_of_Consciousness | [0,1) to [4,5) | 5 |
| Jealousy_and_Envy | [0,1) to [4,5) | 5 |
| Longing | [0,1) to [3,4) | 4 |
| Malevolence_Malice | [0,1) to [3,4) | 4 |
| Pain | [0,1) to [5,6) | 6 |
| Pleasure_Ecstasy | [0,1) to [3,4) | 4 |
| Pride | [0,1) to [4,5) | 5 |
| Relief | [0,1) to [5,6) | 6 |
| Sadness | [0,1) to [4,5) | 5 |
| Sexual_Lust | [0,1) to [4,5) | 5 |
| Shame | [0,1) to [5,6) | 6 |
| Sourness | [0,1) to [3,4) | 4 |
| Teasing | [0,1) to [3,4) | 4 |
| Thankfulness_Gratitude | [0,1) to [4,5) | 5 |
| Triumph | [0,1) to [4,5) | 5 |
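Each row in the table gives the first and last half-open integer bucket, and the tar count equals the number of buckets in that span. The sketch below expands a row into its buckets. The `{lo}to{hi}` file-name suffix is an assumption inferred from the single example `Anger_4to5.tar` in the Usage section, so it may not hold for every file.

```python
def integer_buckets(first_lo: int, last_hi: int) -> list:
    """Expand '[first_lo, first_lo+1) to [last_hi-1, last_hi)' into (lo, hi) pairs."""
    return [(lo, lo + 1) for lo in range(first_lo, last_hi)]


def tar_names(dimension: str, first_lo: int, last_hi: int) -> list:
    """Build candidate tar names; the 'XtoY' suffix is an assumed pattern."""
    return [
        f"{dimension}_{lo}to{hi}.tar"
        for lo, hi in integer_buckets(first_lo, last_hi)
    ]


# Example: Anger spans [0,1) to [5,6), i.e. six buckets.
anger_tars = tar_names("Anger", 0, 6)
```

For a row such as Fatigue_Exhaustion, which starts at `[1,2)`, the same helper yields four buckets, matching the table.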
### Voice Attributes (15 from DS1 + 2 from DS2)
Attributes from DS1 use integer bucket ranges. Attributes from DS2 use float-valued bucket ranges derived from the conditioning pipeline.
| Dimension | Bucket Type | Tar Files |
|-----------|-------------|-----------|
| Age | Integer [0,6) + Float [0.00, 5.14) | 12 |
| Arousal | Integer [0,6) + Float [0.00, 4.00) | 13 |
| Authenticity | Integer [1,5) | 4 |
| Background_Noise | Integer [0,3) | 3 |
| Confident_vs._Hesitant | Integer [0,5) + Float [0.00, 4.00) | 12 |
| Gender | Integer [0,3) + Float [0.29, 2.00) | 6 |
| High-Pitched_vs._Low-Pitched | Integer [0,5) + Float [0.00, 3.43) | 11 |
| Monotone_vs._Expressive | Integer [0,5) + Float [0.00, 4.00) | 12 |
| Recording_Quality | Integer [0,5) | 5 |
| Serious_vs._Humorous | Integer [0,6) + Float [0.00, 4.00) | 13 |
| Soft_vs._Harsh | Integer [0,2) + Float [0.29, 2.00) | 5 |
| Submissive_vs._Dominant | Integer [0,3) + Float [0.43, 3.00) | 6 |
| Valence | Integer [0,4) + Float [0.43, 3.00) | 7 |
| Vulnerable_vs._Emotionally_Detached | Integer [0,5) | 5 |
| Warm_vs._Cold | Integer [0,3) + Float [0.29, 2.00) | 6 |
| duration | Float [1.00, 30.00) | 7 |
| talking_speed | Float [5.00, 25.00) | 7 |
## Metadata Fields
Each sample's `.json` contains:
**From DS1 (enhanced-emo-snippets-balanced):**
- `transcription` — Speech transcript
- `caption`, `detailed_caption`, `bude_whisper_caption` — Natural language audio descriptions
- `empathic_insight_scores` — 59 float scores (40 emotions + 15 attributes + 4 quality)
- `speaker_embedding` — 128-dim speaker embedding vector
- `emotion_vector` — Encoded emotion vector
- `enhancement_model` — Speech enhancement model used (`MossFormer2_SE_48K`)
- `duration` — Audio duration in seconds
**From DS2 (emotion-attribute-conditioning):**
- `text` — Speech transcript
- `caption` — Natural language audio description
- `annotation_scores` — 59 float scores (same dimensions as DS1)
- `target_duration`, `context_duration` — Target and reference durations
- `speaker`, `language` — Speaker ID and language code
**Added by merge pipeline:**
- `_source_dataset` — `"enhanced-emo-snippets-balanced"` or `"emotion-attribute-conditioning"`
- `_dimension` — The emotion/attribute dimension name
- `_bucket` — The bucket label
- `has_reference` — Whether reference audio is available
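Because DS1 and DS2 name some fields differently, downstream code may want a single view of both. The following is a minimal sketch, not the merge pipeline's own code. `normalize_metadata` is a hypothetical helper, and it assumes the score fields are mappings from score name to float, as the Usage snippet treats them.

```python
from typing import Any, Dict


def normalize_metadata(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Return a source-independent view of one sample's JSON metadata."""
    # DS1 stores scores under `empathic_insight_scores`, DS2 under `annotation_scores`.
    scores = meta.get("empathic_insight_scores") or meta.get("annotation_scores") or {}
    # DS1 uses `transcription`, DS2 uses `text`.
    transcript = meta.get("transcription", meta.get("text"))
    # DS1 uses `duration`, DS2 uses `target_duration` for the target clip.
    duration = meta.get("duration", meta.get("target_duration"))
    return {
        "source": meta.get("_source_dataset"),
        "dimension": meta.get("_dimension"),
        "bucket": meta.get("_bucket"),
        "transcript": transcript,
        "duration": duration,
        "scores": scores,
        "has_reference": bool(meta.get("has_reference", False)),
    }
```

Fields that exist in only one source, such as `speaker_embedding` (DS1) or `speaker` and `language` (DS2), are left out of this view on purpose and can still be read from the raw metadata.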
## Quality Scores
All samples include 59 annotation scores from [Empathic Insight Voice Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus):
- **40 emotion scores**: Amusement, Anger, Fear, Sadness, etc.
- **15 attribute scores**: Valence, Arousal, Age, Gender, etc.
- **4 quality scores**: `score_overall_quality`, `score_speech_quality`, `score_content_enjoyment`, `score_background_quality`
Only samples with `score_speech_quality >= 1.8` are included in this dataset.
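Since every shipped sample already meets the 1.8 threshold, the quality scores are mostly useful for applying a stricter cutoff downstream. The sketch below shows one way to do that. `passes_quality` and `filter_by_quality` are hypothetical helpers, and the score mapping is assumed to be keyed by the score names listed above.

```python
from typing import Dict, Iterable, Iterator


def passes_quality(scores: Dict[str, float], min_speech_quality: float = 1.8) -> bool:
    """True if `score_speech_quality` is present and at least the threshold."""
    value = scores.get("score_speech_quality")
    return value is not None and value >= min_speech_quality


def filter_by_quality(
    score_dicts: Iterable[Dict[str, float]], min_speech_quality: float
) -> Iterator[Dict[str, float]]:
    """Yield only the score mappings that meet the stricter threshold."""
    for scores in score_dicts:
        if passes_quality(scores, min_speech_quality):
            yield scores
```

Samples with a missing `score_speech_quality` are rejected here rather than silently accepted; that choice is a design assumption and can be relaxed if needed.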
## Sources
- **DS1**: [TTS-AGI/enhanced-emo-snippets-balanced-DACVAE](https://huggingface.co/datasets/TTS-AGI/enhanced-emo-snippets-balanced-DACVAE) — Quality-ranked emotion/attribute snippets with speech enhancement
- **DS2**: [TTS-AGI/emotion-attribute-conditioning-dacvae](https://huggingface.co/datasets/TTS-AGI/emotion-attribute-conditioning-dacvae) — Emotion/attribute conditioning pairs with speaker references
- **DACVAE**: [mrfakename/dacvae-watermarked](https://huggingface.co/mrfakename/dacvae-watermarked) — DAC-VAE model for audio codec
## Usage
```python
import io
import json

import numpy as np
import soundfile as sf
import webdataset as wds

url = "https://huggingface.co/datasets/TTS-AGI/Emotion-Voice-Attribute-Reference-Snippets-DACVAE-Wave/resolve/main/data/Anger_4to5.tar"

# No .decode() here: with it, webdataset would already parse the JSON and .npy
# entries, and the explicit json.loads / np.load calls below would fail.
ds = wds.WebDataset(url)

for sample in ds:
    meta = json.loads(sample["json"])
    target_wav, sample_rate = sf.read(io.BytesIO(sample["target.wav"]))  # decoded 48 kHz audio
    target_latent = np.load(io.BytesIO(sample["target.npy"]))  # [T, 128] float16
    if "ref.wav" in sample:
        ref_wav, _ = sf.read(io.BytesIO(sample["ref.wav"]))  # speaker reference audio
        ref_latent = np.load(io.BytesIO(sample["ref.npy"]))  # [T, 128] float16
    # Access emotion scores (DS1: `empathic_insight_scores`, DS2: `annotation_scores`)
    scores = meta.get("empathic_insight_scores") or meta.get("annotation_scores", {})
    speech_quality = scores.get("score_speech_quality", 0)
    anger_score = scores.get("Anger", 0)
```
## DACVAE Encode/Decode
Audio was decoded from DAC-VAE latents at 48kHz, 25 latent frames/sec:
```python
import torch
from dacvae import DACVAE
from huggingface_hub import hf_hub_download

model = DACVAE.load(hf_hub_download("mrfakename/dacvae-watermarked", "weights.pth")).cuda().eval()

# Inputs expected by this snippet (not defined here):
#   latent: np.ndarray [T_latent, 128] float16, e.g. a `{key}.target.npy` entry
#   wav:    1-D np.ndarray of mono 48 kHz audio samples

with torch.no_grad():
    # Decode: latent -> audio
    z = torch.from_numpy(latent.T).unsqueeze(0).float().cuda()  # [1, 128, T_latent]
    audio_48k = model.decode(z).squeeze().cpu()  # [T_audio] at 48kHz

    # Encode: audio -> latent
    audio = torch.from_numpy(wav).unsqueeze(0).unsqueeze(0).float().cuda()  # [1, 1, T_audio]
    z_encoded = model.encode(audio)  # [1, 128, T_latent]
    latent = z_encoded.squeeze(0).T.cpu().half().numpy()  # [T_latent, 128] float16
```
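The snippet above moves latents between the stored `[T, 128]` float16 layout and the `[1, 128, T]` float32 layout passed to the model. The same reshaping can be sketched in plain NumPy, which makes it easy to check shapes without a GPU or the model weights. `to_model_layout` and `to_storage_layout` are hypothetical helper names.

```python
import numpy as np

LATENT_DIM = 128


def to_model_layout(latent: np.ndarray) -> np.ndarray:
    """Stored [T, 128] float16 -> model input [1, 128, T] float32."""
    if latent.ndim != 2 or latent.shape[1] != LATENT_DIM:
        raise ValueError(f"expected [T, {LATENT_DIM}], got {latent.shape}")
    return np.ascontiguousarray(latent.T[np.newaxis].astype(np.float32))


def to_storage_layout(z: np.ndarray) -> np.ndarray:
    """Model output [1, 128, T] -> stored [T, 128] float16."""
    if z.ndim != 3 or z.shape[0] != 1 or z.shape[1] != LATENT_DIM:
        raise ValueError(f"expected [1, {LATENT_DIM}, T], got {z.shape}")
    return np.ascontiguousarray(z[0].T.astype(np.float16))
```

Converting float16 to float32 and back is lossless, so a stored latent survives this layout round trip unchanged. Only the model's own encode and decode steps are lossy.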
Provider: TTS-AGI



