ai-music4you3/enhanced-audiosnippets-long-2-8M
Source: Hugging Face | Updated 2026-03-17 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ai-music4you3/enhanced-audiosnippets-long-2-8M
Resource description:
---
license: cc-by-4.0
task_categories:
- audio-classification
- text-to-speech
- automatic-speech-recognition
tags:
- speech-enhancement
- emotion-recognition
- speaker-embeddings
- voice-analysis
size_categories:
- 1M<n<10M
---
# Enhanced Audiosnippets Long 2.8M
Enhanced version of [mitermix/audiosnippets_long_2_8M](https://huggingface.co/datasets/mitermix/audiosnippets_long_2_8M) with speech enhancement, emotion annotations, speaker embeddings, and comprehensive metadata analysis.
## Dataset Summary
| Metric | Value |
|--------|-------|
| Total samples | 2,633,037 |
| Total audio hours | 4,932 h |
| Duration range | 3.0s - 1124.3s |
| Mean duration | 6.7s |
| Audio format | WAV, 48kHz mono |
| Tar files | 1,410 |
## Processing Pipeline
Each audio sample was processed through:
1. **Speech Enhancement** - [ClearerVoice MossFormer2_SE_48K](https://github.com/modelscope/ClearerVoice-Studio) at 48kHz. The pipeline is GPU-optimized: all STFT, fbank, masking, and iSTFT operations run on the GPU.
2. **BUD-E Whisper Captions** - [laion/BUD-E-Whisper](https://huggingface.co/laion/BUD-E-Whisper) generates audio descriptions from enhanced audio.
3. **Empathic Insight Voice Plus Scores** - [laion/Empathic-Insight-Voice-Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus) provides 59 expert annotations (55 emotion/attribute + 4 quality scores) from Whisper encoder embeddings.
4. **Speaker Embeddings** - [Orange/Speaker-wavLM-tbr](https://huggingface.co/Orange/Speaker-wavLM-tbr) generates 128-dimensional speaker identity vectors.
Processing was done on 8x A100-80GB GPUs in parallel.
## Repository Structure
### Audio Data (`*.tar`)
1,410 tar files containing enhanced WAV audio + JSON metadata pairs. Each tar preserves the original filename structure.
**JSON metadata fields per sample:**
| Field | Description |
|-------|-------------|
| `sample_id` | Unique identifier |
| `folder_name` | Source batch name |
| `duration` | Duration in seconds |
| `caption` | Original audio caption |
| `transcription` | Original ASR transcription |
| `emotion_vector` | 55-char emotion encoding string |
| `detailed_caption` | Detailed emotion narrative |
| `bude_whisper_caption` | BUD-E Whisper description of enhanced audio |
| `empathic_insight_scores` | Dict of 59 continuous scores (see below) |
| `speaker_embedding` | 128-dim speaker identity vector |
| `enhancement_model` | "MossFormer2_SE_48K" |
| `enhanced_sample_rate` | 48000 |
| 40 emotion fields | Original integer annotations (0-4 scale) |
| 15 attribute fields | Original integer annotations |
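
To make the tar layout concrete, here is a minimal sketch of reading WAV/JSON pairs with Python's standard `tarfile` module. It assumes that a sample's audio and metadata members share the same path stem (for example `x.wav` and `x.json`), which this card does not state explicitly. `iter_pairs` is a hypothetical helper name, not part of any dataset tooling.

```python
import json
import tarfile
from pathlib import PurePosixPath


def iter_pairs(tar_fileobj):
    """Yield (stem, metadata_dict, wav_bytes) for each WAV/JSON pair in a tar.

    Assumption: the WAV and JSON members of one sample share a path stem.
    Members without a partner are silently skipped.
    """
    pending = {}  # stem -> partial entry waiting for its partner
    with tarfile.open(fileobj=tar_fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            path = PurePosixPath(member.name)
            suffix = path.suffix.lower()
            if suffix not in (".wav", ".json"):
                continue
            data = tar.extractfile(member).read()
            stem = str(path.with_suffix(""))
            entry = pending.setdefault(stem, {})
            entry[suffix] = json.loads(data) if suffix == ".json" else data
            if len(entry) == 2:  # both halves of the pair have been seen
                del pending[stem]
                yield stem, entry[".json"], entry[".wav"]
```

Opening a downloaded tar in binary mode (`open(path, "rb")`) and passing the file object to `iter_pairs` should walk through one archive member by member without unpacking it to disk.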
### Metadata (`metadata/`)
Pre-extracted metadata in Parquet format for efficient analysis without downloading audio.
#### `metadata/merged/all_metadata.parquet`
Single master file with all 2,633,037 samples. Columns include all JSON fields plus:
- `eis_*` - Empathic Insight scores (59 columns, continuous float values)
- `spk_0` through `spk_127` - Speaker embedding dimensions
- `tar_file` - Source tar filename
- `audio_filename` - Audio filename within tar
- `tar_url` - Direct URL to download the tar
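
The master file carries 59 `eis_*` and 128 `spk_*` columns on top of the JSON fields, so it can help to read only what an analysis needs. The sketch below builds such a column list from the prefixes named above. `select_columns` and its parameters are illustrative, and the default `base` columns are just one plausible choice.

```python
def select_columns(all_names,
                   base=("sample_id", "tar_file", "audio_filename"),
                   with_scores=False,
                   with_speaker=False):
    """Pick a subset of master-parquet column names by prefix.

    all_names: the column names available in the parquet file.
    Returns base columns that exist, then optional eis_* and spk_* columns.
    """
    wanted = [name for name in base if name in all_names]
    if with_scores:
        wanted += [n for n in all_names if n.startswith("eis_")]
    if with_speaker:
        # Sort numerically so that spk_10 does not come before spk_2.
        spk = [n for n in all_names if n.startswith("spk_") and n[4:].isdigit()]
        wanted += sorted(spk, key=lambda n: int(n[4:]))
    return wanted
```

With the real file, the available names could be read from `pyarrow.parquet.read_schema(path).names`, and the returned list passed as `columns=` to `pd.read_parquet`.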
#### `metadata/parquet_per_tar/`
1,410 individual parquet files (one per tar) for incremental processing.
#### `metadata/indices/`
Pre-built [FAISS](https://github.com/facebookresearch/faiss) indices for nearest-neighbor retrieval:
| File | Description |
|------|-------------|
| `speaker_embeddings.index` | FAISS IndexFlatIP on L2-normalized 128-dim speaker embeddings (cosine similarity). 2,633,036 vectors. |
| `speaker_embeddings_ids.npy` | Row indices mapping FAISS positions to master parquet rows |
| `emotion_scores.index` | FAISS IndexFlatIP on L2-normalized 40-dim emotion score vectors (cosine similarity). 2,633,037 vectors. |
| `emotion_scores_ids.npy` | Row indices mapping FAISS positions to master parquet rows |
**Usage example:**
```python
import faiss
import numpy as np
import pandas as pd
# Load index and metadata
index = faiss.read_index("metadata/indices/speaker_embeddings.index")
ids = np.load("metadata/indices/speaker_embeddings_ids.npy")
df = pd.read_parquet("metadata/merged/all_metadata.parquet")
# Query: find the 10 nearest speakers to sample 0 (the query sample itself will usually be the top hit)
query = df.iloc[0][[f"spk_{i}" for i in range(128)]].values.astype(np.float32)
query = query / np.linalg.norm(query)  # L2-normalize so inner product equals cosine similarity
# IndexFlatIP returns inner products, so larger values mean more similar
similarities, indices = index.search(query.reshape(1, -1), 10)
# Map FAISS positions back to master parquet rows
similar_samples = df.iloc[ids[indices[0]]]
print(similar_samples[["sample_id", "tar_file", "audio_filename"]])
```
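
The indices rely on the identity that the inner product of two L2-normalized vectors equals their cosine similarity. The brute-force NumPy sketch below illustrates what an exact inner-product search computes on a small array. It is not the FAISS implementation, and `l2_normalize` and `top_k_cosine` are illustrative names.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against zero vectors)."""
    x = np.asarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def top_k_cosine(query, database, k):
    """Exact top-k search by inner product on normalized rows.

    Returns (similarities, row_indices), most similar first.
    """
    q = l2_normalize(query).reshape(1, -1)
    db = l2_normalize(database)
    sims = (db @ q.T).ravel()  # inner product == cosine after normalization
    order = np.argsort(-sims, kind="stable")[:k]
    return sims[order], order
```

This scan is linear in the number of vectors per query, which is fine for a few thousand rows; for the full 2.6M-vector collection the pre-built FAISS indices are the practical route.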
### Emotion Subsets (`metadata/emotion_subsets/`)
40 parquet files, one per emotion category. Each contains up to 5,000 samples with the highest Empathic Insight scores (threshold: >= 2.0) for that emotion.
Each subset includes **speaker reference lookups** - for every sample, pre-computed nearest neighbors in speaker embedding space:
| Reference Column | Description |
|-----------------|-------------|
| `ref_most_similar` | JSON: nearest speaker match (highest cosine similarity, excluding near-duplicates >= 0.99) |
| `ref_similar_emotions` | JSON: speaker match with most similar emotion profile |
| `ref_dissimilar_emotions` | JSON: speaker match with most dissimilar emotion profile (same voice, different mood) |
Each reference contains: `sample_id`, `tar_file`, `audio_filename`, `tar_url`, `speaker_similarity`.
**Emotion categories and sample counts:**
| Emotion | Samples | | Emotion | Samples |
|---------|--------:|-|---------|--------:|
| Affection | 5,000 | | Interest | 5,000 |
| Amusement | 5,000 | | Intoxication/Altered States | 5,000 |
| Anger | 5,000 | | Jealousy & Envy | 3,706 |
| Astonishment/Surprise | 5,000 | | Longing | 4,600 |
| Awe | 2,791 | | Malevolence/Malice | 2,484 |
| Bitterness | 969 | | Pain | 5,000 |
| Concentration | 5,000 | | Pleasure/Ecstasy | 2,890 |
| Confusion | 4,568 | | Pride | 4,741 |
| Contemplation | 5,000 | | Relief | 5,000 |
| Contempt | 5,000 | | Sadness | 5,000 |
| Contentment | 2,180 | | Sexual Lust | 5,000 |
| Disappointment | 5,000 | | Shame | 3,006 |
| Disgust | 637 | | Sourness | 351 |
| Distress | 5,000 | | Teasing | 886 |
| Doubt | 239 | | Thankfulness/Gratitude | 5,000 |
| Elation | 5,000 | | Triumph | 5,000 |
| Embarrassment | 457 | | Fatigue/Exhaustion | 5,000 |
| Emotional Numbness | 1,079 | | Helplessness | 5,000 |
| Fear | 3,353 | | Hope/Enthusiasm/Optimism | 5,000 |
| Impatience/Irritability | 5,000 | | Infatuation | 5,000 |
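
The selection rule described for these subsets (score >= 2.0, highest scores first, at most 5,000 samples) can be sketched as follows. This is an illustration of the stated rule, not the script that produced the files; `select_emotion_subset` is a hypothetical name. Under this reading, categories with fewer than 5,000 rows would simply be those where fewer samples reached the threshold.

```python
import heapq


def select_emotion_subset(scores, threshold=2.0, max_samples=5000):
    """Return sample ids for one emotion, highest score first.

    scores: dict mapping sample_id -> Empathic Insight score for the emotion.
    Only samples with score >= threshold are eligible; the result is capped
    at max_samples.
    """
    eligible = ((score, sid) for sid, score in scores.items() if score >= threshold)
    top = heapq.nlargest(max_samples, eligible)
    return [sid for _, sid in top]
```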
**Usage example:**
```python
import json

import pandas as pd
df = pd.read_parquet("metadata/emotion_subsets/Sadness.parquet")
sample = df.iloc[0]
# Get the most similar speaker with different emotions
ref = json.loads(sample["ref_dissimilar_emotions"])
print(f"Same voice, different mood: {ref['tar_file']}/{ref['audio_filename']}")
print(f"Speaker similarity: {ref['speaker_similarity']}")
```
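
Building on the example above, the reference columns can be turned into "same voice, different mood" pairs for contrastive training. The sketch below works on plain dict records (for a DataFrame, `df.to_dict("records")` yields them). It assumes each subset row carries `tar_file` and `audio_filename` columns like the master file does; `contrastive_pairs` and the default similarity floor are illustrative choices.

```python
import json


def contrastive_pairs(rows, min_speaker_similarity=0.9):
    """Collect (anchor, contrast) audio locations from subset rows.

    rows: iterable of dict-like records with 'tar_file', 'audio_filename'
          and a 'ref_dissimilar_emotions' JSON string.
    References below min_speaker_similarity are dropped, since a weak
    speaker match may not be the same voice.
    """
    pairs = []
    for row in rows:
        raw = row.get("ref_dissimilar_emotions")
        if not raw:
            continue  # no reference stored for this sample
        ref = json.loads(raw)
        if ref["speaker_similarity"] < min_speaker_similarity:
            continue
        anchor = (row["tar_file"], row["audio_filename"])
        contrast = (ref["tar_file"], ref["audio_filename"])
        pairs.append((anchor, contrast))
    return pairs
```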
### Attribute Buckets (`metadata/attribute_subsets/`)
100 parquet files covering 15 voice/audio attribute dimensions, each divided into 7 equal-range buckets with up to 2,000 randomly sampled files per bucket. Each file also includes speaker reference lookups.
**Attribute dimensions:**
| Attribute | Range | Description |
|-----------|-------|-------------|
| Valence | -3 to 3 | Emotional positivity/negativity |
| Arousal | 0 to 4 | Energy/activation level |
| Submissive vs. Dominant | -3 to 3 | Assertiveness spectrum |
| Age | 0 to 6 | Perceived speaker age |
| Gender | -2 to 2 | Perceived gender expression |
| Serious vs. Humorous | 0 to 4 | Tone spectrum |
| Vulnerable vs. Emotionally Detached | 0 to 4 | Emotional openness |
| Confident vs. Hesitant | 0 to 4 | Confidence spectrum |
| Warm vs. Cold | -2 to 2 | Interpersonal warmth |
| Monotone vs. Expressive | 0 to 4 | Prosodic variation |
| High-Pitched vs. Low-Pitched | 0 to 4 | Pitch range |
| Soft vs. Harsh | -2 to 2 | Voice texture |
| Authenticity | 0 to 4 | Perceived genuineness |
| Recording Quality | 0 to 4 | Audio fidelity |
| Background Noise | 0 to 3 | Noise level |
**Filename format:** `{Attribute}_bucket{N}_{low}_to_{high}.parquet`
**Usage example:**
```python
import pandas as pd

# Load the top bucket (values 3.4 to 4.0) of the Monotone vs. Expressive attribute
df = pd.read_parquet("metadata/attribute_subsets/Monotone_vs._Expressive_bucket6_3.4_to_4.0.parquet")
print(f"{len(df)} highly expressive samples")
```
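
The bucket boundaries implied by "equal-range" can be worked out from each attribute's range. The sketch below assumes a zero-based bucket index, edges formatted to one decimal place, and spaces in the attribute name replaced by underscores. These conventions are inferred from the single example filename above and may not hold for every file; `bucket_edges` and `bucket_filename` are hypothetical helpers.

```python
def bucket_edges(low, high, n_buckets=7):
    """Return (lo, hi) edges for n equal-width buckets spanning [low, high]."""
    span = high - low
    return [(low + span * i / n_buckets, low + span * (i + 1) / n_buckets)
            for i in range(n_buckets)]


def bucket_filename(attribute, low, high, index, n_buckets=7):
    """Build a filename following {Attribute}_bucket{N}_{low}_to_{high}.parquet.

    Assumptions: zero-based index, one-decimal edges, spaces -> underscores.
    """
    lo, hi = bucket_edges(low, high, n_buckets)[index]
    name = attribute.replace(" ", "_")
    return f"{name}_bucket{index}_{lo:.1f}_to_{hi:.1f}.parquet"


print(bucket_filename("Monotone vs. Expressive", 0, 4, 6))
# -> Monotone_vs._Expressive_bucket6_3.4_to_4.0.parquet
```

Because the exact formatting of negative or fractional edges is not documented here, listing the directory is a safer way to discover real filenames than generating them.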
## Empathic Insight Scores (59 dimensions)
The `empathic_insight_scores` field (and `eis_*` parquet columns) contains continuous float predictions from 59 MLP experts trained on Whisper encoder embeddings:
### 40 Emotion Scores
Amusement, Elation, Pleasure/Ecstasy, Contentment, Thankfulness/Gratitude, Affection, Infatuation, Hope/Enthusiasm/Optimism, Triumph, Pride, Interest, Awe, Astonishment/Surprise, Concentration, Contemplation, Relief, Longing, Teasing, Impatience and Irritability, Sexual Lust, Doubt, Fear, Distress, Confusion, Embarrassment, Shame, Disappointment, Sadness, Bitterness, Contempt, Disgust, Anger, Malevolence/Malice, Sourness, Pain, Helplessness, Fatigue/Exhaustion, Emotional Numbness, Intoxication/Altered States, Jealousy & Envy
### 15 Attribute Scores
Valence, Arousal, Submissive vs. Dominant, Age, Gender, Serious vs. Humorous, Vulnerable vs. Emotionally Detached, Confident vs. Hesitant, Warm vs. Cold, Monotone vs. Expressive, High-Pitched vs. Low-Pitched, Soft vs. Harsh, Authenticity, Recording Quality, Background Noise
### 4 Quality Scores
`score_overall_quality`, `score_speech_quality`, `score_background_quality`, `score_content_enjoyment`
## Speaker Reference System
The emotion subset and attribute bucket parquet files include pre-computed speaker references that enable finding voice-similar samples with different emotional/attribute profiles. For each sample, three references are provided:
1. **`ref_most_similar`** - The closest speaker match by cosine similarity in the 128-dim speaker embedding space (excluding near-duplicates with similarity >= 0.99).
2. **`ref_similar_emotions`** - Among speaker-similar candidates (cosine >= 0.9, or top 10 if none reach 0.9), the one with the most similar emotion profile (highest cosine similarity across 40 emotion dimensions).
3. **`ref_dissimilar_emotions`** - Among speaker-similar candidates, the one with the most different emotion profile. This enables finding the **same voice expressing different emotions** - useful for emotion transfer, voice conversion, and contrastive learning.
Each reference is a JSON string containing:
```json
{
"sample_id": "batch100_part0_...",
"tar_file": "batch100_part0.tar",
"audio_filename": "batch100_part0_..._chunk_1893_1_1552911.wav",
"tar_url": "https://huggingface.co/datasets/ai-music4you3/enhanced-audiosnippets-long-2-8M/resolve/main/batch100_part0.tar",
"speaker_similarity": 0.9523
}
```
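
The three selection rules described in this section can be sketched with NumPy as below. This is a brute-force illustration, not the code used to build the files. It assumes that the near-duplicate filter (similarity >= 0.99) applies to all three references, which the text only states for `ref_most_similar`; `pick_references` and its parameter names are hypothetical.

```python
import numpy as np


def _normalize(x):
    """L2-normalize rows so that dot products are cosine similarities."""
    x = np.asarray(x, dtype=np.float32)
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)


def pick_references(i, speaker, emotion,
                    sim_floor=0.9, dup_ceiling=0.99, fallback_k=10):
    """Return row indices (most_similar, similar_emotions, dissimilar_emotions).

    speaker: (N, 128) speaker embeddings; emotion: (N, 40) emotion scores.
    Returns None when no other sample survives the near-duplicate filter.
    """
    spk = _normalize(speaker)
    emo = _normalize(emotion)
    spk_sim = spk @ spk[i]

    # Drop the sample itself and near-duplicates (assumed for all references).
    valid = spk_sim < dup_ceiling
    valid[i] = False
    idx = np.flatnonzero(valid)
    if idx.size == 0:
        return None

    # Rank remaining samples by speaker similarity, best first.
    ranked = idx[np.argsort(-spk_sim[idx], kind="stable")]
    most_similar = int(ranked[0])

    # Speaker-similar candidates: cosine >= 0.9, else the top 10.
    candidates = ranked[spk_sim[ranked] >= sim_floor]
    if candidates.size == 0:
        candidates = ranked[:fallback_k]

    # Compare emotion profiles by cosine similarity over the emotion scores.
    emo_sim = emo[candidates] @ emo[i]
    return (most_similar,
            int(candidates[np.argmax(emo_sim)]),
            int(candidates[np.argmin(emo_sim)]))
```

Each call scans every row, so applying it across millions of samples would be slow; an index lookup for the speaker-similar candidates followed by the emotion comparison is the more scalable shape of the same idea.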
## Source Dataset
Enhanced from [mitermix/audiosnippets_long_2_8M](https://huggingface.co/datasets/mitermix/audiosnippets_long_2_8M). Original metadata (captions, transcriptions, emotion vectors, detailed captions) is preserved alongside new annotations.
## Models Used
| Model | Purpose | Reference |
|-------|---------|-----------|
| [ClearerVoice MossFormer2_SE_48K](https://github.com/modelscope/ClearerVoice-Studio) | Speech Enhancement (48kHz) | MossFormer2 |
| [laion/BUD-E-Whisper](https://huggingface.co/laion/BUD-E-Whisper) | Audio captioning | Whisper-based |
| [laion/Empathic-Insight-Voice-Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus) | 59 emotion/quality scores | MLP on Whisper embeddings |
| [Orange/Speaker-wavLM-tbr](https://huggingface.co/Orange/Speaker-wavLM-tbr) | 128-dim speaker embeddings | WavLM-based |
Provider:
ai-music4you3



