ai-music4you3/enhanced-audiosnippets-long-2-8M

Name: ai-music4you3/enhanced-audiosnippets-long-2-8M
Creator: ai-music4you3
Published: 2026-03-17 13:44:57
License: No description provided

Source: Hugging Face | Updated: 2026-03-17 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/ai-music4you3/enhanced-audiosnippets-long-2-8M

Description:

---
license: cc-by-4.0
task_categories:
- audio-classification
- text-to-speech
- automatic-speech-recognition
tags:
- speech-enhancement
- emotion-recognition
- speaker-embeddings
- voice-analysis
size_categories:
- 1M<n<10M
---

# Enhanced Audiosnippets Long 2.8M

Enhanced version of [mitermix/audiosnippets_long_2_8M](https://huggingface.co/datasets/mitermix/audiosnippets_long_2_8M) with speech enhancement, emotion annotations, speaker embeddings, and comprehensive metadata analysis.

## Dataset Summary

| Metric | Value |
|--------|-------|
| Total samples | 2,633,037 |
| Total audio hours | 4,932 h |
| Duration range | 3.0s - 1124.3s |
| Mean duration | 6.7s |
| Audio format | WAV, 48kHz mono |
| Tar files | 1,410 |

## Processing Pipeline

Each audio sample was processed through:

1. **Speech Enhancement** - [ClearerVoice MossFormer2_SE_48K](https://github.com/modelscope/ClearerVoice-Studio) at 48kHz. GPU-optimized pipeline with all STFT/fbank/masking/iSTFT operations on GPU.
2. **BUD-E Whisper Captions** - [laion/BUD-E-Whisper](https://huggingface.co/laion/BUD-E-Whisper) generates audio descriptions from enhanced audio.
3. **Empathic Insight Voice Plus Scores** - [laion/Empathic-Insight-Voice-Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus) provides 59 expert annotations (55 emotion/attribute + 4 quality scores) from Whisper encoder embeddings.
4. **Speaker Embeddings** - [Orange/Speaker-wavLM-tbr](https://huggingface.co/Orange/Speaker-wavLM-tbr) generates 128-dimensional speaker identity vectors.

Processing was done on 8x A100-80GB GPUs in parallel.

## Repository Structure

### Audio Data (`*.tar`)

1,410 tar files containing enhanced WAV audio + JSON metadata pairs. Each tar preserves the original filename structure.
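Reading the WAV/JSON pairs out of one downloaded tar can be sketched with the standard library alone. This is an illustration, not the dataset's official loader: the helper name `iter_pairs` is hypothetical, and it assumes the two members of a pair share a path and differ only in extension (`.wav` / `.json`), which the card does not state explicitly.

```python
import json
import tarfile
from pathlib import PurePosixPath


def iter_pairs(tar_path):
    """Yield (metadata_dict, wav_bytes) for each WAV/JSON pair in a tar.

    Assumption: paired members share a path and differ only in extension.
    """
    pending = {}  # path without extension -> {".wav": bytes, ".json": bytes}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            path = PurePosixPath(member.name)
            suffix = path.suffix.lower()
            if suffix not in (".wav", ".json"):
                continue  # ignore anything that is not audio or metadata
            key = str(path.with_suffix(""))
            slot = pending.setdefault(key, {})
            slot[suffix] = tar.extractfile(member).read()
            if len(slot) == 2:
                # Both halves seen: emit the pair and free the buffer.
                del pending[key]
                yield json.loads(slot[".json"]), slot[".wav"]
```

Members may appear in either order inside the tar, so the sketch buffers one half until its partner shows up rather than assuming the WAV always comes first.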
**JSON metadata fields per sample:**

| Field | Description |
|-------|-------------|
| `sample_id` | Unique identifier |
| `folder_name` | Source batch name |
| `duration` | Duration in seconds |
| `caption` | Original audio caption |
| `transcription` | Original ASR transcription |
| `emotion_vector` | 55-char emotion encoding string |
| `detailed_caption` | Detailed emotion narrative |
| `bude_whisper_caption` | BUD-E Whisper description of enhanced audio |
| `empathic_insight_scores` | Dict of 59 continuous scores (see below) |
| `speaker_embedding` | 128-dim speaker identity vector |
| `enhancement_model` | "MossFormer2_SE_48K" |
| `enhanced_sample_rate` | 48000 |
| 40 emotion fields | Original integer annotations (0-4 scale) |
| 15 attribute fields | Original integer annotations |

### Metadata (`metadata/`)

Pre-extracted metadata in Parquet format for efficient analysis without downloading audio.

#### `metadata/merged/all_metadata.parquet`

Single master file with all 2,633,037 samples. Columns include all JSON fields plus:

- `eis_*` - Empathic Insight scores (59 columns, continuous float values)
- `spk_0` through `spk_127` - Speaker embedding dimensions
- `tar_file` - Source tar filename
- `audio_filename` - Audio filename within tar
- `tar_url` - Direct URL to download the tar

#### `metadata/parquet_per_tar/`

1,410 individual parquet files (one per tar) for incremental processing.

#### `metadata/indices/`

Pre-built [FAISS](https://github.com/facebookresearch/faiss) indices for nearest-neighbor retrieval:

| File | Description |
|------|-------------|
| `speaker_embeddings.index` | FAISS IndexFlatIP on L2-normalized 128-dim speaker embeddings (cosine similarity). 2,633,036 vectors. |
| `speaker_embeddings_ids.npy` | Row indices mapping FAISS positions to master parquet rows |
| `emotion_scores.index` | FAISS IndexFlatIP on L2-normalized 40-dim emotion score vectors (cosine similarity). 2,633,037 vectors. |
| `emotion_scores_ids.npy` | Row indices mapping FAISS positions to master parquet rows |

**Usage example:**

```python
import faiss
import numpy as np
import pandas as pd

# Load index and metadata
index = faiss.read_index("metadata/indices/speaker_embeddings.index")
ids = np.load("metadata/indices/speaker_embeddings_ids.npy")
df = pd.read_parquet("metadata/merged/all_metadata.parquet")

# Query: find 10 most similar speakers to sample 0
spk_cols = [f"spk_{i}" for i in range(128)]
query = df.iloc[0][spk_cols].to_numpy(dtype=np.float32)
query = query / np.linalg.norm(query)  # normalize for cosine similarity

# IndexFlatIP returns inner products, i.e. cosine similarities here
# (higher is closer). The top hit is normally the query sample itself.
similarities, indices = index.search(query.reshape(1, -1), 10)

# Map FAISS positions back to dataframe rows
similar_samples = df.iloc[ids[indices[0]]]
print(similar_samples[["sample_id", "tar_file", "audio_filename"]])
```

### Emotion Subsets (`metadata/emotion_subsets/`)

40 parquet files, one per emotion category. Each contains up to 5,000 samples with the highest Empathic Insight scores (threshold: >= 2.0) for that emotion.

Each subset includes **speaker reference lookups** - for every sample, pre-computed nearest neighbors in speaker embedding space:

| Reference Column | Description |
|-----------------|-------------|
| `ref_most_similar` | JSON: nearest speaker match (highest cosine similarity, excluding near-duplicates >= 0.99) |
| `ref_similar_emotions` | JSON: speaker match with most similar emotion profile |
| `ref_dissimilar_emotions` | JSON: speaker match with most dissimilar emotion profile (same voice, different mood) |

Each reference contains: `sample_id`, `tar_file`, `audio_filename`, `tar_url`, `speaker_similarity`.
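Following one of these references back to its audio could look like the sketch below. The helper `load_reference_audio` and its `tar_dir` parameter are hypothetical; the code assumes the tar named in `tar_file` has already been downloaded (for example from the reference's `tar_url`) and matches the WAV by basename, since the directory layout inside the tar is not documented here.

```python
import json
import os
import tarfile
from pathlib import PurePosixPath


def load_reference_audio(ref_json, tar_dir="."):
    """Return (reference_dict, wav_bytes) for one ref_* column value.

    Assumption: the tar listed in the reference is present in tar_dir.
    """
    ref = json.loads(ref_json)
    tar_path = os.path.join(tar_dir, ref["tar_file"])
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            # Compare basenames only; members may sit inside subfolders.
            name = PurePosixPath(member.name).name
            if member.isfile() and name == ref["audio_filename"]:
                return ref, tar.extractfile(member).read()
    raise FileNotFoundError(f"{ref['audio_filename']} not found in {tar_path}")
```

The returned bytes are the raw WAV file, so they can be written to disk or wrapped in `io.BytesIO` for whichever audio reader is in use.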
**Emotion categories and sample counts:**

| Emotion | Samples | | Emotion | Samples |
|---------|--------:|-|---------|--------:|
| Affection | 5,000 | | Interest | 5,000 |
| Amusement | 5,000 | | Intoxication/Altered States | 5,000 |
| Anger | 5,000 | | Jealousy & Envy | 3,706 |
| Astonishment/Surprise | 5,000 | | Longing | 4,600 |
| Awe | 2,791 | | Malevolence/Malice | 2,484 |
| Bitterness | 969 | | Pain | 5,000 |
| Concentration | 5,000 | | Pleasure/Ecstasy | 2,890 |
| Confusion | 4,568 | | Pride | 4,741 |
| Contemplation | 5,000 | | Relief | 5,000 |
| Contempt | 5,000 | | Sadness | 5,000 |
| Contentment | 2,180 | | Sexual Lust | 5,000 |
| Disappointment | 5,000 | | Shame | 3,006 |
| Disgust | 637 | | Sourness | 351 |
| Distress | 5,000 | | Teasing | 886 |
| Doubt | 239 | | Thankfulness/Gratitude | 5,000 |
| Elation | 5,000 | | Triumph | 5,000 |
| Embarrassment | 457 | | Fatigue/Exhaustion | 5,000 |
| Emotional Numbness | 1,079 | | Helplessness | 5,000 |
| Fear | 3,353 | | Hope/Enthusiasm/Optimism | 5,000 |
| Impatience/Irritability | 5,000 | | Infatuation | 5,000 |

**Usage example:**

```python
import json

import pandas as pd

df = pd.read_parquet("metadata/emotion_subsets/Sadness.parquet")
sample = df.iloc[0]

# Get the most similar speaker with different emotions
ref = json.loads(sample["ref_dissimilar_emotions"])
print(f"Same voice, different mood: {ref['tar_file']}/{ref['audio_filename']}")
print(f"Speaker similarity: {ref['speaker_similarity']}")
```

### Attribute Buckets (`metadata/attribute_subsets/`)

100 parquet files covering 15 voice/audio attribute dimensions, each divided into 7 equal-range buckets with up to 2,000 randomly sampled files per bucket. Each file also includes speaker reference lookups.

**Attribute dimensions:**

| Attribute | Range | Description |
|-----------|-------|-------------|
| Valence | -3 to 3 | Emotional positivity/negativity |
| Arousal | 0 to 4 | Energy/activation level |
| Submissive vs. Dominant | -3 to 3 | Assertiveness spectrum |
| Age | 0 to 6 | Perceived speaker age |
| Gender | -2 to 2 | Perceived gender expression |
| Serious vs. Humorous | 0 to 4 | Tone spectrum |
| Vulnerable vs. Emotionally Detached | 0 to 4 | Emotional openness |
| Confident vs. Hesitant | 0 to 4 | Confidence spectrum |
| Warm vs. Cold | -2 to 2 | Interpersonal warmth |
| Monotone vs. Expressive | 0 to 4 | Prosodic variation |
| High-Pitched vs. Low-Pitched | 0 to 4 | Pitch range |
| Soft vs. Harsh | -2 to 2 | Voice texture |
| Authenticity | 0 to 4 | Perceived genuineness |
| Recording Quality | 0 to 4 | Audio fidelity |
| Background Noise | 0 to 3 | Noise level |

**Filename format:** `{Attribute}_bucket{N}_{low}_to_{high}.parquet`

**Usage example:**

```python
import pandas as pd

# Get very expressive speakers (top bucket of the 0 to 4 range)
df = pd.read_parquet("metadata/attribute_subsets/Monotone_vs._Expressive_bucket6_3.4_to_4.0.parquet")
print(f"{len(df)} highly expressive samples")
```

## Empathic Insight Scores (59 dimensions)

The `empathic_insight_scores` field (and `eis_*` parquet columns) contains continuous float predictions from 59 MLP experts trained on Whisper encoder embeddings:

### 40 Emotion Scores

Amusement, Elation, Pleasure/Ecstasy, Contentment, Thankfulness/Gratitude, Affection, Infatuation, Hope/Enthusiasm/Optimism, Triumph, Pride, Interest, Awe, Astonishment/Surprise, Concentration, Contemplation, Relief, Longing, Teasing, Impatience and Irritability, Sexual Lust, Doubt, Fear, Distress, Confusion, Embarrassment, Shame, Disappointment, Sadness, Bitterness, Contempt, Disgust, Anger, Malevolence/Malice, Sourness, Pain, Helplessness, Fatigue/Exhaustion, Emotional Numbness, Intoxication/Altered States, Jealousy & Envy

### 15 Attribute Scores

Valence, Arousal, Submissive vs. Dominant, Age, Gender, Serious vs. Humorous, Vulnerable vs. Emotionally Detached, Confident vs. Hesitant, Warm vs. Cold, Monotone vs. Expressive, High-Pitched vs. Low-Pitched, Soft vs. Harsh, Authenticity, Recording Quality, Background Noise

### 4 Quality Scores

`score_overall_quality`, `score_speech_quality`, `score_background_quality`, `score_content_enjoyment`

## Speaker Reference System

The emotion subset and attribute bucket parquet files include pre-computed speaker references that enable finding voice-similar samples with different emotional/attribute profiles.

For each sample, three references are provided:

1. **`ref_most_similar`** - The closest speaker match by cosine similarity in the 128-dim speaker embedding space (excluding near-duplicates with similarity >= 0.99).
2. **`ref_similar_emotions`** - Among speaker-similar candidates (cosine >= 0.9, or top 10 if none reach 0.9), the one with the most similar emotion profile (highest cosine similarity across 40 emotion dimensions).
3. **`ref_dissimilar_emotions`** - Among speaker-similar candidates, the one with the most different emotion profile. This enables finding the **same voice expressing different emotions** - useful for emotion transfer, voice conversion, and contrastive learning.

Each reference is a JSON string containing:

```json
{
  "sample_id": "batch100_part0_...",
  "tar_file": "batch100_part0.tar",
  "audio_filename": "batch100_part0_..._chunk_1893_1_1552911.wav",
  "tar_url": "https://huggingface.co/datasets/ai-music4you3/enhanced-audiosnippets-long-2-8M/resolve/main/batch100_part0.tar",
  "speaker_similarity": 0.9523
}
```

## Source Dataset

Enhanced from [mitermix/audiosnippets_long_2_8M](https://huggingface.co/datasets/mitermix/audiosnippets_long_2_8M). Original metadata (captions, transcriptions, emotion vectors, detailed captions) is preserved alongside new annotations.
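The three-reference selection described under Speaker Reference System can be approximated in NumPy as below. This is a brute-force sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, every row is assumed to have non-zero norm, and near-duplicates (>= 0.99) are assumed to be excluded from the emotion candidates as well as from `ref_most_similar`, which the card does not spell out.

```python
import numpy as np


def unit_rows(x):
    """L2-normalize each row so that dot products equal cosine similarity."""
    x = np.asarray(x, dtype=np.float32)
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def pick_references(i, speakers, emotions,
                    dup_thresh=0.99, cand_thresh=0.9, fallback_k=10):
    """Return row indices of the three references for sample i."""
    spk = unit_rows(speakers)   # (n, 128) in the real data
    emo = unit_rows(emotions)   # (n, 40) in the real data
    spk_sim = spk @ spk[i]

    valid = spk_sim < dup_thresh    # drop near-duplicates
    valid[i] = False                # and the query itself
    idx = np.flatnonzero(valid)
    if idx.size == 0:
        raise ValueError("no usable neighbours for this sample")
    order = idx[np.argsort(-spk_sim[idx])]   # most similar speaker first

    # Speaker-similar candidates: cosine >= 0.9, else fall back to the top 10.
    candidates = order[spk_sim[order] >= cand_thresh]
    if candidates.size == 0:
        candidates = order[:fallback_k]

    emo_sim = emo[candidates] @ emo[i]
    return {
        "ref_most_similar": int(order[0]),
        "ref_similar_emotions": int(candidates[np.argmax(emo_sim)]),
        "ref_dissimilar_emotions": int(candidates[np.argmin(emo_sim)]),
    }
```

At the dataset's scale a full similarity vector per query is affordable for a handful of samples, but batch use would more plausibly go through the FAISS indices in `metadata/indices/`.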
## Models Used

| Model | Purpose | Reference |
|-------|---------|-----------|
| [ClearerVoice MossFormer2_SE_48K](https://github.com/modelscope/ClearerVoice-Studio) | Speech Enhancement (48kHz) | MossFormer2 |
| [laion/BUD-E-Whisper](https://huggingface.co/laion/BUD-E-Whisper) | Audio captioning | Whisper-based |
| [laion/Empathic-Insight-Voice-Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus) | 59 emotion/quality scores | MLP on Whisper embeddings |
| [Orange/Speaker-wavLM-tbr](https://huggingface.co/Orange/Speaker-wavLM-tbr) | 128-dim speaker embeddings | WavLM-based |

Provider:

ai-music4you3
