
laion/reference-voices-enhanced

Source: Hugging Face. Updated 2026-03-17; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/laion/reference-voices-enhanced
Resource description:
---
license: cc-by-4.0
task_categories:
- audio-classification
- text-to-speech
tags:
- voice
- speaker-embedding
- deduplicated
- emotion
- quality-filtered
- speech-enhancement
- clearervoice
pretty_name: Reference Voices Enhanced
size_categories:
- 1K<n<10K
---

# Reference Voices Enhanced

2,004 AI voice samples enhanced with [ClearerVoice-Studio](https://github.com/modelscope/ClearerVoice-Studio) MossFormer2_SE_48K speech enhancement, annotated with [Empathic Insight Voice Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus) (59 quality + emotion scores).

## Dataset Summary

- **Source**: [laion/ai-voices-deduplicated](https://huggingface.co/datasets/laion/ai-voices-deduplicated) (2,004 speaker-deduplicated, quality-filtered AI voice samples)
- **Speech Enhancement**: ClearerVoice MossFormer2_SE_48K, for background noise removal and speech clarity improvement
- **Output Format**: Enhanced WAV files at 48kHz (replacing original MP3s)
- **Annotations**: Full Empathic Insight Voice Plus scores (59 dimensions: 55 emotion scores + 4 quality scores)
- **Metadata**: Updated JSON sidecar per sample with all annotation scores
- **Packaging**: Single tar file

## Enhancement Pipeline

1. **Source data**: 2,004 samples from `laion/ai-voices-deduplicated`, organized by gender (male/female/androgynous) and age category (child/teenager/young_adult/adult/elderly)
2. **Speech enhancement**: Each audio sample processed through [ClearerVoice-Studio](https://github.com/modelscope/ClearerVoice-Studio) `MossFormer2_SE_48K` model for noise suppression and speech clarity improvement at 48kHz
3. **Format conversion**: Original MP3 files replaced with enhanced WAV files at 48kHz sample rate
4. **Emotion annotation**: All enhanced samples annotated with [Empathic Insight Voice Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus), providing 59 scores per sample
5. **Metadata update**: JSON sidecar files updated with all annotation scores

## Structure

```
reference-voices-enhanced.tar
├── male/
│   ├── 02_child/        (1 sample)
│   ├── 03_teenager/     (5 samples)
│   ├── 04_young_adult/  (217 samples)
│   ├── 05_adult/        (813 samples)
│   └── 08_elderly/      (1 sample)
├── female/
│   ├── 02_child/        (16 samples)
│   ├── 03_teenager/     (3 samples)
│   ├── 04_young_adult/  (376 samples)
│   ├── 05_adult/        (514 samples)
│   └── 08_elderly/      (1 sample)
└── androgynous/
    ├── 02_child/        (13 samples)
    ├── 04_young_adult/  (19 samples)
    └── 05_adult/        (25 samples)
```

Each sample consists of:

- **`.wav`**: Enhanced audio file (48kHz, WAV format)
- **`.json`**: Metadata sidecar with caption, emotion scores, and quality scores

## Gender Distribution

| Gender | Count |
|---|---|
| Male | 1,037 |
| Female | 910 |
| Androgynous | 57 |
| **Total** | **2,004** |

## Annotations

### Quality Scores (4 dimensions)

All samples were quality-filtered in the source dataset with:

- `score_background_quality >= 3.5` (DNS MOS background quality)
- `score_content_enjoyment >= 5.0` (content enjoyment rating)

The full set of quality scores in each JSON sidecar:

| Score | Description |
|---|---|
| `score_background_quality` | DNS MOS background quality rating |
| `score_content_enjoyment` | Content enjoyment rating |
| `score_overall_quality` | Overall audio quality rating |
| `score_speech_quality` | Speech-specific quality rating |

### Emotion Scores (55 dimensions)

Each JSON sidecar contains 55 emotion and vocal characteristic scores from Empathic Insight Voice Plus, including:

- **Core emotions**: Anger, Disgust, Fear, Sadness, Joy/Happiness, Contentment, Amusement, Affection, Awe
- **Complex emotions**: Contempt, Confusion, Distress, Disappointment, Bitterness, Nostalgia, Guilt/Shame, Envy/Jealousy
- **Social/cognitive**: Concentration, Contemplation, Determination, Pride, Relief, Sarcasm/Irony, Triumph
- **Surprise variants**: Astonishment/Surprise, Excitement
- **Vocal characteristics**: Age, Arousal, Valence, Dominance, Authenticity, Monotone vs. Expressive, Confident vs. Hesitant, Formal vs. Casual, Fast vs. Slow, Loud vs. Soft, Staccato vs. Legato, Tense vs. Relaxed, Nasal, Breathy/Whisper, Creaky/Vocal Fry, Trembling/Shaky, Lisp/Speech Impediment, Accent Strength
- **Additional**: Background Noise, Music/Singing, Laughter, Non-speech Sounds, Reverberation, Multiple Speakers

## Speech Enhancement Details

The [ClearerVoice-Studio](https://github.com/modelscope/ClearerVoice-Studio) MossFormer2_SE_48K model is a state-of-the-art speech enhancement model that:

- Removes background noise while preserving speech quality
- Operates at 48kHz for high-fidelity output
- Uses the MossFormer2 architecture optimized for speech enhancement (SE)
- Produces clean, broadcast-quality speech suitable for TTS reference voices

## Source Dataset Lineage

```
laion/reference_ai_voices_with_timbre_annotations (17 tar files, ~32,000 samples)
│
├── Quality filtering (score_background_quality >= 3.5, score_content_enjoyment >= 5.0)
│   → 11,473 samples
│
├── Speaker deduplication (Orange/Speaker-wavLM-tbr embeddings, agglomerative clustering)
│   → 2,004 unique speakers
│
└── laion/ai-voices-deduplicated (2,004 samples, MP3)
    │
    ├── ClearerVoice MossFormer2_SE_48K speech enhancement
    ├── Format conversion to 48kHz WAV
    ├── Empathic Insight Voice Plus annotation (59 scores)
    │
    └── laion/reference-voices-enhanced (2,004 samples, WAV @ 48kHz) ← this dataset
```

## Usage

```python
from huggingface_hub import hf_hub_download
import tarfile

# Download the tar file
path = hf_hub_download(
    "laion/reference-voices-enhanced",
    "reference-voices-enhanced.tar",
    repo_type="dataset"
)

# Extract
with tarfile.open(path) as tar:
    tar.extractall("./reference-voices-enhanced")
```

## Citation

If you use this dataset, please cite the source dataset and the tools used:

- [LAION AI Voices Deduplicated](https://huggingface.co/datasets/laion/ai-voices-deduplicated)
- [ClearerVoice-Studio](https://github.com/modelscope/ClearerVoice-Studio)
- [Empathic Insight Voice Plus](https://huggingface.co/laion/Empathic-Insight-Voice-Plus)
- [Orange Speaker-wavLM-tbr](https://huggingface.co/Orange/Speaker-wavLM-tbr)

## License

CC-BY-4.0
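
Once the archive has been extracted, each `.wav` can be paired with its `.json` sidecar and the documented quality thresholds re-checked. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `iter_samples` and `passes_quality` are hypothetical, and it assumes that each sidecar shares its WAV file's stem, that files sit under `<gender>/<age_category>/`, and that `score_background_quality` and `score_content_enjoyment` are top-level keys in the JSON.

```python
import json
from pathlib import Path
from typing import Iterator, NamedTuple


class Sample(NamedTuple):
    wav_path: Path
    gender: str        # directory name, e.g. "male", "female", "androgynous"
    age_category: str  # directory name, e.g. "05_adult"
    metadata: dict     # parsed JSON sidecar


def iter_samples(root: str) -> Iterator[Sample]:
    """Yield every .wav under root that has a .json sidecar beside it.

    Assumption: layout is <gender>/<age_category>/<name>.wav with a
    sidecar named <name>.json in the same directory.
    """
    for wav_path in sorted(Path(root).rglob("*.wav")):
        sidecar = wav_path.with_suffix(".json")
        if not sidecar.is_file():
            continue  # skip audio without metadata rather than guess
        with sidecar.open(encoding="utf-8") as f:
            metadata = json.load(f)
        age_category = wav_path.parent.name
        gender = wav_path.parent.parent.name
        yield Sample(wav_path, gender, age_category, metadata)


def passes_quality(
    metadata: dict,
    min_background: float = 3.5,
    min_enjoyment: float = 5.0,
) -> bool:
    """Apply the two thresholds listed under Quality Scores.

    Assumption: both scores are top-level numeric keys in the sidecar.
    A sample missing either key is treated as failing.
    """
    background = metadata.get("score_background_quality")
    enjoyment = metadata.get("score_content_enjoyment")
    if background is None or enjoyment is None:
        return False
    return background >= min_background and enjoyment >= min_enjoyment
```

For example, `[s for s in iter_samples("./reference-voices-enhanced") if s.gender == "female" and passes_quality(s.metadata)]` would select female voices that meet both thresholds. Because gender and age are read from the two nearest parent directories, the sketch should still work if the tar unpacks into an extra top-level folder.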
Provider:
laion