laion/clustered-reference-voices

Name: laion/clustered-reference-voices
Creator: laion
Published: 2026-03-17 17:49:24
License: CC-BY-4.0

Source: Hugging Face · Updated: 2026-03-17 · Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/laion/clustered-reference-voices


Description:

---
language:
- en
license: cc-by-4.0
pretty_name: Clustered Reference Voices (EMOLIA 3K)
size_categories:
- 1K<n<10K
task_categories:
- audio-classification
- text-to-speech
tags:
- speech
- voice-cloning
- speaker-embeddings
- speech-enhancement
- quality-scoring
- reference-voices
dataset_info:
  features:
  - name: cluster_id
    dtype: int64
  - name: overall_quality
    dtype: float64
  - name: speech_quality
    dtype: float64
  - name: background_quality
    dtype: float64
  - name: content_enjoyment
    dtype: float64
  - name: duration
    dtype: float64
  - name: text
    dtype: string
  - name: sample_id
    dtype: string
  - name: cosine_similarity
    dtype: float64
  splits:
  - name: train
    num_examples: 3000
---

# Clustered Reference Voices (EMOLIA 3K)

**3,000 enhanced reference voice MP3s**: one high-quality representative sample per speaker cluster, selected and scored by a multi-expert neural quality model.

## Overview

| Property | Value |
|---|---|
| **Total clips** | 3,000 |
| **Total duration** | 11.3 hours |
| **Mean duration** | 13.5 s (range: 3.5 – 29.9 s) |
| **Format** | MP3, 192 kbps, 48 kHz |
| **Language** | English |
| **Naming** | `{cluster_id}.mp3` (0 – 2999) |

## Source

The source data is [laion/emolia-3k-speaker-clusters](https://huggingface.co/datasets/laion/emolia-3k-speaker-clusters), which contains **3,000 speaker clusters** with approximately 20 samples each (59,977 total utterances). Clusters were produced by grouping speaker embeddings from a diverse collection of English speech.

## Processing Pipeline

Each of the 59,977 source utterances was processed through two stages (speech enhancement and quality scoring), followed by a per-cluster selection step:

### 1. Speech Enhancement: ClearerVoice MossFormer2_SE_48K

All audio was enhanced at **48 kHz** using the [MossFormer2_SE_48K](https://github.com/modelscope/ClearerVoice-Enhancement) speech enhancement model. This removes background noise, music, reverb, and other non-speech artifacts while preserving the natural characteristics of the speaker's voice.

### 2. Quality Scoring: Empathic Insight Voice Plus

Enhanced audio was scored by the **Empathic Insight Voice Plus** model, which employs **59 MLP expert heads** on top of Whisper encoder embeddings. The model produces multiple quality dimensions:

| Score | Description |
|---|---|
| `overall_quality` | Composite quality score (primary selection criterion) |
| `speech_quality` | Clarity and naturalness of speech |
| `background_quality` | Absence of background noise / artifacts |
| `content_enjoyment` | Engaging and well-articulated content |

### 3. Selection: Top Sample per Cluster

For each of the 3,000 speaker clusters, the single sample with the **highest `overall_quality` score** was selected as the cluster's representative reference voice.

## Quality Statistics

| Metric | Overall Quality | Speech Quality | Background Quality | Content Enjoyment |
|---|---|---|---|---|
| **Mean** | 3.114 | 1.873 | 3.766 | 4.846 |
| **Std** | 0.102 | 0.085 | 0.110 | 0.184 |
| **Min** | 2.744 | 1.588 | 3.070 | 3.853 |
| **Max** | 3.469 | 2.182 | 4.344 | 5.398 |

## Dataset Files

| File | Description | Size |
|---|---|---|
| `audio.tar.gz` | All 3,000 MP3 files | ~910 MB |
| `metadata.parquet` | Quality scores and metadata for all clips | ~500 KB |
| `gallery.html` | Interactive HTML gallery with embedded base64 audio, sortable columns, and search | ~5.5 MB |

### Metadata Schema (parquet)

| Column | Type | Description |
|---|---|---|
| `cluster_id` | int64 | Speaker cluster index (0–2999) |
| `overall_quality` | float64 | Composite quality score |
| `speech_quality` | float64 | Speech clarity / naturalness score |
| `background_quality` | float64 | Background cleanliness score |
| `content_enjoyment` | float64 | Content engagement score |
| `duration` | float64 | Duration in seconds |
| `text` | string | Transcript text |
| `sample_id` | string | Original sample identifier |
| `cosine_similarity` | float64 | Cosine similarity of the sample's speaker embedding to the cluster centroid |

## Intended Uses

- **TTS reference voices**: High-quality, diverse speaker references for text-to-speech systems
- **Voice cloning**: Clean, enhanced single-speaker clips suitable as cloning targets
- **Speaker verification benchmarks**: One representative per cluster for speaker ID tasks
- **Quality filtering research**: Studying the relationship between quality scores and perceptual quality

## Interactive Gallery

The included `gallery.html` file provides a self-contained, browser-based interface to explore all 3,000 samples. Features:

- Embedded base64 audio playback (no server required)
- Sortable columns (click any header)
- Full-text search across cluster IDs and transcripts
- Quality score display for all dimensions

## Citation

```bibtex
@dataset{clustered_reference_voices_2026,
  title={Clustered Reference Voices (EMOLIA 3K)},
  author={LAION},
  year={2026},
  url={https://huggingface.co/datasets/laion/clustered-reference-voices}
}
```

## License

This dataset is released under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
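As an illustration of the selection step and the `{cluster_id}.mp3` naming described above, the following is a minimal sketch in plain Python, not the authors' actual pipeline code. The helper names `select_top_per_cluster` and `clip_filename` are hypothetical, and the rows are made-up toy values that only reuse the column names from the metadata schema.

```python
def select_top_per_cluster(rows):
    """Keep, for each cluster_id, the row with the highest overall_quality.

    Hypothetical helper: mirrors the described "top sample per cluster"
    rule, but is not taken from the dataset's own tooling.
    """
    best = {}
    for row in rows:
        cid = row["cluster_id"]
        if cid not in best or row["overall_quality"] > best[cid]["overall_quality"]:
            best[cid] = row
    # Return the winners ordered by cluster index.
    return [best[cid] for cid in sorted(best)]


def clip_filename(cluster_id):
    """Map a cluster index to its audio file name, following {cluster_id}.mp3."""
    return f"{cluster_id}.mp3"


# Toy input with invented sample IDs and scores (not real dataset values).
rows = [
    {"cluster_id": 0, "sample_id": "a", "overall_quality": 3.01},
    {"cluster_id": 0, "sample_id": "b", "overall_quality": 3.20},
    {"cluster_id": 1, "sample_id": "c", "overall_quality": 2.95},
]

selected = select_top_per_cluster(rows)
files = [clip_filename(r["cluster_id"]) for r in selected]
print(files)
```

With the real data, and assuming pandas plus a parquet engine are installed, the rows could instead come from `pandas.read_parquet("metadata.parquet").to_dict("records")`. Since the published metadata already holds one row per cluster, the selection helper is mainly useful for reproducing the rule on the source clusters or for re-ranking by a different score column.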

Provider:

laion
