laion/clustered-reference-voices
Source: Hugging Face. Updated 2026-03-17; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/laion/clustered-reference-voices
Resource description:
---
language:
- en
license: cc-by-4.0
pretty_name: Clustered Reference Voices (EMOLIA 3K)
size_categories:
- 1K<n<10K
task_categories:
- audio-classification
- text-to-speech
tags:
- speech
- voice-cloning
- speaker-embeddings
- speech-enhancement
- quality-scoring
- reference-voices
dataset_info:
  features:
  - name: cluster_id
    dtype: int64
  - name: overall_quality
    dtype: float64
  - name: speech_quality
    dtype: float64
  - name: background_quality
    dtype: float64
  - name: content_enjoyment
    dtype: float64
  - name: duration
    dtype: float64
  - name: text
    dtype: string
  - name: sample_id
    dtype: string
  - name: cosine_similarity
    dtype: float64
  splits:
  - name: train
    num_examples: 3000
---
# Clustered Reference Voices (EMOLIA 3K)
**3,000 enhanced reference voice MP3s** — one high-quality representative sample per speaker cluster, selected and scored by a multi-expert neural quality model.
## Overview
| Property | Value |
|---|---|
| **Total clips** | 3,000 |
| **Total duration** | 11.3 hours |
| **Mean duration** | 13.5 s (range: 3.5 – 29.9 s) |
| **Format** | MP3, 192 kbps, 48 kHz |
| **Language** | English |
| **Naming** | `{cluster_id}.mp3` (0 – 2999) |
## Source
The source data is [laion/emolia-3k-speaker-clusters](https://huggingface.co/datasets/laion/emolia-3k-speaker-clusters), which contains **3,000 speaker clusters** with approximately 20 samples each (59,977 total utterances). Clusters were produced by grouping speaker embeddings from a diverse collection of English speech.
## Processing Pipeline
Each of the 59,977 source utterances was enhanced and then scored; a final selection step kept one representative per cluster:
### 1. Speech Enhancement — ClearerVoice MossFormer2_SE_48K
All audio was enhanced at **48 kHz** using the [MossFormer2_SE_48K](https://github.com/modelscope/ClearerVoice-Enhancement) speech enhancement model. This removes background noise, music, reverb, and other non-speech artifacts while preserving the natural characteristics of the speaker's voice.
### 2. Quality Scoring — Empathic Insight Voice Plus
Enhanced audio was scored by the **Empathic Insight Voice Plus** model, which employs **59 MLP expert heads** on top of Whisper encoder embeddings. The model produces multiple quality dimensions:
| Score | Description |
|---|---|
| `overall_quality` | Composite quality score (primary selection criterion) |
| `speech_quality` | Clarity and naturalness of speech |
| `background_quality` | Absence of background noise / artifacts |
| `content_enjoyment` | Engaging and well-articulated content |
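
To illustrate the "many small heads over one shared embedding" idea described above, here is a toy sketch in plain Python. It is not the Empathic Insight Voice Plus implementation: the layer sizes, the ReLU activation, the random weights, the use of a single pooled embedding vector, and the assumption that the four quality dimensions are four of the 59 heads are all hypothetical choices made for illustration.

```python
import math
import random

def make_head(dim_in, dim_hidden, rng):
    """Create one hypothetical MLP head: dim_in -> dim_hidden -> 1."""
    return {
        "w1": [[rng.gauss(0.0, 1.0 / math.sqrt(dim_in)) for _ in range(dim_in)]
               for _ in range(dim_hidden)],
        "b1": [0.0] * dim_hidden,
        "w2": [rng.gauss(0.0, 1.0 / math.sqrt(dim_hidden)) for _ in range(dim_hidden)],
        "b2": 0.0,
    }

def run_head(head, embedding):
    """Apply one head: linear layer, ReLU, linear layer, giving one scalar."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, embedding)) + b)
        for row, b in zip(head["w1"], head["b1"])
    ]
    return sum(w * h for w, h in zip(head["w2"], hidden)) + head["b2"]

def score_all(heads, embedding):
    """Every head reads the same shared embedding and emits its own score."""
    return {name: run_head(head, embedding) for name, head in heads.items()}

rng = random.Random(0)
DIM_IN, DIM_HIDDEN = 16, 8  # toy sizes, far smaller than a real Whisper encoder

# Assumption: the four documented quality dimensions are heads; the rest get placeholder names.
head_names = ["overall_quality", "speech_quality", "background_quality", "content_enjoyment"]
head_names += [f"head_{i}" for i in range(len(head_names), 59)]
heads = {name: make_head(DIM_IN, DIM_HIDDEN, rng) for name in head_names}

embedding = [rng.gauss(0.0, 1.0) for _ in range(DIM_IN)]  # stand-in for a pooled encoder output
scores = score_all(heads, embedding)
```

The point of the sketch is structural: the expensive encoder runs once per clip, and each additional score costs only one small MLP evaluation.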
### 3. Selection — Top Sample per Cluster
For each of the 3,000 speaker clusters, the single sample with the **highest `overall_quality` score** was selected as the cluster's representative reference voice.
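
The selection rule above amounts to a per-cluster argmax over `overall_quality`. A minimal sketch, assuming the scored utterances are available as a list of dictionaries keyed by the metadata column names; the tie-breaking behaviour (first row seen wins) is an assumption, since the source does not say how ties were resolved:

```python
def select_representatives(rows):
    """Return, for each cluster_id, the row with the highest overall_quality."""
    best = {}
    for row in rows:
        current = best.get(row["cluster_id"])
        # Strict '>' keeps the first-seen row on ties (assumed tie-breaking).
        if current is None or row["overall_quality"] > current["overall_quality"]:
            best[row["cluster_id"]] = row
    return [best[cid] for cid in sorted(best)]

# Made-up example rows; the values are illustrative only.
rows = [
    {"cluster_id": 0, "sample_id": "a", "overall_quality": 3.01},
    {"cluster_id": 0, "sample_id": "b", "overall_quality": 3.20},
    {"cluster_id": 1, "sample_id": "c", "overall_quality": 2.95},
    {"cluster_id": 1, "sample_id": "d", "overall_quality": 2.80},
]
representatives = select_representatives(rows)
```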
## Quality Statistics
| Metric | Overall Quality | Speech Quality | Background Quality | Content Enjoyment |
|---|---|---|---|---|
| **Mean** | 3.114 | 1.873 | 3.766 | 4.846 |
| **Std** | 0.102 | 0.085 | 0.110 | 0.184 |
| **Min** | 2.744 | 1.588 | 3.070 | 3.853 |
| **Max** | 3.469 | 2.182 | 4.344 | 5.398 |
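
Summary statistics like those in the table can be recomputed from the metadata rows. The sketch below uses only the standard library and assumes the rows are dictionaries keyed by column name. Whether the published **Std** is a sample or a population standard deviation is not stated, so the choice of `statistics.stdev` (sample, n - 1) is an assumption.

```python
import statistics

def summarize(values):
    """Mean, standard deviation, min and max of a sequence of floats."""
    values = list(values)
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample std (n - 1); an assumption
        "min": min(values),
        "max": max(values),
    }

def summarize_columns(rows, columns):
    """Summarize several score columns of a list of row dictionaries."""
    return {col: summarize(row[col] for row in rows) for col in columns}

# Made-up rows for illustration; real rows come from metadata.parquet.
rows = [
    {"overall_quality": 3.0, "speech_quality": 1.8},
    {"overall_quality": 3.1, "speech_quality": 1.9},
    {"overall_quality": 3.2, "speech_quality": 2.0},
]
summary = summarize_columns(rows, ["overall_quality", "speech_quality"])
```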
## Dataset Files
| File | Description | Size |
|---|---|---|
| `audio.tar.gz` | All 3,000 MP3 files | ~910 MB |
| `metadata.parquet` | Quality scores and metadata for all clips | ~500 KB |
| `gallery.html` | Interactive HTML gallery with embedded base64 audio, sortable columns, and search | ~5.5 MB |
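
Once `audio.tar.gz` has been downloaded, the clips can be unpacked and indexed by cluster ID using the `{cluster_id}.mp3` naming convention. The following is a sketch under stated assumptions: the internal directory layout of the archive is not documented here, so the helper (a hypothetical name, `extract_audio`) ignores any directory prefix and keeps only files whose stem is a decimal number.

```python
import shutil
import tarfile
from pathlib import Path

def extract_audio(archive_path, dest_dir):
    """Extract `{cluster_id}.mp3` files and return {cluster_id: Path}."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            name = Path(member.name)
            # Skip directories and anything that is not a numbered MP3.
            if not member.isfile() or name.suffix != ".mp3" or not name.stem.isdecimal():
                continue
            # Flatten: write only the base name, which also avoids unsafe archive paths.
            target = dest / name.name
            with tar.extractfile(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            paths[int(name.stem)] = target
    return paths

# Example usage (paths are placeholders):
# clips = extract_audio("audio.tar.gz", "clips")
# clips[42]  -> Path("clips/42.mp3")
```

Copying each member by hand rather than calling `extractall` keeps the behaviour the same across Python versions and never writes outside `dest_dir`.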
### Metadata Schema (parquet)
| Column | Type | Description |
|---|---|---|
| `cluster_id` | int64 | Speaker cluster index (0–2999) |
| `overall_quality` | float64 | Composite quality score |
| `speech_quality` | float64 | Speech clarity / naturalness score |
| `background_quality` | float64 | Background cleanliness score |
| `content_enjoyment` | float64 | Content engagement score |
| `duration` | float64 | Duration in seconds |
| `text` | string | Transcript text |
| `sample_id` | string | Original sample identifier |
| `cosine_similarity` | float64 | Cosine similarity of sample's speaker embedding to cluster centroid |
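
As a worked example of what the `cosine_similarity` column measures, the sketch below computes the similarity between each embedding in a toy cluster and the cluster centroid. It assumes the centroid is the unweighted mean of the raw embeddings; the source does not say whether embeddings were normalized first, and the two-dimensional vectors are made up.

```python
import math

def centroid(embeddings):
    """Unweighted mean of a list of equal-length vectors (assumed definition)."""
    n = len(embeddings)
    return [sum(dim) / n for dim in zip(*embeddings)]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy cluster of three 2-D "speaker embeddings".
cluster = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
center = centroid(cluster)  # [2/3, 2/3]
similarities = [cosine_similarity(e, center) for e in cluster]
# The third vector points exactly along the centroid, so its similarity is 1;
# the first two sit at 45 degrees to it, giving about 0.707.
```

A value near 1 therefore indicates a clip whose voice is typical of its cluster, while a lower value flags a clip that sits near the cluster's edge.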
## Intended Uses
- **TTS reference voices**: High-quality, diverse speaker references for text-to-speech systems
- **Voice cloning**: Clean, enhanced single-speaker clips suitable as cloning targets
- **Speaker verification benchmarks**: One representative per cluster for speaker ID tasks
- **Quality filtering research**: Studying the relationship between quality scores and perceptual quality
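
For the TTS and voice-cloning uses listed above, a common first step is to shortlist clips by duration and score. The helper below is a hypothetical sketch: the function name and the thresholds are illustrative and are not recommendations from the dataset authors. Rows are assumed to be dictionaries keyed by the metadata column names, for example as produced by `pandas.read_parquet("metadata.parquet").to_dict("records")`.

```python
def pick_references(rows, min_duration=5.0, max_duration=15.0,
                    min_similarity=0.0, top_k=10):
    """Return up to top_k rows within the duration window, best quality first."""
    candidates = [
        r for r in rows
        if min_duration <= r["duration"] <= max_duration
        and r["cosine_similarity"] >= min_similarity
    ]
    return sorted(candidates, key=lambda r: r["overall_quality"], reverse=True)[:top_k]

# Made-up rows for illustration.
rows = [
    {"cluster_id": 0, "duration": 4.0, "overall_quality": 3.4, "cosine_similarity": 0.9},
    {"cluster_id": 1, "duration": 9.5, "overall_quality": 3.1, "cosine_similarity": 0.8},
    {"cluster_id": 2, "duration": 12.0, "overall_quality": 3.3, "cosine_similarity": 0.7},
    {"cluster_id": 3, "duration": 20.0, "overall_quality": 3.2, "cosine_similarity": 0.9},
]
picked = pick_references(rows, top_k=2)
```

Clusters 0 and 3 fall outside the 5 to 15 second window, so only clusters 2 and 1 remain, ordered by `overall_quality`.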
## Interactive Gallery
The included `gallery.html` file provides a self-contained, browser-based interface to explore all 3,000 samples. Features:
- Embedded base64 audio playback (no server required)
- Sortable columns (click any header)
- Full-text search across cluster IDs and transcripts
- Quality score display for all dimensions
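
Server-free playback of this kind usually relies on `data:` URIs. The sketch below shows the general technique only; it is not the code that generated `gallery.html`, and the function names and row layout are hypothetical.

```python
import base64
import html

def audio_tag(mp3_bytes):
    """Build an <audio> element whose source is an inline base64 data URI."""
    payload = base64.b64encode(mp3_bytes).decode("ascii")
    return f'<audio controls preload="none" src="data:audio/mpeg;base64,{payload}"></audio>'

def gallery_row(cluster_id, transcript, mp3_bytes):
    """Render one table row with the cluster ID, escaped transcript, and player."""
    return (
        "<tr>"
        f"<td>{cluster_id}</td>"
        f"<td>{html.escape(transcript)}</td>"
        f"<td>{audio_tag(mp3_bytes)}</td>"
        "</tr>"
    )

# Four placeholder bytes stand in for real MP3 data.
row = gallery_row(7, "Hello & welcome", b"\xff\xfb\x90\x00")
```

Base64 inflates the payload by roughly a third, which is the usual trade-off for a single-file page that works offline.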
## Citation
```bibtex
@dataset{clustered_reference_voices_2026,
title={Clustered Reference Voices (EMOLIA 3K)},
author={LAION},
year={2026},
url={https://huggingface.co/datasets/laion/clustered-reference-voices}
}
```
## License
This dataset is released under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
Provider:
laion



