laion/emolia-balanced-5M-subset
Hugging Face | Updated 2026-04-19 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/laion/emolia-balanced-5M-subset
Description:
# emolia-balanced-5M-subset
A balanced ~5.26M-sample subset of [laion/Emolia](https://huggingface.co/datasets/laion/Emolia) (80.5M speech samples), packaged as [WebDataset](https://github.com/webdataset/webdataset)-compatible tar shards for direct use in training pipelines.
---
## How this subset was filtered
Samples were selected if they met **either** of two criteria:
### 1. Emotion thresholds
Each sample carries 40 emotion annotation scores (from the Emonet taxonomy) in its metadata. A sample qualifies for an emotion bucket if its score for that emotion meets or exceeds a per-emotion threshold. Thresholds were computed from the full Emolia distribution such that each emotion bucket targets **~100,000 samples**.
Up to **100,000 samples per emotion × 40 emotions** were collected (with overlap — one sample can qualify under multiple emotions).
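The qualification rule above can be sketched as follows. The thresholds are copied from the bucket table in this card; the `<Emotion>_best` key format and the helper name `qualifying_emotions` are assumptions made for illustration, so check a real JSON file for the exact key names.

```python
# Per-emotion thresholds (a small excerpt of the table in this card).
THRESHOLDS = {
    "Anger": 1.5,
    "Amusement": 2.0,
    "Thankfulness/Gratitude": 3.0,
}


def qualifying_emotions(emotion_annotation, thresholds=THRESHOLDS, suffix="_best"):
    """Return the emotions whose score meets or exceeds its threshold.

    ASSUMPTION: scores are stored under keys shaped like "<Emotion>_best";
    the real key names in the Emolia metadata may differ.
    """
    hits = []
    for emotion, threshold in thresholds.items():
        score = emotion_annotation.get(emotion + suffix)
        if score is not None and score >= threshold:
            hits.append(emotion)
    return hits


# Hypothetical metadata fragment, not taken from the dataset.
example = {
    "Anger_best": 1.7,
    "Amusement_best": 2.0,
    "Thankfulness/Gratitude_best": 0.4,
}
print(qualifying_emotions(example))  # one sample can land in several buckets
```

Because the comparison is "meets or exceeds", a score exactly equal to the threshold qualifies.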
### 2. Speaker diversity (centroid coverage)
Each sample's 128-dimensional WavLM timbre embedding was assigned to its nearest centroid from a set of **3,000 speaker centroids** (k-means pruned, `centroids_pruned.npy`). Up to **1,000 samples per centroid** were collected to ensure broad speaker diversity across rare voice types.
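A minimal sketch of nearest-centroid assignment with a per-centroid cap is shown below. It assumes Euclidean distance (the usual choice with k-means) and a first-come-first-served fill of each centroid's quota; the card does not state either detail. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def assign_to_centroids(embeddings, centroids, cap=1000):
    """Assign each embedding to its nearest centroid, keeping at most `cap` per centroid.

    ASSUMPTIONS: Euclidean distance and first-come-first-served quota filling.
    Returns (nearest centroid index per sample, indices of kept samples).
    """
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2, shape (n_samples, n_centroids).
    d2 = (
        (embeddings ** 2).sum(axis=1, keepdims=True)
        - 2.0 * embeddings @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    nearest = d2.argmin(axis=1)

    counts = np.zeros(len(centroids), dtype=np.int64)
    kept = []
    for i, c in enumerate(nearest):
        if counts[c] < cap:
            counts[c] += 1
            kept.append(i)
    return nearest, np.array(kept, dtype=np.int64)


# Synthetic stand-ins. The real inputs would be the 128-dim timbre embeddings
# and the 3,000 centroids, e.g. np.load("centroids_pruned.npy").
rng = np.random.default_rng(0)
centroids = rng.normal(size=(5, 128))
embeddings = np.repeat(centroids, 4, axis=0) + 0.01 * rng.normal(size=(20, 128))
nearest, kept = assign_to_centroids(embeddings, centroids, cap=3)
print(len(kept))  # 5 centroids x cap of 3
```

For the full 80.5M-sample source, the distance matrix would need to be computed in batches rather than in one call.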
### Deduplication
After extraction, sample IDs were checked globally — 259 cross-shard duplicates were removed, leaving **5,256,683 unique samples**.
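Global deduplication by sample ID can be sketched like this. The card does not say which copy of a duplicate was kept; this sketch keeps the first occurrence, and the function name is hypothetical.

```python
def dedupe_by_id(samples, id_field="__emolia_id__"):
    """Drop repeated IDs, keeping the first occurrence of each.

    ASSUMPTION: "first occurrence wins" is an illustrative policy only.
    Returns (unique samples, number of duplicates removed).
    """
    seen = set()
    unique = []
    for sample in samples:
        sample_id = sample[id_field]
        if sample_id in seen:
            continue
        seen.add(sample_id)
        unique.append(sample)
    return unique, len(samples) - len(unique)


# Hypothetical records: "x1" was selected by both an emotion bucket and a centroid.
records = [
    {"__emolia_id__": "x1"},
    {"__emolia_id__": "x2"},
    {"__emolia_id__": "x1"},
    {"__emolia_id__": "x3"},
]
unique, removed = dedupe_by_id(records)
print(len(unique), removed)
```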
---
## Emotion bucket fill
| Emotion | Threshold | Samples |
|---|---|---|
| Affection | ≥ 1.50 | 100,000 |
| Amusement | ≥ 2.00 | 100,000 |
| Anger | ≥ 1.50 | 100,000 |
| Astonishment/Surprise | ≥ 2.00 | 96,332 ¹ |
| Awe | ≥ 1.00 | 100,000 |
| Bitterness | ≥ 1.00 | 100,000 |
| Concentration | ≥ 2.50 | 100,000 |
| Confusion | ≥ 2.00 | 100,000 |
| Contemplation | ≥ 2.00 | 100,000 |
| Contempt | ≥ 1.50 | 100,000 |
| Contentment | ≥ 1.50 | 100,000 |
| Disappointment | ≥ 1.50 | 100,000 |
| Disgust | ≥ 1.00 | 100,000 |
| Distress | ≥ 1.50 | 100,000 |
| Doubt | ≥ 1.00 | 100,000 |
| Elation | ≥ 2.00 | 100,000 |
| Embarrassment | ≥ 1.00 | 100,000 |
| Emotional Numbness | ≥ 2.00 | 100,000 |
| Fatigue/Exhaustion | ≥ 1.50 | 100,000 |
| Fear | ≥ 1.50 | 100,000 |
| Helplessness | ≥ 1.00 | 100,000 |
| Hope/Optimism | ≥ 2.50 | 100,000 |
| Impatience and Irritability | ≥ 2.00 | 100,000 |
| Infatuation | ≥ 1.00 | 100,000 |
| Interest | ≥ 2.50 | 100,000 |
| Intoxication/Altered States | ≥ 1.50 | 100,000 |
| Jealousy & Envy | ≥ 1.00 | 100,000 |
| Longing | ≥ 1.50 | 100,000 |
| Malevolence/Malice | ≥ 1.50 | 100,000 |
| Pain | ≥ 1.00 | 100,000 |
| Pleasure/Ecstasy | ≥ 1.50 | 100,000 |
| Pride | ≥ 1.50 | 100,000 |
| Relief | ≥ 1.50 | 100,000 |
| Sadness | ≥ 1.50 | 100,000 |
| Sexual Lust | ≥ 1.00 | 100,000 |
| Shame | ≥ 1.00 | 100,000 |
| Sourness | ≥ 1.00 | 100,000 |
| Teasing | ≥ 1.50 | 100,000 |
| Thankfulness/Gratitude | ≥ 3.00 | 100,000 |
| Triumph | ≥ 1.50 | 100,000 |
¹ *Astonishment/Surprise* reached only 96,332 — the rarest emotion in the full 80.5M-sample dataset; all available qualifying samples were included.
---
## Dataset format
Shards are plain tar files containing sequential `.mp3` / `.json` pairs:
```
emolia-000000.tar
    000000.mp3
    000000.json
    000001.mp3
    000001.json
    ...
    004999.mp3
    004999.json
emolia-000001.tar
    ...
```
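- **1,052 shards** of up to 5,000 samples each, **5,256,683 samples total** (1,052 full shards would hold 5,260,000, so at least one shard contains fewer than 5,000)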
- Audio: MP3, original quality from Emolia
- Metadata: original Emolia JSON with `wavelm_timbre_embedding` stripped (saves ~50% JSON size), plus `__emolia_id__` field
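The shard arithmetic can be checked in a few lines. The `data/` prefix and the six-digit file naming follow the loading examples in this card; the constant names are illustrative.

```python
import math

TOTAL_SAMPLES = 5_256_683
SAMPLES_PER_SHARD = 5_000
NUM_SHARDS = 1_052

# 5,256,683 / 5,000 rounds up to 1,052 shards.
assert math.ceil(TOTAL_SAMPLES / SAMPLES_PER_SHARD) == NUM_SHARDS

# Explicit list of shard paths, equivalent to the brace pattern
# "data/emolia-{000000..001051}.tar".
shard_paths = [f"data/emolia-{i:06d}.tar" for i in range(NUM_SHARDS)]

# How far the total falls short of 1,052 completely full shards.
shortfall = NUM_SHARDS * SAMPLES_PER_SHARD - TOTAL_SAMPLES
print(shard_paths[0], shard_paths[-1], shortfall)
```

The shortfall of 3,317 samples means code should not assume every shard contains keys `000000` through `004999`.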
### JSON fields
Each `.json` file is the original Emolia sample metadata plus the added `__emolia_id__` field. The main fields are:
| Field | Meaning |
|---|---|
| `__emolia_id__` | Original Emolia sample ID |
| `id` | Same as `__emolia_id__` (original field) |
| `language` | BCP-47 language code |
| `duration` | Audio duration in seconds |
| `dnsmos` | DNSMOS audio quality score |
| `speaker` | Speaker ID |
| `emotion_annotation` | Dict of 40+ `*_best` emotion scores |
| `characters_per_second` | Speech rate |
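As an illustration of how these fields might be used, the sketch below filters metadata dicts by language, quality, and duration. The field names come from the table above; the helper name, the default cutoffs, and the example language codes are hypothetical placeholders, not recommendations from the dataset authors.

```python
def keep_sample(meta, languages=("en",), min_dnsmos=3.0, max_duration=30.0):
    """Return True if a metadata dict passes simple language/quality/length checks.

    ASSUMPTION: the default values are arbitrary; inspect the real value
    ranges of `language`, `dnsmos`, and `duration` before choosing cutoffs.
    """
    return (
        meta.get("language") in languages
        and meta.get("dnsmos", 0.0) >= min_dnsmos
        and meta.get("duration", float("inf")) <= max_duration
    )


# Hypothetical metadata records, not taken from the dataset.
records = [
    {"id": "a", "language": "en", "dnsmos": 3.4, "duration": 12.5},
    {"id": "b", "language": "de", "dnsmos": 3.8, "duration": 8.0},
    {"id": "c", "language": "en", "dnsmos": 2.1, "duration": 5.0},
]
kept_ids = [r["id"] for r in records if keep_sample(r)]
print(kept_ids)
```

Records with a missing field are rejected here, which is the conservative choice for a training filter.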
---
## Loading in PyTorch
```python
import webdataset as wds

dataset = (
    wds.WebDataset("data/emolia-{000000..001051}.tar")
    # The string spec "torch" only covers images; wds.torch_audio decodes
    # .mp3 with torchaudio. The .json files are parsed by the default handlers.
    .decode(wds.torch_audio)
    .to_tuple("mp3", "json")
)

for (waveform, sample_rate), meta in dataset:
    # waveform: torch.Tensor (decoded audio); sample_rate: int
    # meta: dict (Emolia metadata)
    emotion_scores = meta["emotion_annotation"]
    ...
```
Or load a single shard directly:
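```python
import json
import tarfile

with tarfile.open("data/emolia-000000.tar") as tar:
    members = {m.name: m for m in tar.getmembers()}
    # Iterate over the keys actually present rather than range(5000):
    # a shard may hold fewer than 5,000 samples.
    for name in sorted(n for n in members if n.endswith(".json")):
        key = name[: -len(".json")]
        audio = tar.extractfile(members[key + ".mp3"]).read()  # raw MP3 bytes
        meta = json.load(tar.extractfile(members[name]))
```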
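When a shard is read only once, a streaming reader avoids building the member index first. The sketch below relies on the layout described in this card, where each `.mp3` is immediately followed by its `.json`; the function names and the tiny demo shard with fake audio bytes are hypothetical.

```python
import io
import json
import os
import tarfile
import tempfile


def iter_pairs(path):
    """Yield (key, mp3_bytes, metadata_dict) from one shard in archive order."""
    pending_key, pending_audio = None, None
    # "r|" reads the tar as a forward-only stream, so no member index is built.
    with tarfile.open(path, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.rpartition(".")
            data = tar.extractfile(member).read()
            if ext == "mp3":
                pending_key, pending_audio = key, data
            elif ext == "json" and key == pending_key:
                yield key, pending_audio, json.loads(data)
                pending_key, pending_audio = None, None


def write_demo_shard(path, n=3):
    """Write a tiny stand-in shard with fake audio bytes (demonstration only)."""
    with tarfile.open(path, "w") as tar:
        for i in range(n):
            key = f"{i:06d}"
            files = {
                key + ".mp3": b"fake-audio",
                key + ".json": json.dumps({"id": key}).encode("utf-8"),
            }
            for name, payload in files.items():
                info = tarfile.TarInfo(name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo-000000.tar")
    write_demo_shard(demo_path)
    demo_keys = [key for key, _audio, _meta in iter_pairs(demo_path)]
print(demo_keys)
```

An `.mp3` without a matching `.json` right after it is silently skipped in this sketch; a stricter reader could raise an error instead.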
---
## Source dataset
- **Source**: [laion/Emolia](https://huggingface.co/datasets/laion/Emolia) — 80.5M speech samples with 40-dimensional emotion annotations (Emonet taxonomy)
- **Emotion annotations**: continuous regression scores from a fine-tuned audio model, stored under `emotion_annotation.*_best`
- **Speaker embeddings**: 128-dim WavLM timbre embeddings (used for centroid assignment, stripped from this subset's JSON)
- **Thresholds derived from**: 86% of full dataset (~69M samples), with a ×1.16 scale factor to extrapolate to 100k/emotion target
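The extrapolation arithmetic can be illustrated as follows. The 86% coverage and the ×1.16 factor come from this card; the selection rule (highest threshold on a 0.5-step grid whose extrapolated count still reaches the target) is an assumption suggested by the round threshold values in the table, not a documented procedure, and `pick_threshold` is a hypothetical name.

```python
import numpy as np

# From this card: 86% of 80.5M samples scanned, so counts scale by 1 / 0.86.
assert round(0.86 * 80.5, 1) == 69.2   # ~69M samples scanned
assert round(1 / 0.86, 2) == 1.16      # the x1.16 scale factor
SCALE = 1.16


def pick_threshold(partial_scores, target, grid=np.arange(0.5, 5.01, 0.5), scale=SCALE):
    """Return the highest grid threshold whose extrapolated count reaches `target`.

    ASSUMPTION: both the 0.5-step grid and this selection rule are illustrative.
    Returns None if no grid value reaches the target.
    """
    best = None
    for t in grid:
        estimated_full_count = np.count_nonzero(partial_scores >= t) * scale
        if estimated_full_count >= target:
            best = float(t)
    return best


# Synthetic scores for one emotion from a hypothetical partial scan.
scores = np.array([0.2, 1.0, 1.6, 2.1, 2.6, 3.2])
print(pick_threshold(scores, target=3))
```

With the real data, `target` would be 100,000 and `partial_scores` would hold one emotion's scores across the roughly 69M scanned samples.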
---
## License
Inherits the license of [laion/Emolia](https://huggingface.co/datasets/laion/Emolia). Please refer to the source dataset for usage terms.
Provider:
laion



