laion/emolia-balanced-5M-subset

Name: laion/emolia-balanced-5M-subset
Creator: laion
Published: 2026-04-19 23:13:56
License: No description provided

Hugging Face. Updated 2026-04-19; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/laion/emolia-balanced-5M-subset

Description:

# emolia-balanced-5M-subset

A balanced ~5.26M-sample subset of [laion/Emolia](https://huggingface.co/datasets/laion/Emolia) (80.5M speech samples), packaged as [WebDataset](https://github.com/webdataset/webdataset)-compatible tar shards for direct use in training pipelines.

---

## How this subset was filtered

Samples were selected if they met **either** of two criteria:

### 1. Emotion thresholds

Each sample carries 40 emotion annotation scores (from the Emonet taxonomy) in its metadata. A sample qualifies for an emotion bucket if its score for that emotion meets or exceeds a per-emotion threshold. Thresholds were computed from the full Emolia distribution such that each emotion bucket targets **~100,000 samples**. Up to **100,000 samples per emotion × 40 emotions** were collected (with overlap: one sample can qualify under multiple emotions).

### 2. Speaker diversity (centroid coverage)

Each sample's 128-dimensional WavLM timbre embedding was assigned to its nearest centroid from a set of **3,000 speaker centroids** (k-means pruned, `centroids_pruned.npy`). Up to **1,000 samples per centroid** were collected to ensure broad speaker diversity across rare voice types.

### Deduplication

After extraction, sample IDs were checked globally: 259 cross-shard duplicates were removed, leaving **5,256,683 unique samples**.

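
The two criteria and the deduplication step can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the script used to build the subset: the function names, the Euclidean distance metric, the single streaming pass, and the way the per-bucket caps are enforced are all hypothetical.

```python
import numpy as np


def nearest_centroid(embedding: np.ndarray, centroids: np.ndarray) -> int:
    """Index of the centroid closest to `embedding` (Euclidean distance is an assumption)."""
    distances = np.linalg.norm(centroids - embedding, axis=1)
    return int(np.argmin(distances))


def select_samples(samples, thresholds, centroids,
                   emotion_cap=100_000, centroid_cap=1_000):
    """Return the IDs kept by either criterion, with duplicate IDs dropped.

    `samples` yields hypothetical (sample_id, scores, embedding) tuples, where
    `scores` maps an emotion name to its score.
    """
    emotion_fill = {name: 0 for name in thresholds}
    centroid_fill = np.zeros(len(centroids), dtype=int)
    kept, seen = [], set()
    for sample_id, scores, embedding in samples:
        if sample_id in seen:  # deduplicate by sample ID
            continue
        keep = False
        # Criterion 1: score meets a per-emotion threshold and the bucket has room.
        # One sample may count towards several emotion buckets (overlap).
        for name, threshold in thresholds.items():
            if scores.get(name, float("-inf")) >= threshold and emotion_fill[name] < emotion_cap:
                emotion_fill[name] += 1
                keep = True
        # Criterion 2: the sample's nearest speaker centroid still has room.
        c = nearest_centroid(embedding, centroids)
        if centroid_fill[c] < centroid_cap:
            centroid_fill[c] += 1
            keep = True
        if keep:
            seen.add(sample_id)
            kept.append(sample_id)
    return kept
```

In the real pipeline the centroid array would presumably be loaded from `centroids_pruned.npy` (3,000 × 128) and the thresholds taken from the emotion table; the sketch accepts any shapes so it can be tried on toy data.
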

---

## Emotion bucket fill

| Emotion | Threshold | Samples |
|---|---|---|
| Affection | ≥ 1.50 | 100,000 |
| Amusement | ≥ 2.00 | 100,000 |
| Anger | ≥ 1.50 | 100,000 |
| Astonishment/Surprise | ≥ 2.00 | 96,332 ¹ |
| Awe | ≥ 1.00 | 100,000 |
| Bitterness | ≥ 1.00 | 100,000 |
| Concentration | ≥ 2.50 | 100,000 |
| Confusion | ≥ 2.00 | 100,000 |
| Contemplation | ≥ 2.00 | 100,000 |
| Contempt | ≥ 1.50 | 100,000 |
| Contentment | ≥ 1.50 | 100,000 |
| Disappointment | ≥ 1.50 | 100,000 |
| Disgust | ≥ 1.00 | 100,000 |
| Distress | ≥ 1.50 | 100,000 |
| Doubt | ≥ 1.00 | 100,000 |
| Elation | ≥ 2.00 | 100,000 |
| Embarrassment | ≥ 1.00 | 100,000 |
| Emotional Numbness | ≥ 2.00 | 100,000 |
| Fatigue/Exhaustion | ≥ 1.50 | 100,000 |
| Fear | ≥ 1.50 | 100,000 |
| Helplessness | ≥ 1.00 | 100,000 |
| Hope/Optimism | ≥ 2.50 | 100,000 |
| Impatience and Irritability | ≥ 2.00 | 100,000 |
| Infatuation | ≥ 1.00 | 100,000 |
| Interest | ≥ 2.50 | 100,000 |
| Intoxication/Altered States | ≥ 1.50 | 100,000 |
| Jealousy & Envy | ≥ 1.00 | 100,000 |
| Longing | ≥ 1.50 | 100,000 |
| Malevolence/Malice | ≥ 1.50 | 100,000 |
| Pain | ≥ 1.00 | 100,000 |
| Pleasure/Ecstasy | ≥ 1.50 | 100,000 |
| Pride | ≥ 1.50 | 100,000 |
| Relief | ≥ 1.50 | 100,000 |
| Sadness | ≥ 1.50 | 100,000 |
| Sexual Lust | ≥ 1.00 | 100,000 |
| Shame | ≥ 1.00 | 100,000 |
| Sourness | ≥ 1.00 | 100,000 |
| Teasing | ≥ 1.50 | 100,000 |
| Thankfulness/Gratitude | ≥ 3.00 | 100,000 |
| Triumph | ≥ 1.50 | 100,000 |

¹ *Astonishment/Surprise* reached only 96,332. It is the rarest emotion in the full 80.5M-sample dataset, and all available qualifying samples were included.

---

## Dataset format

Shards are plain tar files containing sequential `.mp3` / `.json` pairs:

```
emolia-000000.tar
  000000.mp3
  000000.json
  000001.mp3
  000001.json
  ...
  004999.mp3
  004999.json
emolia-000001.tar
  ...
```

- **1,052 shards** of up to 5,000 samples each, **5,256,683 samples total** (1,052 × 5,000 would be 5,260,000, so at least one shard holds fewer than 5,000 samples)
- Audio: MP3, original quality from Emolia
- Metadata: original Emolia JSON with `wavelm_timbre_embedding` stripped (saves ~50% JSON size), plus `__emolia_id__` field

### JSON fields

Each `.json` file holds the original Emolia sample metadata plus the added `__emolia_id__` field. The main fields are:

| Field | Meaning |
|---|---|
| `__emolia_id__` | Original Emolia sample ID |
| `id` | Same as `__emolia_id__` (original field) |
| `language` | BCP-47 language code |
| `duration` | Audio duration in seconds |
| `dnsmos` | DNSMOS audio quality score |
| `speaker` | Speaker ID |
| `emotion_annotation` | Dict of 40+ `*_best` emotion scores |
| `characters_per_second` | Speech rate |

---

## Loading in PyTorch

```python
import webdataset as wds

dataset = (
    wds.WebDataset("data/emolia-{000000..001051}.tar")
    # The string spec "torch" only selects image decoding in webdataset;
    # the torch_audio handler is the one that decodes .mp3 via torchaudio.
    .decode(wds.torch_audio)
    .to_tuple("mp3", "json")
)

for audio, meta in dataset:
    # audio: (waveform, sample_rate) as returned by torchaudio
    # meta:  dict (Emolia metadata)
    emotion_scores = meta["emotion_annotation"]
    ...
```

Or load a single shard directly:

```python
import json
import tarfile

with tarfile.open("data/emolia-000000.tar") as tar:
    members = {m.name: m for m in tar.getmembers()}
    # Iterate over the keys actually present rather than assuming 5,000 per shard.
    keys = sorted(name[: -len(".json")] for name in members if name.endswith(".json"))
    for key in keys:
        audio = tar.extractfile(members[key + ".mp3"]).read()  # raw MP3 bytes
        meta = json.load(tar.extractfile(members[key + ".json"]))
```

---

## Source dataset

- **Source**: [laion/Emolia](https://huggingface.co/datasets/laion/Emolia), 80.5M speech samples with 40-dimensional emotion annotations (Emonet taxonomy)
- **Emotion annotations**: continuous regression scores from a fine-tuned audio model, stored under `emotion_annotation.*_best`
- **Speaker embeddings**: 128-dim WavLM timbre embeddings (used for centroid assignment, stripped from this subset's JSON)
- **Thresholds derived from**: 86% of the full dataset (~69M samples), with a ×1.16 scale factor (≈ 1/0.86) to extrapolate to the 100k-per-emotion target

---

## License

Inherits the license of [laion/Emolia](https://huggingface.co/datasets/laion/Emolia). Please refer to the source dataset for usage terms.
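
As a usage note for the `emotion_annotation` field listed under JSON fields, the helper below ranks one sample's `*_best` scores. It is a minimal sketch: `top_emotions` is a hypothetical helper, and the exact key names (such as `Anger_best`) are assumed from the `*_best` pattern rather than confirmed against the data.

```python
def top_emotions(meta: dict, k: int = 3) -> list[tuple[str, float]]:
    """Return the k highest-scoring `*_best` entries from one sample's metadata."""
    annotation = meta.get("emotion_annotation", {})
    scores = {
        name: float(value)
        for name, value in annotation.items()
        # Keep only numeric entries whose key follows the `*_best` pattern.
        if name.endswith("_best") and isinstance(value, (int, float))
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical metadata for illustration; real key names may differ.
example_meta = {
    "emotion_annotation": {
        "Anger_best": 0.2,
        "Interest_best": 2.7,
        "Amusement_best": 2.1,
        "Interest_raw": 9.9,  # hypothetical non-`_best` key, ignored
    }
}
print(top_emotions(example_meta, k=2))
# → [('Interest_best', 2.7), ('Amusement_best', 2.1)]
```

The same function can be applied to the `meta` dict produced by either loading approach, for example to keep only samples whose top emotion exceeds the threshold listed in the emotion table.
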
