
TTS-AGI/emolia-hq

Source: Hugging Face. Updated 2026-03-09. Indexed 2026-03-29.

Download link:
https://hf-mirror.com/datasets/TTS-AGI/emolia-hq
Description:
---
license: cc-by-4.0
task_categories:
- audio-classification
- text-to-speech
language:
- de
- en
- fr
- ja
- ko
- zh
tags:
- emotion
- speech
- audio
- webdataset
- speaker-verification
pretty_name: Emolia-HQ
size_categories:
- 10M<n<100M
---

# Emolia-HQ

**Emolia-HQ** is a high-quality, speaker-paired subset of the [LAION Emolia](https://huggingface.co/datasets/laion/Emolia) dataset. Each sample includes a target utterance and a reference utterance from the **same speaker**, enabling speaker-conditioned tasks such as voice conversion, expressive TTS, and speaker-aware emotion recognition.

## Source

Derived from [laion/Emolia](https://huggingface.co/datasets/laion/Emolia) by:

1. **Quality filtering**: Only samples with `dnsmos >= 3.0` are retained.
2. **Speaker pairing**: Each target sample is matched with a reference audio from the same speaker (different utterance), forming a "quadruplet". Samples where no same-speaker reference exists are included as pairs (target only).
3. **Metadata enrichment**: `speaker_id` and `language_id` fields are extracted from the key and injected into each sample's JSON metadata.
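The first two derivation steps can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used to build the dataset: the `records` input (a list of dicts with `id`, `speaker_id`, and `dnsmos`), the function name `filter_and_pair`, the random choice of reference, and drawing references from the already-filtered pool are all assumptions made for the example.

```python
import random
from collections import defaultdict

DNSMOS_THRESHOLD = 3.0  # threshold stated in the Source section


def filter_and_pair(records, seed=0):
    """Keep records with dnsmos >= 3.0 and attach a same-speaker reference.

    `records` is a hypothetical list of dicts with `id`, `speaker_id`,
    and `dnsmos` keys. Returns a list of (target, reference) tuples,
    where reference is None when the speaker has no other utterance.
    """
    rng = random.Random(seed)

    # Step 1: quality filtering.
    kept = [r for r in records if r["dnsmos"] >= DNSMOS_THRESHOLD]

    # Index the surviving utterances by speaker.
    by_speaker = defaultdict(list)
    for r in kept:
        by_speaker[r["speaker_id"]].append(r)

    # Step 2: speaker pairing (assumed: a random different utterance).
    paired = []
    for target in kept:
        candidates = [
            c for c in by_speaker[target["speaker_id"]]
            if c["id"] != target["id"]
        ]
        reference = rng.choice(candidates) if candidates else None
        paired.append((target, reference))
    return paired
```

A target whose `reference` is `None` corresponds to what the README calls a pair; all others correspond to quadruplets.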

## Data Format

The dataset is stored as **WebDataset** `.tar` files, organized by language:

```
emolia_hq/
  DE/   # German   (243 tars, ~130 GB)
  EN/   # English  (2,380 tars, ~2,476 GB)
  FR/   # French   (298 tars, ~187 GB)
  JA/   # Japanese (96 tars, ~163 GB)
  KO/   # Korean   (246 tars, ~79 GB)
  ZH/   # Chinese  (929 tars, ~1,681 GB)
```

Each sample within a tar file is grouped by a shared base key:

### Quadruplet (target + same-speaker reference)

| File | Description |
|------|-------------|
| `<key>.mp3` | Target audio |
| `<key>.json` | Target metadata |
| `<key>.ref.mp3` | Reference audio (same speaker, different utterance) |
| `<key>.ref.json` | Reference metadata |

### Pair (no reference found)

| File | Description |
|------|-------------|
| `<key>.mp3` | Target audio |
| `<key>.json` | Target metadata |

## JSON Metadata Fields

| Field | Description |
|-------|-------------|
| `id` | Unique utterance ID |
| `text` | Transcription |
| `duration` | Audio duration in seconds |
| `dnsmos` | DNS-MOS quality score (all >= 3.0) |
| `speaker` | Original speaker ID |
| `speaker_id` | Extracted speaker ID (e.g., `DE_B00000_S00010`) |
| `language_id` | Extracted language code (e.g., `DE`) |
| `language` | Language code lowercase |
| `emotion_caption` | Natural language description of the emotional content |
| `emotion_annotation` | Dictionary of 50+ emotion/prosody scores |
| `characters_per_second` | Speaking rate |
| `wavelm_timbre_embedding` | 128-dim speaker timbre embedding |

## Statistics

| Language | Tars | Size |
|----------|------|------|
| DE (German) | 243 | ~130 GB |
| EN (English) | 2,380 | ~2,476 GB |
| FR (French) | 298 | ~187 GB |
| JA (Japanese) | 96 | ~163 GB |
| KO (Korean) | 246 | ~79 GB |
| ZH (Chinese) | 929 | ~1,681 GB |
| **Total** | **4,192** | **~4,716 GB** |

~97% of samples include a same-speaker reference audio (quadruplets). The remaining ~3% are pairs where the speaker only appeared once across the entire dataset.
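
To check the quadruplet/pair layout described under Data Format in a downloaded shard, the member names of a tar file can be grouped by base key using only the standard library. The sketch below assumes that keys contain no dots, so that everything after the first dot of the file name is the extension (`mp3`, `json`, `ref.mp3`, `ref.json`). The helper names `split_member_name`, `classify_members`, and `classify_tar` are hypothetical and not part of any library.

```python
import tarfile
from collections import defaultdict

QUADRUPLET_EXTS = {"mp3", "json", "ref.mp3", "ref.json"}
PAIR_EXTS = {"mp3", "json"}


def split_member_name(name):
    """Split 'dir/key.ref.mp3' into ('dir/key', 'ref.mp3').

    The split happens at the first dot of the file name, which assumes
    that sample keys themselves contain no dots.
    """
    directory, _, filename = name.rpartition("/")
    base, _, ext = filename.partition(".")
    key = f"{directory}/{base}" if directory else base
    return key, ext


def classify_members(names):
    """Map each base key to 'quadruplet', 'pair', or 'incomplete'."""
    groups = defaultdict(set)
    for name in names:
        key, ext = split_member_name(name)
        groups[key].add(ext)

    result = {}
    for key, exts in groups.items():
        if QUADRUPLET_EXTS <= exts:
            result[key] = "quadruplet"
        elif PAIR_EXTS <= exts:
            result[key] = "pair"
        else:
            result[key] = "incomplete"  # missing target audio or metadata
    return result


def classify_tar(path):
    """Classify every sample in one local shard (reads only the tar index)."""
    with tarfile.open(path) as tar:
        return classify_members(m.name for m in tar.getmembers() if m.isfile())
```

Counting the values returned by `classify_tar` for a shard gives a rough local check against the ~97% quadruplet share quoted in the Statistics section, although individual shards may deviate from the dataset-wide figure.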

## Usage

```python
import webdataset as wds

dataset = wds.WebDataset("emolia_hq/EN/EN-B000000_standard_hq.tar")

for sample in dataset:
    key = sample["__key__"]
    target_audio = sample["mp3"]        # bytes
    target_meta = sample["json"]        # bytes -> json.loads()
    ref_audio = sample.get("ref.mp3")   # bytes or None
    ref_meta = sample.get("ref.json")   # bytes or None
```

## License

Same as the source Emolia dataset. See [laion/Emolia](https://huggingface.co/datasets/laion/Emolia) for details.
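
As a follow-up to the Usage loop, the raw bytes in each sample still need to be decoded, and pairs (no reference) need to be handled alongside quadruplets. The helper below is a minimal sketch of one way to do that. `decode_sample` is a hypothetical name, and it assumes each sample is a plain dict shaped like the ones in the Usage loop, with undecoded `bytes` values.

```python
import json


def decode_sample(sample):
    """Parse one raw sample dict into metadata dicts plus audio bytes.

    Assumes the layout from the Usage section: 'mp3' and 'json' are
    always present, 'ref.mp3' and 'ref.json' only for quadruplets.
    """
    ref_audio = sample.get("ref.mp3")
    ref_meta_raw = sample.get("ref.json")
    has_reference = ref_audio is not None and ref_meta_raw is not None

    return {
        "key": sample["__key__"],
        "target_audio": sample["mp3"],                 # still MP3 bytes
        "target_meta": json.loads(sample["json"]),     # bytes -> dict
        "ref_audio": ref_audio if has_reference else None,
        "ref_meta": json.loads(ref_meta_raw) if has_reference else None,
        "is_quadruplet": has_reference,
    }
```

Inside the loop this would be called as `decoded = decode_sample(sample)`. The audio is left as MP3 bytes on purpose, so that any audio decoder can be applied afterwards.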

Provider:
TTS-AGI