Emilia-Dataset
Source: ModelScope community (updated 2026-05-22, indexed 2024-08-31)
Download link: https://modelscope.cn/datasets/amphion/Emilia-Dataset
Description:
# Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
This is the official repository 👑 for the **Emilia** dataset and the source code for the **Emilia-Pipe** speech data preprocessing pipeline.
<div align="center"><img width="500px" src="https://github.com/user-attachments/assets/b1c1a1f8-3149-4f96-8eb4-af470152a9b7" /></div>
## News 🔥
- **2025/02/26**: *The Emilia-Large dataset, featuring over 200,000 hours of data, is now available!!!* Emilia-Large combines the original 101k-hour Emilia dataset (licensed under `CC BY-NC 4.0`) with the brand-new 114k-hour **Emilia-YODAS dataset** (licensed under `CC BY 4.0`)!!!
- **2025/01/27**: We release the extended version of Emilia's paper on [arXiv](https://arxiv.org/abs/2501.15907)! More experiments and more insights!
- **2024/12/04**: We present Emilia at the [IEEE SLT 2024](https://2024.ieeeslt.org/)!
- **2024/08/28**: Join Amphion's [Discord channel](https://discord.com/invite/ZxxREr3Y) to stay connected and engage with our community!
- **2024/08/27**: *The Emilia dataset is now publicly available!* Discover the most extensive and diverse speech generation dataset with 101k hours of in-the-wild speech data now at [HuggingFace](https://huggingface.co/datasets/amphion/Emilia-Dataset) or [OpenDataLab](https://opendatalab.com/Amphion/Emilia)! 👑👑👑
- **2024/07/08**: Our preprint [paper](https://arxiv.org/abs/2407.05361) is now available! 🔥🔥🔥
- **2024/07/03**: We welcome everyone to check our [homepage](https://emilia-dataset.github.io/Emilia-Demo-Page/) for a brief introduction to the Emilia dataset and our demos!
- **2024/07/01**: We release Emilia and Emilia-Pipe! We welcome everyone to explore them on our [GitHub](https://github.com/open-mmlab/Amphion/tree/main/preprocessors/Emilia)! 🎉🎉🎉
## Emilia-Large Overview ⭐️
The **Emilia-Large** dataset is a comprehensive, multilingual dataset with the following features:
- with *Emilia* containing over *101k* hours and *Emilia-YODAS* containing over *114k* hours of speech data;
- covering six different languages: *English (En), Chinese (Zh), German (De), French (Fr), Japanese (Ja), and Korean (Ko)*;
- containing diverse speech data with *various speaking styles* from diverse video platforms and podcasts on the Internet, covering various content genres such as talk shows, interviews, debates, sports commentary, and audiobooks.
The table below provides the duration statistics for each language in the dataset.
| Language | Emilia Duration (hours) | Emilia-YODAS Duration (hours) | Total Duration (hours) |
|:-----------:|:-----------------------:|:----------------------------:|:----------------------:|
| English | 46.8k | 92.2k | 139.0k |
| Chinese | 49.9k | 0.3k | 50.3k |
| German | 1.6k | 5.6k | 7.2k |
| French | 1.4k | 7.4k | 8.8k |
| Japanese | 1.7k | 1.1k | 2.8k |
| Korean | 0.2k | 7.3k | 7.5k |
| **Total** | **101.7k** | **113.9k** | **215.6k** |
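As a quick sanity check on the table, the sketch below recomputes each language's share of Emilia-Large from the listed totals. The numbers are copied from the "Total Duration" column (in thousands of hours); the variable names are illustrative only.

```python
# Total duration per language in thousands of hours, copied from the table above.
total_khours = {
    "English": 139.0,
    "Chinese": 50.3,
    "German": 7.2,
    "French": 8.8,
    "Japanese": 2.8,
    "Korean": 7.5,
}

grand_total = sum(total_khours.values())

# Share of each language, as a percentage of the summed total.
shares = {lang: 100.0 * hours / grand_total for lang, hours in total_khours.items()}

# Print languages from largest to smallest share.
for lang, pct in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang:<9}{total_khours[lang]:>7.1f}k h {pct:5.1f}%")
```

English makes up a little under two thirds of the total, so language-balanced training would likely need resampling.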
The **Emilia-Pipe** is the first open-source preprocessing pipeline designed to transform raw, in-the-wild speech data into high-quality training data with annotations for speech generation. This pipeline can process one hour of raw audio into model-ready data in just a few minutes, requiring only the raw speech data.
Detailed descriptions of Emilia and Emilia-Pipe can be found in our [paper](https://arxiv.org/abs/2407.05361) and its [extended version](https://arxiv.org/abs/2501.15907).
## Emilia Dataset Usage 📖
Emilia and Emilia-YODAS are publicly available on [HuggingFace](https://huggingface.co/datasets/amphion/Emilia-Dataset).
- Option 1: Download from HuggingFace:
  1. Gain access to the dataset and get an HF access token from: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
  2. Install dependencies and log in to HuggingFace:
     - Install Python
     - Run `pip install librosa soundfile datasets huggingface_hub[cli]`
     - Log in with `huggingface-cli login` and paste the HF access token. Check [here](https://huggingface.co/docs/huggingface_hub/guides/cli#huggingface-cli-login) for details.
  3. Use the following code to load Emilia and Emilia-YODAS:
```py
from datasets import load_dataset
dataset = load_dataset("amphion/Emilia-Dataset", streaming=True)
print(dataset) # features: ['json', 'mp3', '__key__', '__url__'], num_shards: 4343
print(next(iter(dataset['train'])))
```
- Option 2: Download from [OpenDataLab](https://opendatalab.com/Amphion/Emilia) (i.e., OpenXLab)
  - If you are in mainland China or have trouble connecting to HuggingFace, you can download Emilia from OpenDataLab.
  - Please follow the guidance [here](https://speechteam.feishu.cn/wiki/PC8Ew5igviqBiJkElMJcJxNonJc) to gain access.
  - Note: On OpenDataLab, Emilia is available, but Emilia-YODAS is not.
**ENJOY USING EMILIA!!!** 🔥
### Use cases
If you only want to use Emilia-YODAS, you can use:
```py
from datasets import load_dataset
path = "Emilia-YODAS/**/*.tar" # Same for Emilia; just replace "Emilia-YODAS/" with "Emilia/"
dataset = load_dataset("amphion/Emilia-Dataset", data_files={"train": path}, split="train", streaming=True)
print(dataset)  # n_shards should be 1983
print(next(iter(dataset)))
```
If you want to load a subset of Emilia/Emilia-YODAS, e.g., only language `DE`, you can use the following code:
```py
from datasets import load_dataset
path = "Emilia/DE/*.tar" # Same for Emilia-YODAS; just replace "Emilia/" with "Emilia-YODAS/"
dataset = load_dataset("amphion/Emilia-Dataset", data_files={"de": path}, split="de", streaming=True)
print(dataset)  # n_shards should be 90
print(next(iter(dataset)))
```
If you want to download all files to your local machine before using Emilia and Emilia-YODAS, remove the `streaming=True` argument:
```py
from datasets import load_dataset
dataset = load_dataset("amphion/Emilia-Dataset")
print(dataset)
```
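The per-utterance metadata can also be used to select samples while streaming. The sketch below is a minimal example, assuming each sample's `json` field carries the same keys as the JSONL records shown under *Structure on OpenDataLab* (`duration`, `dnsmos`, `language`); check this against a real sample before relying on it. The helper `keep_sample` and its thresholds are hypothetical, and the two records are synthetic stand-ins so the snippet runs offline.

```python
def keep_sample(sample, min_dnsmos=3.0, max_duration=30.0):
    """Return True if the sample passes the quality and length thresholds."""
    meta = sample["json"]  # assumed: per-utterance metadata dict
    return meta["dnsmos"] >= min_dnsmos and meta["duration"] <= max_duration


# Synthetic stand-ins for streamed samples (audio payload omitted).
samples = [
    {"__key__": "EN_B00000_S00000_W000000",
     "json": {"duration": 6.264, "dnsmos": 3.2927, "language": "en"}},
    {"__key__": "EN_B00000_S00000_W000001",
     "json": {"duration": 8.031, "dnsmos": 3.0442, "language": "en"}},
]

# Keep only utterances with a DNSMOS score of at least 3.1.
kept = [s["__key__"] for s in samples if keep_sample(s, min_dnsmos=3.1)]
print(kept)
```

With a streamed dataset, the same predicate could be passed to `dataset.filter(keep_sample)`, which evaluates it lazily as samples arrive.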
### Rebuilding Emilia or processing your own data
If you wish to re-build Emilia from scratch, you may download the raw audio files from the [provided URL list](https://huggingface.co/datasets/amphion/Emilia) and use our open-source [Emilia-Pipe](https://github.com/open-mmlab/Amphion/tree/main/preprocessors/Emilia) preprocessing pipeline to preprocess the raw data. Additionally, users can easily use Emilia-Pipe to preprocess their own raw speech data for custom needs. By open-sourcing the Emilia-Pipe code, we aim to enable the speech community to collaborate on large-scale speech generation research.
### Notes
1. Please note that Emilia does not own the copyright to the audio files; the copyright remains with the original owners of the videos or audio. Users are permitted to use the Emilia dataset only for non-commercial purposes under the `CC BY-NC 4.0` license.
2. For data in Emilia-YODAS, we download the raw data from [espnet/yodas2](https://huggingface.co/datasets/espnet/yodas2) and use the same license family: `CC BY 4.0`.
## Emilia Dataset Structure ⛪️
### Structure on HuggingFace
On HuggingFace, Emilia and Emilia-YODAS are formatted as [WebDataset](https://github.com/webdataset/webdataset).
Each audio file is packed into a tar archive together with its corresponding JSON file (which shares the same filename prefix), for a total of 4,343 tar files.
| Dataset | Size | # of Tars |
|----------------|--------|--------|
| Emilia | 2.4TB | 2,360 |
| Emilia-YODAS | 2.1TB | 1,983 |
| **Total** | 4.5TB | 4,343 |
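For planning downloads, the table implies an average shard size of roughly 1 GB. The sketch below derives that figure, assuming decimal units (1 TB = 1000 GB), which the table does not specify.

```python
# (size in TB, number of tar files), copied from the table above.
shards = {
    "Emilia": (2.4, 2360),
    "Emilia-YODAS": (2.1, 1983),
}

GB_PER_TB = 1000  # assumption: decimal units

# Average size of one tar file per subset, in GB.
avg_gb = {
    name: size_tb * GB_PER_TB / n_tars
    for name, (size_tb, n_tars) in shards.items()
}

for name, gb in avg_gb.items():
    print(f"{name}: about {gb:.2f} GB per tar")
```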
With WebDataset, you can easily stream audio data, which is much faster than reading separate data files one by one.
Read the *Emilia Dataset Usage 📖* part for a detailed usage guide.
Learn more about WebDataset [here](https://huggingface.co/docs/hub/datasets-webdataset).
*PS: If you want to download the `OpenDataLab` format from HuggingFace, you can set the `revision` argument to `fc71e07e8572f5f3be1dbd02ed3172a4d298f152`; [that revision](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07e8572f5f3be1dbd02ed3172a4d298f152) uses the old format.*
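To make the tar layout concrete, the sketch below groups the members of one shard into samples by filename prefix, using only the Python standard library. It builds a tiny in-memory tar so it runs without downloading anything; the member contents are made up for illustration, and `iter_samples` and `add_bytes` are hypothetical helpers rather than part of Emilia or the WebDataset library. For simplicity it splits names at the last dot, which is an assumption that fits names like `<id>.json` and `<id>.mp3`.

```python
import io
import json
import tarfile


def iter_samples(tar):
    """Yield (key, {extension: bytes}) for consecutive members sharing a prefix."""
    key, parts = None, {}
    for member in tar:
        if not member.isfile():
            continue
        prefix, _, ext = member.name.rpartition(".")
        if key is not None and prefix != key:
            yield key, parts
            parts = {}
        key = prefix
        parts[ext] = tar.extractfile(member).read()
    if key is not None:
        yield key, parts


def add_bytes(tar, name, data):
    """Add an in-memory byte string to the tar under the given name."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


# Build a toy shard: each utterance has a .json and a .mp3 member.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for utt in ("EN_B00000_S00000_W000000", "EN_B00000_S00000_W000001"):
        add_bytes(tar, f"{utt}.json", json.dumps({"id": utt}).encode("utf-8"))
        add_bytes(tar, f"{utt}.mp3", b"\x00fake-audio")  # placeholder, not real MP3

# Read it back and pair the members up again.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tar:
    samples = list(iter_samples(tar))

for key, parts in samples:
    print(key, sorted(parts), json.loads(parts["json"])["id"])
```

In practice `load_dataset(..., streaming=True)` does this pairing for you; the sketch is only meant to show why a shard can be read sequentially without an index.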
### Structure on OpenDataLab
On OpenDataLab, Emilia is formatted using the following structure. *Note: On OpenDataLab, Emilia is available, but Emilia-YODAS is not.*
Structure example:
```
|-- openemilia_all.tar.gz (all .JSONL files are gzipped with directory structure in this file)
|-- EN (114 batches)
| |-- EN_B00000.jsonl
| |-- EN_B00000 (= EN_B00000.tar.gz)
| | |-- EN_B00000_S00000
| | | `-- mp3
| | | |-- EN_B00000_S00000_W000000.mp3
| | | `-- EN_B00000_S00000_W000001.mp3
| | |-- ...
| |-- ...
| |-- EN_B00113.jsonl
| `-- EN_B00113
|-- ZH (92 batches)
|-- DE (9 batches)
|-- FR (10 batches)
|-- JA (7 batches)
|-- KO (4 batches)
```
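The naming scheme in the tree suggests that an utterance ID encodes its language, batch, and speaker. The sketch below splits an ID and rebuilds the relative MP3 path, assuming the `LANG_Bxxxxx_Sxxxxx_Wxxxxxx` pattern seen in the example holds for every file; `utterance_path` is a hypothetical helper, not part of Emilia-Pipe.

```python
import re

# Pattern inferred from the example tree: language, batch, speaker, utterance.
ID_PATTERN = re.compile(r"^([A-Z]{2})_(B\d{5})_(S\d{5})_(W\d{6})$")


def utterance_path(utt_id):
    """Map an utterance ID to its MP3 path relative to the language directory."""
    match = ID_PATTERN.match(utt_id)
    if match is None:
        raise ValueError(f"unexpected utterance ID: {utt_id!r}")
    lang, batch, speaker, _ = match.groups()
    batch_dir = f"{lang}_{batch}"
    speaker_dir = f"{batch_dir}_{speaker}"
    return f"{batch_dir}/{speaker_dir}/mp3/{utt_id}.mp3"


print(utterance_path("EN_B00000_S00000_W000000"))
```

The result has the same shape as the `wav` field in the JSONL records, so the function can serve as a consistency check when unpacking batches.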
JSONL files example:
```
{"id": "EN_B00000_S00000_W000000", "wav": "EN_B00000/EN_B00000_S00000/mp3/EN_B00000_S00000_W000000.mp3", "text": " You can help my mother and you- No. You didn't leave a bad situation back home to get caught up in another one here. What happened to you, Los Angeles?", "duration": 6.264, "speaker": "EN_B00000_S00000", "language": "en", "dnsmos": 3.2927}
{"id": "EN_B00000_S00000_W000001", "wav": "EN_B00000/EN_B00000_S00000/mp3/EN_B00000_S00000_W000001.mp3", "text": " Honda's gone, 20 squads done. X is gonna split us up and put us on different squads. The team's come and go, but 20 squad, can't believe it's ending.", "duration": 8.031, "speaker": "EN_B00000_S00000", "language": "en", "dnsmos": 3.0442}
```
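Records in this format can be read one line at a time with the standard `json` module. The sketch below sums the speech duration per speaker, using the two records above with the `wav` and `text` fields omitted for brevity; the variable names are illustrative.

```python
import json
from collections import defaultdict

# Two records in the format shown above (wav and text fields omitted).
jsonl_text = """\
{"id": "EN_B00000_S00000_W000000", "duration": 6.264, "speaker": "EN_B00000_S00000", "language": "en", "dnsmos": 3.2927}
{"id": "EN_B00000_S00000_W000001", "duration": 8.031, "speaker": "EN_B00000_S00000", "language": "en", "dnsmos": 3.0442}
"""

# Accumulate total seconds of speech for each speaker ID.
seconds_per_speaker = defaultdict(float)
for line in jsonl_text.splitlines():
    if not line.strip():
        continue  # skip blank lines
    record = json.loads(line)
    seconds_per_speaker[record["speaker"]] += record["duration"]

for speaker, seconds in seconds_per_speaker.items():
    print(f"{speaker}: {seconds:.3f} s")
```

For a real file such as `EN_B00000.jsonl`, iterate over `open(path, encoding="utf-8")` instead of the inline string.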
## Reference 📖
If you use the Emilia dataset or the Emilia-Pipe pipeline, please cite the following papers:
```bibtex
@inproceedings{emilialarge,
author={He, Haorui and Shang, Zengqiang and Wang, Chaoren and Li, Xuyuan and Gu, Yicheng and Hua, Hua and Liu, Liwei and Yang, Chen and Li, Jiaqi and Shi, Peiyang and Wang, Yuancheng and Chen, Kai and Zhang, Pengyuan and Wu, Zhizheng},
title={Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation},
booktitle={arXiv:2501.15907},
year={2025}
}
@inproceedings{emilia,
author={He, Haorui and Shang, Zengqiang and Wang, Chaoren and Li, Xuyuan and Gu, Yicheng and Hua, Hua and Liu, Liwei and Yang, Chen and Li, Jiaqi and Shi, Peiyang and Wang, Yuancheng and Chen, Kai and Zhang, Pengyuan and Wu, Zhizheng},
title={Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation},
booktitle={Proc.~of SLT},
year={2024}
}
@inproceedings{amphion,
author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
booktitle={Proc.~of SLT},
year={2024}
}
```
Provider: maas
Created: 2024-10-28



