emilia (ModelScope community; updated 2025-12-05, indexed 2024-11-16)
Download link: https://modelscope.cn/datasets/pengzhendong/emilia
# Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
This is the official repository 👑 for the **Emilia** dataset and the source code for the **Emilia-Pipe** speech data preprocessing pipeline.
<div align="center"><img width="500px" src="https://github.com/user-attachments/assets/b1c1a1f8-3149-4f96-8eb4-af470152a9b7" /></div>
## News 🔥
- **2024/08/28**: Join Amphion's [Discord channel](https://discord.com/invite/ZxxREr3Y) to stay connected and engage with our community!
- **2024/08/27**: *The Emilia dataset is now publicly available!* Discover the most extensive and diverse speech generation dataset with 101k hours of in-the-wild speech data now at [HuggingFace](https://huggingface.co/datasets/amphion/Emilia-Dataset) or [OpenDataLab](https://opendatalab.com/Amphion/Emilia)! 👑👑👑
- **2024/07/08**: Our preprint [paper](https://arxiv.org/abs/2407.05361) is now available! 🔥🔥🔥
- **2024/07/03**: We welcome everyone to visit our [homepage](https://emilia-dataset.github.io/Emilia-Demo-Page/) for a brief introduction to the Emilia dataset and our demos!
- **2024/07/01**: We release Emilia and Emilia-Pipe! We welcome everyone to explore them on our [GitHub](https://github.com/open-mmlab/Amphion/tree/main/preprocessors/Emilia)! 🎉🎉🎉
## Emilia Overview ⭐️
The **Emilia** dataset is a comprehensive, multilingual dataset with the following features:
- containing over *101k* hours of speech data;
- covering six different languages: *English (En), Chinese (Zh), German (De), French (Fr), Japanese (Ja), and Korean (Ko)*;
- containing diverse speech data with *various speaking styles* from diverse video platforms and podcasts on the Internet, covering various content genres such as talk shows, interviews, debates, sports commentary, and audiobooks.
The table below provides the duration statistics for each language in the dataset.
| Language | Duration (hours) |
|:-----------:|:----------------:|
| English | 46,828 |
| Chinese | 49,922 |
| German | 1,590 |
| French | 1,381 |
| Japanese | 1,715 |
| Korean | 217 |
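As a quick sanity check on the "over 101k hours" figure, the per-language durations in the table can be summed directly. The snippet below is only an illustrative sketch; the dictionary simply restates the table, and the variable names are not part of any Emilia tooling.

```python
# Durations in hours, copied from the table above.
DURATION_HOURS = {
    "English": 46_828,
    "Chinese": 49_922,
    "German": 1_590,
    "French": 1_381,
    "Japanese": 1_715,
    "Korean": 217,
}

total_hours = sum(DURATION_HOURS.values())

# Share of the total contributed by each language, in percent.
share_percent = {
    lang: round(100 * hours / total_hours, 1)
    for lang, hours in DURATION_HOURS.items()
}

print(f"total: {total_hours:,} hours")
for lang, pct in share_percent.items():
    print(f"{lang:<9}{pct:>5}%")
```

The sum comes to 101,653 hours, with English and Chinese together accounting for roughly 95% of the data.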
**Emilia-Pipe** is the first open-source preprocessing pipeline designed to transform raw, in-the-wild speech data into high-quality training data with annotations for speech generation. The pipeline needs only the raw speech data as input and can process one hour of raw audio into model-ready data in just a few minutes.
Detailed descriptions of Emilia and Emilia-Pipe can be found in our [paper](https://arxiv.org/abs/2407.05361).
## Emilia Dataset Usage 📖
Emilia is publicly available at [HuggingFace](https://huggingface.co/datasets/amphion/Emilia-Dataset).
If you are in mainland China or have trouble connecting to HuggingFace, you can also download Emilia from [OpenDataLab](https://opendatalab.com/Amphion/Emilia).
- To download from HuggingFace:
  1. Gain access to the dataset and get an HF access token from [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
  2. Install the dependencies and log in to HF:
     - Install Python.
     - Run `pip install librosa soundfile datasets huggingface_hub[cli]`.
     - Log in with `huggingface-cli login` and paste the HF access token. Check [here](https://huggingface.co/docs/huggingface_hub/guides/cli#huggingface-cli-login) for details.
  3. Use the following code to load Emilia:
```py
from datasets import load_dataset
dataset = load_dataset("amphion/Emilia-Dataset", streaming=True)
print(dataset)
print(next(iter(dataset['train'])))
```
- To download from OpenDataLab (i.e., OpenXLab), please follow the guidance [here](https://speechteam.feishu.cn/wiki/PC8Ew5igviqBiJkElMJcJxNonJc) to gain access.
**ENJOY USING EMILIA!!!** 🔥
### Use cases
If you want to load a subset of Emilia, e.g., only language `DE`, you can use the following code:
```py
from datasets import load_dataset
path = "DE/*.tar"
dataset = load_dataset("amphion/Emilia-Dataset", data_files={"de": path}, split="de", streaming=True)
print(dataset)  # should show only 90 n_shards instead of 2360
print(next(iter(dataset)))  # split="de" already returns a single split, so there is no 'train' key
```
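To load several languages at once, the `data_files` mapping can hold one entry per language. The helper below is a hypothetical sketch, not part of the Emilia tooling, and it assumes that every language is stored under a `<LANG>/*.tar` pattern like the `DE/*.tar` example above. The `load_dataset` call is left commented out because it needs network access and an HF login.

```python
def language_data_files(languages):
    """Map lowercase split names to tar glob patterns, e.g. {"de": "DE/*.tar"}.

    Assumes each language lives under "<LANG>/*.tar", as in the DE example.
    """
    return {lang.lower(): f"{lang.upper()}/*.tar" for lang in languages}


data_files = language_data_files(["DE", "FR"])
print(data_files)

# Without a `split` argument, load_dataset returns one split per mapping key:
# from datasets import load_dataset
# dataset = load_dataset("amphion/Emilia-Dataset", data_files=data_files, streaming=True)
# print(next(iter(dataset["de"])))
```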
If you want to download all files to local storage before using Emilia, remove the `streaming=True` argument:
```py
from datasets import load_dataset
dataset = load_dataset("amphion/Emilia-Dataset") # prepare 2.4TB space to store Emilia
print(dataset)
```
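Before starting a full download, it may be worth checking that the target disk really has room. The sketch below uses only the standard library; the function name is hypothetical, and the 2.4 TB figure is read here as decimal terabytes, which is an assumption.

```python
import shutil

# 2.4 TB from the comment above, interpreted as decimal terabytes (assumption).
REQUIRED_BYTES = int(2.4e12)


def has_enough_space(path, required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem containing `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


# Check the current directory; pass the directory where the data will be stored instead.
print(has_enough_space("."))
```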
### Rebuilding Emilia or processing your own data
If you wish to re-build Emilia from scratch, you may download the raw audio files from the [provided URL list](https://huggingface.co/datasets/amphion/Emilia) and use our open-source [Emilia-Pipe](https://github.com/open-mmlab/Amphion/tree/main/preprocessors/Emilia) preprocessing pipeline to preprocess the raw data. Additionally, users can easily use Emilia-Pipe to preprocess their own raw speech data for custom needs. By open-sourcing the Emilia-Pipe code, we aim to enable the speech community to collaborate on large-scale speech generation research.
### Notes
*Please note that Emilia does not own the copyright to the audio files; the copyright remains with the original owners of the videos or audio. Users are permitted to use this dataset only for non-commercial purposes under the CC BY-NC-4.0 license.*
## Emilia Dataset Structure ⛪️
### Structure on HuggingFace
On HuggingFace, Emilia is now formatted as [WebDataset](https://github.com/webdataset/webdataset).
Each audio file is tarred together with its corresponding JSON file (sharing the same filename prefix), and the pairs are distributed across 2360 tar files.
By utilizing WebDataset, you can easily stream audio data, which is an order of magnitude faster than reading separate data files one by one.
Read the *Emilia Dataset Usage 📖* part for a detailed usage guide.
Learn more about WebDataset [here](https://huggingface.co/docs/hub/datasets-webdataset).
*PS: If you want to download the `OpenDataLab` format from HuggingFace, you can set the `revision` argument to `fc71e07e8572f5f3be1dbd02ed3172a4d298f152`, [which](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07e8572f5f3be1dbd02ed3172a4d298f152) is the old format.*
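To illustrate the pairing described in this section (an audio file and a JSON file sharing the same filename prefix inside a tar shard), here is a minimal standard-library sketch. It is not how `datasets` or WebDataset read the shards internally; the `mp3` extension and the helper names are assumptions, and the demo uses a tiny in-memory tar with fake audio bytes rather than a real shard.

```python
import io
import json
import tarfile


def iter_samples(tar_fileobj):
    """Yield (key, audio_bytes, metadata) for each audio/JSON pair in a tar shard."""
    pending = {}  # key -> {extension: raw bytes}
    with tarfile.open(fileobj=tar_fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # "EN_..._W000000.mp3" -> key "EN_..._W000000", extension "mp3"
            key, _, ext = member.name.rpartition(".")
            pending.setdefault(key, {})[ext] = tar.extractfile(member).read()
            parts = pending[key]
            if "json" in parts and "mp3" in parts:  # assumes audio is stored as .mp3
                yield key, parts["mp3"], json.loads(parts["json"])
                del pending[key]


def _add(tar, name, payload):
    """Add an in-memory file to an open tar archive (demo helper)."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


# Build a tiny fake shard in memory and read it back.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    _add(tar, "EN_B00000_S00000_W000000.mp3", b"fake-audio")
    _add(tar, "EN_B00000_S00000_W000000.json", json.dumps({"duration": 6.264}).encode())
buffer.seek(0)

samples = list(iter_samples(buffer))
print(samples[0][0], samples[0][2])
```

Because paired files sit next to each other in the archive, a reader can emit samples in one sequential pass, which is what makes streaming from tar shards efficient.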
### Structure on OpenDataLab
On OpenDataLab, Emilia is formatted using the following structure.
Structure example:
```
|-- openemilia_all.tar.gz (all .JSONL files are gzipped with directory structure in this file)
|-- EN (114 batches)
| |-- EN_B00000.jsonl
| |-- EN_B00000 (= EN_B00000.tar.gz)
| | |-- EN_B00000_S00000
| | | `-- mp3
| | | |-- EN_B00000_S00000_W000000.mp3
| | | `-- EN_B00000_S00000_W000001.mp3
| | |-- ...
| |-- ...
| |-- EN_B00113.jsonl
| `-- EN_B00113
|-- ZH (92 batches)
|-- DE (9 batches)
|-- FR (10 batches)
|-- JA (7 batches)
|-- KO (4 batches)
```
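The file names in this layout encode where each file lives, so a path can be derived from an ID alone. The sketch below is a hypothetical helper: the digit widths are inferred from the example names, and reading `B`, `S`, and `W` as batch, speaker, and utterance indices is an interpretation rather than something the layout states explicitly.

```python
import re
from typing import NamedTuple

# Pattern inferred from names such as EN_B00000_S00000_W000000 (assumption).
ID_PATTERN = re.compile(r"^([A-Z]{2})_B(\d{5})_S(\d{5})_W(\d{6})$")


class UtteranceId(NamedTuple):
    language: str   # e.g. "EN"
    batch: str      # e.g. "EN_B00000"
    speaker: str    # e.g. "EN_B00000_S00000"
    utterance: str  # the full ID


def parse_utterance_id(utt_id):
    """Split an ID into its language, batch, and speaker components."""
    match = ID_PATTERN.match(utt_id)
    if match is None:
        raise ValueError(f"unexpected ID format: {utt_id!r}")
    lang, batch_no, speaker_no, _ = match.groups()
    batch = f"{lang}_B{batch_no}"
    speaker = f"{batch}_S{speaker_no}"
    return UtteranceId(lang, batch, speaker, utt_id)


def relative_mp3_path(utt_id):
    """Path of the mp3 file relative to the language directory."""
    parts = parse_utterance_id(utt_id)
    return f"{parts.batch}/{parts.speaker}/mp3/{utt_id}.mp3"


print(relative_mp3_path("EN_B00000_S00000_W000001"))
```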
JSONL files example:
```
{"id": "EN_B00000_S00000_W000000", "wav": "EN_B00000/EN_B00000_S00000/mp3/EN_B00000_S00000_W000000.mp3", "text": " You can help my mother and you- No. You didn't leave a bad situation back home to get caught up in another one here. What happened to you, Los Angeles?", "duration": 6.264, "speaker": "EN_B00000_S00000", "language": "en", "dnsmos": 3.2927}
{"id": "EN_B00000_S00000_W000001", "wav": "EN_B00000/EN_B00000_S00000/mp3/EN_B00000_S00000_W000001.mp3", "text": " Honda's gone, 20 squads done. X is gonna split us up and put us on different squads. The team's come and go, but 20 squad, can't believe it's ending.", "duration": 8.031, "speaker": "EN_B00000_S00000", "language": "en", "dnsmos": 3.0442}
```
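Records in this format can be read with the standard `json` module. The sketch below parses a trimmed copy of the two example records (the `text`, `wav`, and `speaker` fields are omitted for brevity), sums the durations, and keeps only records above a DNSMOS threshold. The threshold of 3.1 and the function names are hypothetical, chosen purely for illustration.

```python
import json

# Trimmed copies of the two example records above.
SAMPLE_JSONL = """\
{"id": "EN_B00000_S00000_W000000", "duration": 6.264, "language": "en", "dnsmos": 3.2927}
{"id": "EN_B00000_S00000_W000001", "duration": 8.031, "language": "en", "dnsmos": 3.0442}
"""


def load_records(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def filter_by_dnsmos(records, threshold):
    """Keep records whose DNSMOS score is at least `threshold`."""
    return [r for r in records if r["dnsmos"] >= threshold]


records = load_records(SAMPLE_JSONL.splitlines())
kept = filter_by_dnsmos(records, threshold=3.1)  # hypothetical threshold
total_seconds = sum(r["duration"] for r in records)

print([r["id"] for r in kept])
print(round(total_seconds, 3))
```

For a real batch, the same functions can be applied to an open file, for example `load_records(open("EN_B00000.jsonl", encoding="utf-8"))`.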
## Reference 📖
If you use the Emilia dataset or the Emilia-Pipe pipeline, please cite the following papers:
```bibtex
@inproceedings{emilia,
author={He, Haorui and Shang, Zengqiang and Wang, Chaoren and Li, Xuyuan and Gu, Yicheng and Hua, Hua and Liu, Liwei and Yang, Chen and Li, Jiaqi and Shi, Peiyang and Wang, Yuancheng and Chen, Kai and Zhang, Pengyuan and Wu, Zhizheng},
title={Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation},
booktitle={Proc.~of SLT},
year={2024}
}
```
```bibtex
@inproceedings{amphion,
author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
booktitle={Proc.~of SLT},
year={2024}
}
```
Provider: maas
Created: 2024-11-28



