X-Voice-Dataset-Train
Source: ModelScope community. Updated 2026-05-17; indexed 2026-05-10.
Download link: https://modelscope.cn/datasets/sunnyxrxrx/X-Voice-Dataset-Train
---
# X-Voice Training Dataset
## Overview
The X-Voice training dataset is a **large-scale multilingual speech corpus** curated for high-performance speech models. It provides a robust foundation for cross-lingual phonetic and prosodic modeling.
It is also the training set of the [X-Voice model](https://github.com/sunnyxrxrx/X-Voice).
## Core Statistics
- **Total Speech Duration**: 420K hours
- **30 languages**
  - **European**: bg (Bulgarian), cs (Czech), da (Danish), de (German), el (Greek), en (English), es (Spanish), et (Estonian), fi (Finnish), fr (French), hr (Croatian), hu (Hungarian), it (Italian), lt (Lithuanian), lv (Latvian), mt (Maltese), nl (Dutch), pl (Polish), pt (Portuguese), ro (Romanian), ru (Russian), sk (Slovak), sl (Slovenian), sv (Swedish).
  - **Asian**: id (Indonesian), ja (Japanese), ko (Korean), th (Thai), vi (Vietnamese), zh (Chinese).
<img src="image-1.png" alt="Duration Statistics of Different Languages" width="420" />
## Data Sources
We aggregate high-quality open-source speech datasets across languages:
- **Chinese & English**: Emilia
- **Vietnamese, Thai, Indonesian**: GigaSpeech 2
- **Korean**: KoreaSpeech
- **Japanese**: ReazonSpeech
- **Russian**: LEMAS
- **European Languages (Spanish, Italian, French, etc.)**: Multilingual LibriSpeech (MLS), Granary
| Source       | Format | Sample Rate |
|--------------|--------|-------------|
| Emilia       | mp3    | 24 kHz      |
| GigaSpeech 2 | flac   | 16 kHz      |
| KoreaSpeech  | flac   | 16 kHz      |
| ReazonSpeech | flac   | 16 kHz      |
| LEMAS        | mp3    | 16 kHz      |
| MLS          | flac   | 16 kHz      |
| Granary      | ogg    | 16 kHz      |
> **Format Note**:
> For datasets originally distributed in FLAC format, we **retain the lossless FLAC files in their original state**, without recompressing them into lossy formats like MP3 or OGG.
## Processing Pipeline
A rigorous multi-stage filtering pipeline is applied to ensure data quality:
1. **Duration & Speaking Rate Filtering**: Remove segments shorter than 0.5 s or longer than 30 s; filter the rest by language-specific speaking-rate thresholds.
2. **Language Validation**: Verify that each transcript's language matches its language label using `langdetect`.
3. **Deduplication**: Remove duplicate texts that appear more than 20 times, to reduce the risk of overfitting.
4. **Acoustic Quality Control**: Filter low-quality audio via DNSMOS speech quality assessment.
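The first and third steps can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline's actual code: the function names, the `(text, duration_s)` tuple layout, the characters-per-second rate metric, and the rate bounds are all assumptions. Step 3 is read here as "drop every text that occurs more than 20 times"; capping each text at 20 occurrences would be another plausible reading. Language validation and DNSMOS scoring are left out because they need external models.

```python
from collections import Counter

MIN_DURATION_S = 0.5    # lower bound stated in step 1
MAX_DURATION_S = 30.0   # upper bound stated in step 1
MAX_TEXT_REPEATS = 20   # repeat limit stated in step 3


def keep_by_duration(duration_s):
    """Return True when a segment lies inside the stated duration window."""
    return MIN_DURATION_S <= duration_s <= MAX_DURATION_S


def keep_by_speaking_rate(text, duration_s, min_cps, max_cps):
    """Use characters per second as a stand-in for speaking rate (assumed metric)."""
    rate = len(text) / duration_s
    return min_cps <= rate <= max_cps


def filter_segments(segments, min_cps=1.0, max_cps=30.0):
    """Apply duration, speaking-rate, and repeat-count filters.

    `segments` is a list of (text, duration_s) tuples. The default rate
    bounds are placeholders, not the dataset's real per-language thresholds.
    """
    kept = [
        (text, dur)
        for text, dur in segments
        # The duration check runs first, so a zero duration never reaches the division.
        if keep_by_duration(dur) and keep_by_speaking_rate(text, dur, min_cps, max_cps)
    ]
    # Assumed reading of step 3: drop all copies of an over-represented text.
    counts = Counter(text for text, _ in kept)
    return [(text, dur) for text, dur in kept if counts[text] <= MAX_TEXT_REPEATS]
```

In practice `min_cps` and `max_cps` would be chosen per language, since character-based rates differ widely between, for example, Chinese and English.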
## Highlights
- Diverse linguistic and temporal distribution
- High-quality cleaned speech-text training pairs
- Optimized for multilingual speech modeling and generalization
## Data Structure
```text
X-Voice-Dataset-Train/
├── tars/                      # Speech Data
│   ├── bg/
│   │   ├── bg_vox_part001.tar
│   │   ├── bg_vox_part002.tar
│   │   └── ...
│   ├── ...
│   └── zh/
│       ├── zh_emilia_part001.tar
│       ├── zh_emilia_part002.tar
│       └── ...
│
├── csvs/                      # Transcript Data
│   ├── metadata_bg_voxpopuli.csv
│   ├── ...
│   └── metadata_zh_emilia.csv
│
└── csvs_stage2/               # Transcript Data for Stage 2 Finetuning
    ├── metadata_bg_voxpopuli.csv
    ├── ...
    └── metadata_zh_emilia.csv
```
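After downloading, it can be useful to check which shards and metadata files actually arrived. The sketch below walks the layout shown above; the helper names `list_shards` and `list_metadata` are hypothetical, and the only assumptions about the data are the directory names and the `*.tar` / `metadata_*.csv` patterns visible in the tree.

```python
from pathlib import Path


def list_shards(dataset_root):
    """Map each language code under tars/ to its sorted .tar shard names."""
    tars_dir = Path(dataset_root) / "tars"
    shards = {}
    for lang_dir in sorted(p for p in tars_dir.iterdir() if p.is_dir()):
        shards[lang_dir.name] = sorted(f.name for f in lang_dir.glob("*.tar"))
    return shards


def list_metadata(dataset_root, stage2=False):
    """Return the sorted metadata CSV names from csvs/ or csvs_stage2/."""
    csv_dir = Path(dataset_root) / ("csvs_stage2" if stage2 else "csvs")
    return sorted(f.name for f in csv_dir.glob("metadata_*.csv"))
```

For example, `list_shards("X-Voice-Dataset-Train")` returns a dictionary keyed by language code, which makes it easy to spot a language whose shard list is empty or incomplete.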
## Use the Dataset
### CLI Download
```bash
modelscope download --dataset sunnyxrxrx/X-Voice-Dataset-Train --local_dir [local path you want to place the dataset]
```
### Extract
```bash
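cd [local path you want to place the dataset]
for lang in tars/*/; do
    lang_name=$(basename "$lang")
    mkdir -p "wavs/$lang_name"
    # tar reads a single archive per -f, so extract each shard separately;
    # extra file arguments would be treated as member names inside the first shard.
    for archive in "$lang"*.tar; do
        tar xf "$archive" -C "wavs/$lang_name" --strip-components=1 --skip-old-files
    done
done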
```
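The shell loop above depends on tar options such as `--skip-old-files`, which may not be available in every tar implementation. As a hedged alternative, the sketch below mimics the same behavior for one shard with Python's standard `tarfile` module. The function name `extract_shard` is hypothetical, and the sketch assumes each archive stores its files under a single leading directory, as the `--strip-components=1` flag implies.

```python
import shutil
import tarfile
from pathlib import Path, PurePosixPath


def extract_shard(archive_path, dest_dir):
    """Extract one shard into dest_dir and return the number of files written.

    The leading path component of every member is dropped, and files that
    already exist at the destination are left untouched.
    """
    dest = Path(dest_dir)
    extracted = 0
    with tarfile.open(archive_path) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and special entries
            parts = PurePosixPath(member.name).parts[1:]  # like --strip-components=1
            if not parts or ".." in parts:
                continue  # nothing left after stripping, or an unsafe path
            target = dest.joinpath(*parts)
            if target.exists():  # like --skip-old-files
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted += 1
    return extracted
```

Calling it once per archive, for instance `extract_shard("tars/bg/bg_vox_part001.tar", "wavs/bg")`, reproduces one iteration of the shell loop, and re-running it after an interrupted extraction only writes the files that are still missing.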
You can then refer to [X-Voice Training](https://github.com/sunnyxrxrx/X-Voice/blob/main/src/x_voice/train/README.md) for the subsequent training process.
## License
**This dataset contains data from multiple sources, each with its own license.**
Users must comply with the license of each individual sub-dataset they use.
| Dataset | License | Commercial Use |
|---|---|---|
| Multilingual LibriSpeech | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | Yes |
| Emilia | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | Yes |
| LEMAS | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | Yes |
| VoxPopuli | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) + [European Parliament's legal notice](https://www.europarl.europa.eu/legal-notice/en/) for the raw data | Yes |
| Granary (MOSEL Part) | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | Yes |
| GigaSpeech 2 | [License agreement required](https://huggingface.co/datasets/speechcolab/gigaspeech2) | See terms |
| ReazonSpeech | [CDLA-Sharing-1.0](https://cdla.dev/sharing-1-0/) + **only for the purposes covered by Article 30-4 of the [Japanese Copyright Act](https://www.cric.or.jp/english/clj/cl2.html).** | See terms |
| KoreaSpeech | Refer to the [repo](https://huggingface.co/datasets/jp1924/KoreaSpeech) | |
Provider: maas
Created: 2026-04-19



