
X-Voice-Dataset-Train

ModelScope Community. Updated 2026-05-17; indexed 2026-05-10.
Download link:
https://modelscope.cn/datasets/sunnyxrxrx/X-Voice-Dataset-Train
Resource description:
# X-Voice Training Dataset

## Overview

The X-Voice training dataset is a **large-scale multilingual speech corpus** curated for high-performance speech models. It provides a robust foundation for cross-lingual phonetic and prosodic modeling. It is also the training set of the [X-Voice Model](https://github.com/sunnyxrxrx/X-Voice).

## Core Statistics

- **Total Speech Duration**: 420K hours
- **30 languages**
  - **European**: bg (Bulgarian), cs (Czech), da (Danish), de (German), el (Greek), en (English), es (Spanish), et (Estonian), fi (Finnish), fr (French), hr (Croatian), hu (Hungarian), it (Italian), lt (Lithuanian), lv (Latvian), mt (Maltese), nl (Dutch), pl (Polish), pt (Portuguese), ro (Romanian), ru (Russian), sk (Slovak), sl (Slovenian), sv (Swedish).
  - **Asian**: id (Indonesian), ja (Japanese), ko (Korean), th (Thai), vi (Vietnamese), zh (Chinese).

<img src="image-1.png" alt="Duration Statistics of Different Languages" width="420" />

## Data Sources

We aggregate high-quality open-source speech datasets across languages:

- **Chinese & English**: Emilia
- **Vietnamese, Thai, Indonesian**: GigaSpeech 2
- **Korean**: KoreaSpeech
- **Japanese**: ReazonSpeech
- **Russian**: LEMAS
- **European Languages (Spanish, Italian, French, etc.)**: Multilingual LibriSpeech (MLS), Granary

| Source       | Format | Sample Rate |
|--------------|--------|-------------|
| Emilia       | mp3    | 24 kHz      |
| GigaSpeech 2 | flac   | 16 kHz      |
| KoreaSpeech  | flac   | 16 kHz      |
| ReazonSpeech | flac   | 16 kHz      |
| LEMAS        | mp3    | 16 kHz      |
| MLS          | flac   | 16 kHz      |
| Granary      | ogg    | 16 kHz      |

> **Format Note**:
> For datasets originally distributed in FLAC format, we **retain the lossless FLAC files in their original state**, without recompressing them into lossy formats like MP3 or OGG.

## Processing Pipeline

A rigorous multi-stage filtering pipeline is applied to ensure data quality:

1. **Duration & Speaking Rate Filtering**: Remove segments shorter than 0.5 s or longer than 30 s; filter by language-specific speaking rate thresholds.
2. **Language Validation**: Verify that each transcript is in the expected language using `langdetect`.
3. **Deduplication**: Remove duplicate texts appearing more than 20 times to avoid overfitting.
4. **Acoustic Quality Control**: Filter low-quality audio via DNSMOS speech quality assessment.

## Highlights

- Diverse linguistic and temporal distribution
- High-quality cleaned speech-text training pairs
- Optimized for multilingual speech modeling and generalization

## Data Structure

```text
X-Voice-Dataset-Train/
├── tars/                     # Speech Data
│   ├── bg/
│   │   ├── bg_vox_part001.tar
│   │   ├── bg_vox_part002.tar
│   │   └── ...
│   ├── ...
│   └── zh/
│       ├── zh_emilia_part001.tar
│       ├── zh_emilia_part002.tar
│       └── ...
│
├── csvs/                     # Transcript Data
│   ├── metadata_bg_voxpopuli.csv
│   ├── ...
│   └── metadata_zh_emilia.csv
│
└── csvs_stage2/              # Transcript Data for Stage 2 Finetuning
    ├── metadata_bg_voxpopuli.csv
    ├── ...
    └── metadata_zh_emilia.csv
```

## Use the Dataset

### CLI Download

```
modelscope download --dataset sunnyxrxrx/X-Voice-Dataset-Train --local_dir [local path you want to place the dataset]
```

### Unzip

```bash
cd [local path you want to place the dataset]
for lang in tars/*/; do
    lang_name=$(basename "$lang")
    mkdir -p "wavs/$lang_name"
    # tar reads a single archive per invocation, so extract each part separately
    for part in "$lang"*.tar; do
        tar xf "$part" -C "wavs/$lang_name" --strip-components=1 --skip-old-files
    done
done
```

Then refer to [X-Voice Training](https://github.com/sunnyxrxrx/X-Voice/blob/main/src/x_voice/train/README.md) for the subsequent training process.

## License

**This dataset contains data from multiple sources, each with its own license.** Users must comply with the license of each individual sub-dataset they use.

| Dataset | License | Commercial Use |
|---|---|---|
| Multilingual LibriSpeech | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | Yes |
| Emilia | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | Yes |
| LEMAS | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | Yes |
| VoxPopuli | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) + [European Parliament's legal notice](https://www.europarl.europa.eu/legal-notice/en/) for the raw data | Yes |
| Granary (MOSEL Part) | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | Yes |
| GigaSpeech 2 | [License agreement required](https://huggingface.co/datasets/speechcolab/gigaspeech2) | See terms |
| ReazonSpeech | [CDLA-Sharing-1.0](https://cdla.dev/sharing-1-0/) + **only for the purpose of [Japanese Copyright Act](https://www.cric.or.jp/english/clj/cl2.html) Article 30-4.** | See terms |
| KoreaSpeech | Refer to the [repo](https://huggingface.co/datasets/jp1924/KoreaSpeech) | |
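
As a rough illustration of the duration filter and the deduplication step from the processing pipeline described above, the following Python sketch applies both rules to in-memory records. It is not the authors' implementation. The record fields `duration` and `text` are hypothetical (the CSV schema is not documented here), and the sketch assumes that "more than 20 times" means keeping the first 20 occurrences of a transcript and dropping the rest; the actual pipeline may instead drop such texts entirely.

```python
from collections import Counter
from typing import Iterable

MIN_DURATION_S = 0.5   # from the README: remove segments shorter than 0.5 s
MAX_DURATION_S = 30.0  # from the README: remove segments longer than 30 s
MAX_TEXT_REPEATS = 20  # from the README: threshold for duplicate texts


def filter_segments(segments: Iterable[dict]) -> list[dict]:
    """Apply the duration filter, then cap how often one transcript may repeat.

    Each segment is a dict with hypothetical keys "duration" (seconds)
    and "text" (transcript).
    """
    # Step 1: keep only segments whose duration lies in [0.5 s, 30 s].
    in_range = [
        seg for seg in segments
        if MIN_DURATION_S <= seg["duration"] <= MAX_DURATION_S
    ]

    # Step 2 (assumed interpretation): keep at most MAX_TEXT_REPEATS
    # occurrences of any identical transcript, in input order.
    seen: Counter = Counter()
    kept = []
    for seg in in_range:
        seen[seg["text"]] += 1
        if seen[seg["text"]] <= MAX_TEXT_REPEATS:
            kept.append(seg)
    return kept
```

Calling `filter_segments` on a list of such dicts returns the surviving records in their original order. Speaking-rate, `langdetect`, and DNSMOS filtering are left out because their thresholds are not stated above.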

Provider:
maas
Created:
2026-04-19