Reubencf/multilingual-synthetic-tts
Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Reubencf/multilingual-synthetic-tts
Resource description:
---
task_categories:
- text-to-speech
- automatic-speech-recognition
language:
- ja
- de
- ru
- es
- ko
- pt
- zh
- en
- fr
size_categories:
- 10K<n<100K
tags:
- synthetic
- voice-cloning
- qwen3-tts
- multilingual
- tts
pretty_name: Multilingual Synthetic TTS (Qwen3)
---
# Multilingual Synthetic TTS Dataset
> 🏆 **Submitted to the [Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)
> hosted by [Adaption Labs](https://www.adaptionlabs.ai)** — credit to
> **Adaptive Data by Adaption** for organizing the hackathon.

A large-scale **synthetic multilingual speech dataset**: 68,677 clips across
9 languages, generated with [Qwen3-TTS-12Hz-1.7B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base)
using zero-shot voice cloning from 5 reference speakers.

Intended for training and evaluating **TTS**, **ASR**, **voice conversion**, and
**multilingual speech** models. Each clip is paired with its ground-truth text
and metadata (language, style, voice).

## Dataset Summary
- **Total clips**: 68,677
- **Languages**: 9
- **Voices**: 5 (zero-shot cloned)
- **Audio format**: WAV, 12 kHz mono
- **Sentence source**: LLM-generated prompts spanning conversational speech,
informational/technical text, emotional utterances, and traditional proverbs
## Languages
| Code | Language | Clips |
|---|---|---|
| `ja` | Japanese | 13,971 |
| `de` | German | 8,998 |
| `ru` | Russian | 8,972 |
| `es` | Spanish | 8,000 |
| `ko` | Korean | 8,000 |
| `pt` | Portuguese | 5,536 |
| `zh` | Mandarin Chinese | 5,531 |
| `en` | English | 5,000 |
| `fr` | French | 4,669 |
## Styles
| Style | Clips |
|---|---|
| `conversational` | 14,860 |
| `informational` | 14,102 |
| `emotional` | 13,378 |
| `technical` | 13,309 |
| `proverbs` | 13,028 |

Styles cover a broad tonal range, so the dataset is useful for both neutral TTS
training and expressive voice work.

## Voices
| Voice | Clips |
|---|---|
| `german_woman` | 18,573 |
| `american_boy` | 13,900 |
| `japanese_man` | 12,511 |
| `japanese_woman` | 11,861 |
| `russian_man` | 11,832 |

Each reference voice was used to speak sentences in every language,
demonstrating Qwen3-TTS's cross-lingual voice-cloning capability.

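
The language, style, and voice breakdowns can be cross-checked against the stated total of 68,677 clips. The following is a minimal sketch: the dictionaries copy the figures from the tables in this card, while `count_by` and `tables_are_consistent` are illustrative helper names, not part of the dataset or of the `datasets` library.

```python
from collections import Counter

PUBLISHED_TOTAL = 68_677

# Figures copied from the Languages, Styles, and Voices tables of this card.
LANGUAGE_CLIPS = {
    "ja": 13_971, "de": 8_998, "ru": 8_972, "es": 8_000, "ko": 8_000,
    "pt": 5_536, "zh": 5_531, "en": 5_000, "fr": 4_669,
}
STYLE_CLIPS = {
    "conversational": 14_860, "informational": 14_102, "emotional": 13_378,
    "technical": 13_309, "proverbs": 13_028,
}
VOICE_CLIPS = {
    "german_woman": 18_573, "american_boy": 13_900, "japanese_man": 12_511,
    "japanese_woman": 11_861, "russian_man": 11_832,
}


def count_by(rows, field):
    """Count rows per distinct value of a metadata field (hypothetical helper)."""
    return Counter(row[field] for row in rows)


def tables_are_consistent():
    """Check that each published breakdown sums to the published total."""
    tables = (LANGUAGE_CLIPS, STYLE_CLIPS, VOICE_CLIPS)
    return all(sum(table.values()) == PUBLISHED_TOTAL for table in tables)


print(tables_are_consistent())
```

With the dataset loaded as `ds`, a comparison such as `count_by(ds.select_columns(["language"]), "language") == LANGUAGE_CLIPS` should hold if the hosted data matches the card. Selecting only the metadata column avoids decoding audio during the count.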
## Schema
| Field | Type | Description |
|---|---|---|
| `audio` | `Audio` | WAV waveform, resampled to 12 kHz by `datasets` |
| `text` | `string` | Ground-truth transcript |
| `language` | `string` | ISO 639-1 code (e.g. `en`, `ja`, `de`) |
| `language_name` | `string` | Full language name |
| `style` | `string` | Speech register / topic (conversational, technical, emotional, proverbs, informational) |
| `voice` | `string` | Reference voice identifier |
| `sample_rate` | `int32` | Source generation rate (native 24 kHz; audio column resamples to 12 kHz) |
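
A row can be checked against the schema before it enters a training pipeline. This is a sketch under the assumption that rows arrive as plain dictionaries keyed by the field names in the table; `validate_row` is a hypothetical helper, and the allowed values are the language codes, styles, and voices listed in this card.

```python
EXPECTED_FIELDS = (
    "audio", "text", "language", "language_name", "style", "voice", "sample_rate",
)
LANGUAGES = frozenset({"ja", "de", "ru", "es", "ko", "pt", "zh", "en", "fr"})
STYLES = frozenset(
    {"conversational", "informational", "emotional", "technical", "proverbs"}
)
VOICES = frozenset(
    {"german_woman", "american_boy", "japanese_man", "japanese_woman", "russian_man"}
)


def validate_row(row):
    """Return a list of problems; an empty list means the row matches the card."""
    problems = [f"missing field: {name}" for name in EXPECTED_FIELDS if name not in row]
    if problems:
        return problems
    if not isinstance(row["text"], str) or not row["text"].strip():
        problems.append("text is empty or not a string")
    # Categorical fields must use one of the values documented in the card.
    for field, allowed in (("language", LANGUAGES), ("style", STYLES), ("voice", VOICES)):
        if row[field] not in allowed:
            problems.append(f"unexpected {field}: {row[field]!r}")
    return problems
```

Returning a list of messages rather than raising makes it easy to log every problem in a batch and decide afterwards whether to drop or repair the affected rows.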
## Loading
```python
from datasets import load_dataset

ds = load_dataset("Reubencf/multilingual-synthetic-tts", split="train")
print(ds[0])

# Filter by language
ja = ds.filter(lambda x: x["language"] == "ja")

# Iterate audio
for row in ds:
    wav = row["audio"]["array"]          # numpy float32
    sr = row["audio"]["sampling_rate"]   # 12000
    txt = row["text"]
```
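
To save a clip from the loop above to disk, the standard-library `wave` module is enough. The sketch below assumes the decoded samples are mono floats in the range [-1.0, 1.0], as the `array` field is described above; `write_wav` is an illustrative helper name, and 16-bit PCM is an assumed output format rather than something the card specifies.

```python
import array
import sys
import wave


def write_wav(samples, sampling_rate, target):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV.

    `target` may be a file path or a seekable binary file object.
    """
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV stores PCM samples little-endian
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())
```

Inside the loop this could be called as `write_wav(wav, sr, "clip_000000.wav")`, where the file name is a placeholder.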
## Generation Pipeline
1. **Sentence generation** — topic-diverse prompts generated by
`gemini-flash-latest`, covering conversational, informational, technical,
emotional, and proverb-style utterances. Translated / localized per target language.
2. **Voice cloning synthesis** — Qwen3-TTS-12Hz-1.7B-Base running on 2× H100
(multi-GPU spawn, batch size 32), with a rotating pool of reference
speakers for cross-lingual cloning.
3. **Metadata** — every clip is written alongside a manifest entry capturing
language, style, voice, and sample rate.
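
Steps 2 and 3 can be pictured with a small sketch. Everything here is an assumption made for illustration: the card does not publish the pipeline code, so the `ManifestEntry` fields beyond language, style, voice, and sample rate, the `clips/` path scheme, the JSON Lines format, and the strict round-robin reading of "rotating pool" are all hypothetical. The uneven per-voice counts in the Voices table suggest the real rotation was not a strict round-robin.

```python
import itertools
import json
from dataclasses import asdict, dataclass


@dataclass
class ManifestEntry:
    path: str         # hypothetical output location of the clip
    text: str
    language: str
    style: str
    voice: str
    sample_rate: int


def build_manifest(sentences, voices, sample_rate=24000):
    """Assign reference voices round-robin and build one entry per sentence.

    `sentences` is an iterable of (text, language, style) tuples. The default
    sample rate follows the native 24 kHz generation rate stated in the schema.
    """
    pool = itertools.cycle(voices)
    return [
        ManifestEntry(
            path=f"clips/{index:06d}.wav",
            text=text,
            language=language,
            style=style,
            voice=next(pool),
            sample_rate=sample_rate,
        )
        for index, (text, language, style) in enumerate(sentences)
    ]


def to_jsonl(entries):
    """Serialize entries as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in entries)
```

Keeping the manifest as one JSON object per line means a partially completed run still leaves a readable file, which is convenient when synthesis is spread over several GPU workers.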
## Intended Uses
- **TTS training / fine-tuning** — broad multilingual coverage with consistent
speaker identities across languages.
- **ASR data augmentation** — synthetic speech with noise-free transcripts.
- **Voice conversion / cloning research** — each voice is represented across
all supported languages, enabling cross-lingual speaker-identity studies.
- **Speech-LM evaluation** — paired (text, audio) supervision in 9 languages.
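
For cross-lingual speaker-identity studies, one simple setup is to hold out a whole voice (or a whole language) for evaluation. The helper below is a sketch over plain row dictionaries; `hold_out` is a hypothetical name, and the choice of which voice to hold out is up to the experiment.

```python
def hold_out(rows, field, value):
    """Split rows into (kept, held_out) by exact match on a metadata field."""
    kept, held_out = [], []
    for row in rows:
        (held_out if row[field] == value else kept).append(row)
    return kept, held_out


rows = [
    {"text": "Hello", "language": "en", "voice": "american_boy"},
    {"text": "Hallo", "language": "de", "voice": "german_woman"},
    {"text": "Hola", "language": "es", "voice": "american_boy"},
]
train_rows, eval_rows = hold_out(rows, "voice", "german_woman")
print(len(train_rows), len(eval_rows))
```

On a loaded `Dataset`, the same split can be expressed with two `ds.filter` calls on the `voice` column, which avoids materializing rows in Python lists.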
## Limitations
- **Synthetic voices**: clones of a small reference pool — not demographically
representative.
- **Single acoustic condition**: clean, studio-like. No noise, reverb, or
real-room artifacts.
- **Model-specific artifacts**: occasional mis-pronunciations or prosody
issues inherent to the TTS backbone.
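
Because every clip is clean, users doing ASR augmentation may want to add noise themselves. The following dependency-free sketch mixes white Gaussian noise into a clip at a chosen signal-to-noise ratio. It is not part of the dataset's pipeline; `add_white_noise` is a hypothetical helper, and for real workloads a vectorized NumPy version and recorded noise would likely be preferable.

```python
import math
import random


def add_white_noise(samples, snr_db, seed=0):
    """Return a copy of `samples` with white Gaussian noise at `snr_db` dB SNR."""
    if not samples:
        raise ValueError("samples must not be empty")
    rng = random.Random(seed)  # seeded for reproducible augmentation
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


# One second of a 220 Hz tone at 12 kHz stands in for a decoded clip.
clean = [0.5 * math.sin(2 * math.pi * 220 * n / 12000) for n in range(12000)]
noisy = add_white_noise(clean, snr_db=20)
```

Lower `snr_db` values give noisier audio. The transcript stays unchanged, so the augmented clip can be paired with the original `text` field.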
## License
Synthetic audio released for research and non-commercial use. Reference
speakers consented to voice cloning for dataset creation. Users should
comply with the [Qwen3-TTS model license](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base)
for downstream applications.
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{multilingual_synthetic_tts_2026,
title = {Multilingual Synthetic TTS (Qwen3)},
author = {Fernandes, Reuben},
year = {2026},
url = {https://huggingface.co/datasets/Reubencf/multilingual-synthetic-tts}
}
```
Provider: Reubencf



