Reubencf/multilingual-synthetic-tts

Name: Reubencf/multilingual-synthetic-tts
Creator: Reubencf
Published: 2026-04-15 10:02:44
License: No description available

Hugging Face · Updated 2026-04-15 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/Reubencf/multilingual-synthetic-tts


Resource description:

---
task_categories:
- text-to-speech
- automatic-speech-recognition
language:
- ja
- de
- ru
- es
- ko
- pt
- zh
- en
- fr
size_categories:
- 10K<n<100K
tags:
- synthetic
- voice-cloning
- qwen3-tts
- multilingual
- tts
pretty_name: Multilingual Synthetic TTS (Qwen3)
---

# Multilingual Synthetic TTS Dataset

> 🏆 **Submitted to the [Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)
> hosted by [Adaption Labs](https://www.adaptionlabs.ai)**. Credit to
> **Adaptive Data by Adaption** for organizing the hackathon.

A large-scale **synthetic multilingual speech dataset**: 68,677 clips across 9 languages, generated with [Qwen3-TTS-12Hz-1.7B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) using zero-shot voice cloning from 5 reference speakers.

Intended for training and evaluating **TTS**, **ASR**, **voice conversion**, and **multilingual speech** models. Each clip is paired with the ground-truth text and metadata (language, style, voice).

## Dataset Summary

- **Total clips**: 68,677
- **Languages**: 9
- **Voices**: 5 (zero-shot cloned)
- **Audio format**: WAV, 12 kHz mono
- **Sentence source**: LLM-generated prompts spanning conversational speech, informational/technical text, emotional utterances, and traditional proverbs

## Languages

| Code | Language | Clips |
|---|---|---|
| `ja` | Japanese | 13,971 |
| `de` | German | 8,998 |
| `ru` | Russian | 8,972 |
| `es` | Spanish | 8,000 |
| `ko` | Korean | 8,000 |
| `pt` | Portuguese | 5,536 |
| `zh` | Mandarin Chinese | 5,531 |
| `en` | English | 5,000 |
| `fr` | French | 4,669 |

## Styles

| Style | Clips |
|---|---|
| `conversational` | 14,860 |
| `informational` | 14,102 |
| `emotional` | 13,378 |
| `technical` | 13,309 |
| `proverbs` | 13,028 |

Styles cover a broad tonal range so the dataset is useful for both neutral TTS training and expressive voice work.
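The per-language and per-style counts in the Languages and Styles tables can be cross-checked against the stated total of 68,677 clips. The following is a minimal sketch: the counts are copied by hand from the tables, the variable names are illustrative, and nothing is read from the dataset itself.

```python
# Clip counts copied by hand from the Languages and Styles tables
# (assumption: the tables are the source of truth; no data is downloaded).
language_clips = {
    "ja": 13_971, "de": 8_998, "ru": 8_972, "es": 8_000, "ko": 8_000,
    "pt": 5_536, "zh": 5_531, "en": 5_000, "fr": 4_669,
}
style_clips = {
    "conversational": 14_860, "informational": 14_102, "emotional": 13_378,
    "technical": 13_309, "proverbs": 13_028,
}
TOTAL_CLIPS = 68_677  # total stated in the Dataset Summary

language_total = sum(language_clips.values())
style_total = sum(style_clips.values())

# Percentage of the dataset contributed by each language, one decimal place.
language_share = {
    code: round(100 * clips / TOTAL_CLIPS, 1)
    for code, clips in language_clips.items()
}

print("language total:", language_total, "style total:", style_total)
for code, pct in language_share.items():
    print(f"{code}: {pct}%")
```

Both breakdowns sum to the stated total. The shares also show that the dataset is not balanced across languages, which may matter when sampling training batches.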

## Voices

| Voice | Clips |
|---|---|
| `german_woman` | 18,573 |
| `american_boy` | 13,900 |
| `japanese_man` | 12,511 |
| `japanese_woman` | 11,861 |
| `russian_man` | 11,832 |

Each reference voice was used to speak sentences in every language, demonstrating Qwen3-TTS's cross-lingual voice-cloning capability.

## Schema

| Field | Type | Description |
|---|---|---|
| `audio` | `Audio` | WAV waveform, resampled to 12 kHz by `datasets` |
| `text` | `string` | Ground-truth transcript |
| `language` | `string` | ISO 639-1 code (e.g. `en`, `ja`, `de`) |
| `language_name` | `string` | Full language name |
| `style` | `string` | Speech register / topic (conversational, technical, emotional, proverbs, informational) |
| `voice` | `string` | Reference voice identifier |
| `sample_rate` | `int32` | Source generation rate (native 24 kHz; audio column resamples to 12 kHz) |

## Loading

```python
from datasets import load_dataset

ds = load_dataset("Reubencf/multilingual-synthetic-tts", split="train")
print(ds[0])

# Filter by language
ja = ds.filter(lambda x: x["language"] == "ja")

# Iterate audio
for row in ds:
    wav = row["audio"]["array"]          # numpy float32
    sr = row["audio"]["sampling_rate"]   # 12000
    txt = row["text"]
```

## Generation Pipeline

1. **Sentence generation**: topic-diverse prompts generated by `gemini-flash-latest`, covering conversational, informational, technical, emotional, and proverb-style utterances. The sentences are then translated or localized for each target language.
2. **Voice cloning synthesis**: Qwen3-TTS-12Hz-1.7B-Base running on 2× H100 (multi-GPU spawn, batch size 32), with a rotating pool of reference speakers for cross-lingual cloning.
3. **Metadata**: every clip is written alongside a manifest entry capturing language, style, voice, and sample rate.

## Intended Uses

- **TTS training / fine-tuning**: broad multilingual coverage with consistent speaker identities across languages.
- **ASR data augmentation**: synthetic speech with noise-free transcripts.
- **Voice conversion / cloning research**: each voice is represented across all supported languages, enabling cross-lingual speaker-identity studies.
- **Speech-LM evaluation**: paired (text, audio) supervision in 9 languages.

## Limitations

- **Synthetic voices**: clones of a small reference pool, not demographically representative.
- **Single acoustic condition**: clean, studio-like. No noise, reverb, or real-room artifacts.
- **Model-specific artifacts**: occasional mis-pronunciations or prosody issues inherent to the TTS backbone.

## License

Synthetic audio released for research and non-commercial use. Reference speakers consented to voice cloning for dataset creation. Users should comply with the [Qwen3-TTS model license](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) for downstream applications.

## Citation

If you use this dataset, please cite:

```
@dataset{multilingual_synthetic_tts_2026,
  title  = {Multilingual Synthetic TTS (Qwen3)},
  author = {Fernandes, Reuben},
  year   = {2026},
  url    = {https://huggingface.co/datasets/Reubencf/multilingual-synthetic-tts}
}
```
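
For the cross-lingual speaker-identity use case listed under Intended Uses, it can help to draw the same number of clips for every (language, voice) pair, since the voice and language counts are uneven. The sketch below is one possible approach and not part of the dataset's tooling: `balanced_indices` is a hypothetical helper, and the toy rows are made-up stand-ins that only mimic the `language` and `voice` fields from the Schema.

```python
import random
from collections import defaultdict


def balanced_indices(rows, per_group, seed=0):
    """Pick up to `per_group` row indices for each (language, voice) pair.

    `rows` is any iterable of dict-like records with "language" and "voice"
    keys. This helper is hypothetical and only illustrates the idea.
    """
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[(row["language"], row["voice"])].append(idx)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    picked = []
    for key in sorted(groups):
        members = groups[key]
        # Groups smaller than `per_group` contribute all of their rows.
        picked.extend(rng.sample(members, min(per_group, len(members))))
    return sorted(picked)


# Toy metadata standing in for real rows (values are made up for the demo).
toy_rows = [
    {"language": "ja", "voice": "german_woman"},
    {"language": "ja", "voice": "german_woman"},
    {"language": "ja", "voice": "german_woman"},
    {"language": "de", "voice": "russian_man"},
    {"language": "de", "voice": "russian_man"},
    {"language": "en", "voice": "american_boy"},
]
subset = balanced_indices(toy_rows, per_group=2)
print(subset)
```

With the real dataset, something like `ds.select(balanced_indices(ds.remove_columns("audio"), per_group=100))` should group on metadata without decoding audio. That call pattern is an assumption and has not been run against the actual data.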

