
Reubencf/multilingual-synthetic-tts

Hugging Face · Updated 2026-04-15 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Reubencf/multilingual-synthetic-tts
Description:
---
task_categories:
- text-to-speech
- automatic-speech-recognition
language:
- ja
- de
- ru
- es
- ko
- pt
- zh
- en
- fr
size_categories:
- 10K<n<100K
tags:
- synthetic
- voice-cloning
- qwen3-tts
- multilingual
- tts
pretty_name: Multilingual Synthetic TTS (Qwen3)
---

# Multilingual Synthetic TTS Dataset

> 🏆 **Submitted to the [Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)
> hosted by [Adaption Labs](https://www.adaptionlabs.ai)**, with credit to
> **Adaptive Data by Adaption** for organizing the hackathon.

A large-scale **synthetic multilingual speech dataset**: 68,677 clips across 9 languages, generated with [Qwen3-TTS-12Hz-1.7B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) using zero-shot voice cloning from 5 reference speakers.

Intended for training and evaluating **TTS**, **ASR**, **voice conversion**, and **multilingual speech** models. Each clip is paired with the ground-truth text and metadata (language, style, voice).

## Dataset Summary

- **Total clips**: 68,677
- **Languages**: 9
- **Voices**: 5 (zero-shot cloned)
- **Audio format**: WAV, 12 kHz mono
- **Sentence source**: LLM-generated prompts spanning conversational speech, informational/technical text, emotional utterances, and traditional proverbs

## Languages

| Code | Language | Clips |
|---|---|---|
| `ja` | Japanese | 13,971 |
| `de` | German | 8,998 |
| `ru` | Russian | 8,972 |
| `es` | Spanish | 8,000 |
| `ko` | Korean | 8,000 |
| `pt` | Portuguese | 5,536 |
| `zh` | Mandarin Chinese | 5,531 |
| `en` | English | 5,000 |
| `fr` | French | 4,669 |

## Styles

| Style | Clips |
|---|---|
| `conversational` | 14,860 |
| `informational` | 14,102 |
| `emotional` | 13,378 |
| `technical` | 13,309 |
| `proverbs` | 13,028 |

Styles cover a broad tonal range, so the dataset is useful for both neutral TTS training and expressive voice work.
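
The per-language and per-style counts above can be cross-checked against the stated total of 68,677 clips. The sketch below is only an illustration: the numbers are copied from the two tables, while the names `LANGUAGE_CLIPS`, `STYLE_CLIPS`, and `shares` are hypothetical helpers, not part of the dataset or of any library.

```python
# Clip counts copied from the Languages and Styles tables of the card.
LANGUAGE_CLIPS = {
    "ja": 13_971, "de": 8_998, "ru": 8_972, "es": 8_000, "ko": 8_000,
    "pt": 5_536, "zh": 5_531, "en": 5_000, "fr": 4_669,
}
STYLE_CLIPS = {
    "conversational": 14_860, "informational": 14_102, "emotional": 13_378,
    "technical": 13_309, "proverbs": 13_028,
}
TOTAL_CLIPS = 68_677  # total stated in the Dataset Summary


def shares(counts):
    """Return each category's fraction of the total, largest first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {name: n / total for name, n in ordered}


# Both breakdowns should add up to the same stated total.
assert sum(LANGUAGE_CLIPS.values()) == TOTAL_CLIPS
assert sum(STYLE_CLIPS.values()) == TOTAL_CLIPS

for code, fraction in shares(LANGUAGE_CLIPS).items():
    print(f"{code}: {fraction:.1%}")
```

Both breakdowns sum to the stated total. Japanese accounts for roughly a fifth of the clips and French for under 7%, so language-balanced sampling may be worth considering when training on the full set.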

## Voices

| Voice | Clips |
|---|---|
| `german_woman` | 18,573 |
| `american_boy` | 13,900 |
| `japanese_man` | 12,511 |
| `japanese_woman` | 11,861 |
| `russian_man` | 11,832 |

Each reference voice was used to speak sentences in every language, demonstrating Qwen3-TTS's cross-lingual voice-cloning capability.

## Schema

| Field | Type | Description |
|---|---|---|
| `audio` | `Audio` | WAV waveform, resampled to 12 kHz by `datasets` |
| `text` | `string` | Ground-truth transcript |
| `language` | `string` | ISO 639-1 code (e.g. `en`, `ja`, `de`) |
| `language_name` | `string` | Full language name |
| `style` | `string` | Speech register / topic (conversational, technical, emotional, proverbs, informational) |
| `voice` | `string` | Reference voice identifier |
| `sample_rate` | `int32` | Source generation rate (native 24 kHz; audio column resamples to 12 kHz) |

## Loading

```python
from datasets import load_dataset

ds = load_dataset("Reubencf/multilingual-synthetic-tts", split="train")
print(ds[0])

# Filter by language
ja = ds.filter(lambda x: x["language"] == "ja")

# Iterate audio
for row in ds:
    wav = row["audio"]["array"]          # numpy float32
    sr = row["audio"]["sampling_rate"]   # 12000
    txt = row["text"]
```

## Generation Pipeline

1. **Sentence generation**: topic-diverse prompts generated by `gemini-flash-latest`, covering conversational, informational, technical, emotional, and proverb-style utterances. Translated / localized per target language.
2. **Voice cloning synthesis**: Qwen3-TTS-12Hz-1.7B-Base running on 2× H100 (multi-GPU spawn, batch size 32), with a rotating pool of reference speakers for cross-lingual cloning.
3. **Metadata**: every clip is written alongside a manifest entry capturing language, style, voice, and sample rate.

## Intended Uses

- **TTS training / fine-tuning**: broad multilingual coverage with consistent speaker identities across languages.
- **ASR data augmentation**: synthetic speech with noise-free transcripts.
- **Voice conversion / cloning research**: each voice is represented across all supported languages, enabling cross-lingual speaker-identity studies.
- **Speech-LM evaluation**: paired (text, audio) supervision in 9 languages.

## Limitations

- **Synthetic voices**: clones of a small reference pool, not demographically representative.
- **Single acoustic condition**: clean, studio-like. No noise, reverb, or real-room artifacts.
- **Model-specific artifacts**: occasional mispronunciations or prosody issues inherent to the TTS backbone.

## License

Synthetic audio released for research and non-commercial use. Reference speakers consented to voice cloning for dataset creation. Users should comply with the [Qwen3-TTS model license](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) for downstream applications.

## Citation

If you use this dataset, please cite:

```
@dataset{multilingual_synthetic_tts_2026,
  title  = {Multilingual Synthetic TTS (Qwen3)},
  author = {Fernandes, Reuben},
  year   = {2026},
  url    = {https://huggingface.co/datasets/Reubencf/multilingual-synthetic-tts}
}
```
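
For the cross-lingual speaker-identity studies mentioned under Intended Uses, one possible setup is to hold out an entire voice for evaluation and to inspect how clips are spread over (language, voice) pairs. The following is a minimal sketch under stated assumptions: it relies only on the `language` and `voice` fields from the schema, the helper names `split_by_voice` and `language_voice_counts` are hypothetical, and the toy rows are made-up stand-ins rather than real dataset entries.

```python
from collections import Counter


def split_by_voice(rows, held_out_voice):
    """Partition rows into (train, test) so the held-out voice appears only in test."""
    train, test = [], []
    for row in rows:
        (test if row["voice"] == held_out_voice else train).append(row)
    return train, test


def language_voice_counts(rows):
    """Count clips per (language, voice) pair."""
    return Counter((row["language"], row["voice"]) for row in rows)


# Toy rows mimicking the documented schema (audio omitted); not real data.
toy_rows = [
    {"text": "Hello.", "language": "en", "voice": "german_woman"},
    {"text": "Hallo.", "language": "de", "voice": "german_woman"},
    {"text": "Hello.", "language": "en", "voice": "russian_man"},
    {"text": "こんにちは。", "language": "ja", "voice": "american_boy"},
]

train_rows, test_rows = split_by_voice(toy_rows, held_out_voice="german_woman")
print(len(train_rows), len(test_rows))
print(language_voice_counts(toy_rows))
```

On the real dataset, the same predicate could presumably be applied through `ds.filter(lambda x: x["voice"] != "german_woman")`, which avoids materializing rows in Python lists.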

Provider:
Reubencf