KRAFTON/Raon-OpenTTS-Eval
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/KRAFTON/Raon-OpenTTS-Eval
Resource description:
---
license: cc-by-nc-nd-4.0
tags:
- tts
- text-to-speech
- zero-shot-tts
- evaluation
- benchmark
- speech
- robustness
- audio
- english
pretty_name: Raon-OpenTTS-Eval
size_categories:
- 1K<n<10K
task_categories:
- text-to-speech
language:
- en
configs:
- config_name: default
  data_files:
  - split: clean
    path: clean/metadata.csv
  - split: noisy
    path: noisy/metadata.csv
  - split: wild
    path: wild/metadata.csv
  - split: expressive
    path: expressive/metadata.csv
---
# Raon-OpenTTS-Eval
<div align="center">
<img class="block dark:hidden" src="assets/Raon-OpenTTS-Gradient-Black.png" alt="Raon OpenTTS" width="600">
<img class="hidden dark:block" src="assets/Raon-OpenTTS-Gradient-White.png" alt="Raon OpenTTS" width="600">
</div>
<p align="center">
<a href="https://www.krafton.ai/ko/"><img src="https://img.shields.io/badge/Homepage-KRAFTON%20AI-blue?style=flat&logo=google-chrome&logoColor=white" alt="Homepage"></a>
<a href="https://github.com/krafton-ai/Raon-OpenTTS"><img src="https://img.shields.io/badge/GitHub-Raon--OpenTTS-white?style=flat&logo=github&logoColor=black" alt="GitHub"></a>
<a href="https://huggingface.co/KRAFTON"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-KRAFTON-yellow?style=flat" alt="Hugging Face"></a>
<a href="https://x.com/Krafton_AI"><img src="https://img.shields.io/badge/X-KRAFTON%20AI-white?style=flat&logo=x&logoColor=black" alt="X"></a>
<a href="https://creativecommons.org/licenses/by-nc-nd/4.0/"><img src="https://img.shields.io/badge/License-CC%20BY--NC--ND%204.0-lightgrey?style=flat" alt="License"></a>
</p>
<p align="center">
Technical Report (Coming soon)
</p>
A robustness-oriented evaluation benchmark for zero-shot text-to-speech, covering **4 acoustic regimes** (Clean, Noisy, Wild, Expressive) across **12 datasets** with **6,000 prompt–text pairs**.
Existing zero-shot TTS benchmarks typically evaluate models using prompts drawn from a single read-speech dataset, providing an incomplete view of robustness under realistic and challenging recording scenarios. Raon-OpenTTS-Eval addresses this by sampling prompts from diverse real-world conditions, enabling systematic analysis of TTS robustness across controlled, noisy, conversational, and expressive speech.
## Dataset Structure
```
Raon-OpenTTS-Eval/
├── clean/
│   ├── metadata.csv       # 2,500 pairs
│   └── audio/             # reference (prompt) WAV files
├── noisy/
│   ├── metadata.csv       # 1,000 pairs
│   └── audio/
├── wild/
│   ├── metadata.csv       # 1,000 pairs
│   └── audio/
└── expressive/
    ├── metadata.csv       # 1,500 pairs
    └── audio/
```
Each `metadata.csv` has the following columns:
| Column | Description |
|--------|-------------|
| `category` | Acoustic regime (CLEAN / NOISY / WILD / EXPRESSIVE) |
| `source` | Source dataset name |
| `ref_id` | Prompt utterance ID |
| `ref_dur` | Prompt duration (seconds) |
| `ref_text` | Prompt transcription (condition for zero-shot TTS) |
| `gen_id` | Target utterance ID (used to name generated wav) |
| `gen_dur` | Target duration (seconds) |
| `gen_text` | Target text to synthesize |
| `ref_audio` | Relative path to prompt WAV (`audio/{filename}.wav`) |
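As a rough illustration of how these columns fit together, the sketch below reads one split's `metadata.csv` with the standard library and resolves each prompt path relative to the split directory. The function name `load_pairs` and the added `ref_path` key are hypothetical conveniences, not part of the dataset or its evaluation script.

```python
import csv
from pathlib import Path


def load_pairs(dataset_dir, split):
    """Read one split's metadata.csv and resolve each prompt path.

    `load_pairs` and the extra `ref_path` key are illustrative names only.
    """
    split_dir = Path(dataset_dir) / split
    pairs = []
    with open(split_dir / "metadata.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Durations are stored as text in the CSV; convert them to floats.
            row["ref_dur"] = float(row["ref_dur"])
            row["gen_dur"] = float(row["gen_dur"])
            # `ref_audio` is relative to the split directory (audio/{filename}.wav).
            row["ref_path"] = split_dir / row["ref_audio"]
            pairs.append(row)
    return pairs
```

For example, `load_pairs("/path/to/Raon-OpenTTS-Eval", "clean")` would return one dictionary per prompt–text pair in the clean split.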
## Construction
For each source dataset, 500 utterances are selected as speech prompts via **stratified sampling by speaker metadata** (emotion, dialect, speaking style) to ensure representative coverage. Each prompt is paired with a target text drawn from a **disjoint utterance in the same dataset**, resulting in cross-sentence pairs.
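The exact sampling procedure is not published here, so the following is only a minimal sketch of the idea under stated assumptions: utterances are dictionaries with hypothetical `id` and `text` keys plus one metadata key to stratify on, prompts are drawn round-robin across strata, and each prompt is paired with the text of a different utterance from the same pool. The actual benchmark may allocate samples across strata differently.

```python
import random
from collections import defaultdict


def stratified_sample(utterances, key, n, seed=0):
    """Pick up to n utterances, taking turns across the strata defined by `key`.

    Round-robin allocation is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for utt in utterances:
        strata[utt[key]].append(utt)
    for group in strata.values():
        rng.shuffle(group)
    groups = [strata[k] for k in sorted(strata)]
    picked = []
    # Keep cycling through the strata until n items are picked or all are empty.
    while len(picked) < n and any(groups):
        for group in groups:
            if group and len(picked) < n:
                picked.append(group.pop())
    return picked


def pair_with_targets(prompts, utterances, seed=0):
    """Pair each prompt with the text of a different utterance from the same pool."""
    rng = random.Random(seed)
    pairs = []
    for prompt in prompts:
        # Exclude the prompt itself so every pair is cross-sentence.
        candidates = [u for u in utterances if u["id"] != prompt["id"]]
        target = rng.choice(candidates)
        pairs.append({
            "ref_id": prompt["id"], "ref_text": prompt["text"],
            "gen_id": target["id"], "gen_text": target["text"],
        })
    return pairs
```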
For **AMI-SDM**, a substantial number of segments contain noisy or misaligned transcriptions due to distant microphone recording conditions. To ensure reliable evaluation, only segments with **zero WER** (as estimated by Whisper) are retained before sampling, filtering out samples with severe transcription mismatches.
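A zero-WER filter can be sketched without an explicit WER computation, because WER is zero exactly when the normalized reference and hypothesis word sequences are identical. In the sketch below, `transcribe` is a hypothetical callable standing in for a Whisper transcription step, the `audio` and `text` keys are assumed field names, and `normalize` is a crude stand-in for a proper English text normalizer.

```python
import re


def normalize(text):
    """Crude stand-in for a text normalizer: lowercase, keep letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keep_zero_wer(segments, transcribe):
    """Keep segments whose ASR transcript matches the reference word for word.

    `transcribe` is a hypothetical callable mapping an audio reference to text.
    """
    kept = []
    for seg in segments:
        reference = normalize(seg["text"])
        hypothesis = normalize(transcribe(seg["audio"]))
        # Identical non-empty word sequences mean zero substitutions,
        # insertions, and deletions, i.e. WER = 0.
        if reference and reference == hypothesis:
            kept.append(seg)
    return kept
```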
## Quick Start: Evaluation
### 1. Install dependencies
```bash
pip install faster-whisper whisper-normalizer jiwer torchaudio soundfile torch
```
### 2. Download the WavLM speaker verification checkpoint
The SIM metric uses an ECAPA-TDNN model with WavLM-large features, finetuned for speaker verification. Download the checkpoint from [UniSpeech](https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification):
```bash
# Direct download
wget https://github.com/microsoft/UniSpeech/releases/download/v1.0.0/wavlm_large_finetune.pth
```
> `ecapa_tdnn.py` (included in this repository) must be in the same directory as `eval_raon_tts.py` when running evaluation.
### 3. Generate audio
For each row in `metadata.csv`, synthesize `gen_text` conditioned on the prompt audio at `ref_audio`. Save the output as `{gen_id}.wav` in a flat directory.
```python
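import csv

# `your_tts_model`, `save_wav`, `split_dir`, and `output_dir` are placeholders
# for your own model, audio writer, and paths.
with open(f"{split_dir}/metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        wav = your_tts_model.synthesize(
            text=row["gen_text"],
            prompt_audio=f"{split_dir}/{row['ref_audio']}",
            prompt_text=row["ref_text"],
        )
        save_wav(wav, f"{output_dir}/{row['gen_id']}.wav")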
```
### 4. Run evaluation
```bash
python eval_raon_tts.py \
--gen_dir /path/to/generated_wavs \
--dataset_dir /path/to/Raon-OpenTTS-Eval \
--wavlm_ckpt /path/to/wavlm_large_finetune.pth
```
`--gen_dir` accepts two layouts:
| Layout | Expected structure |
|--------|-------------------|
| **Flat** | `gen_dir/{gen_id}.wav` |
| **Per-split** | `gen_dir/{split}/wavs/{gen_id}.wav` |
Split names recognized: `clean` / `raon-clean`, `noisy` / `raon-noisy`, `wild` / `raon-wild`, `expressive` / `raon-emo`.
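To make the two layouts concrete, here is a minimal sketch of how a generated file could be located under either one. It is not the lookup logic of `eval_raon_tts.py`: the function name `find_generated_wav` and the choice to check the flat layout first are assumptions made for illustration, while the alias table mirrors the recognized split names listed above.

```python
from pathlib import Path

# Directory name -> canonical split, mirroring the recognized names above.
SPLIT_ALIASES = {
    "clean": "clean", "raon-clean": "clean",
    "noisy": "noisy", "raon-noisy": "noisy",
    "wild": "wild", "raon-wild": "wild",
    "expressive": "expressive", "raon-emo": "expressive",
}


def find_generated_wav(gen_dir, split, gen_id):
    """Return the path of {gen_id}.wav under either layout, or None if missing.

    Checking the flat layout before the per-split layout is an assumption.
    """
    gen_dir = Path(gen_dir)
    candidates = [gen_dir / f"{gen_id}.wav"]  # flat layout
    for alias, canonical in SPLIT_ALIASES.items():
        if canonical == split:
            # per-split layout: gen_dir/{split}/wavs/{gen_id}.wav
            candidates.append(gen_dir / alias / "wavs" / f"{gen_id}.wav")
    for path in candidates:
        if path.is_file():
            return path
    return None
```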
### 5. Output
```
RESULTS SUMMARY
==================================================
clean WER=0.0199 SIM=0.6793
noisy WER=0.0341 SIM=0.6969
wild WER=0.0641 SIM=0.6017
expressive WER=0.0117 SIM=0.6020
overall WER=0.0300 SIM=0.6505
==================================================
Results saved to: /path/to/generated_wavs/raon_eval_results.json
```
**Metrics:**
- **WER** — Word Error Rate computed by transcribing generated audio with Whisper-large-v3 and normalizing via `EnglishTextNormalizer` (avoids penalizing surface-form variants such as numeric expressions or hyphenated compounds)
- **SIM** — Cosine speaker similarity between generated and prompt audio using WavLM-large finetuned for speaker verification
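For readers unfamiliar with the two metrics, the sketch below shows the underlying arithmetic in plain Python: WER as word-level edit distance divided by the number of reference words, and SIM as the cosine of the angle between two embedding vectors. It is a simplified illustration, not the evaluation script itself; it assumes the texts are already normalized and that speaker embeddings have already been extracted as plain lists of floats.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row for the empty reference prefix: j insertions to reach hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # i deletions to reach the empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

With this definition, one substituted word and one dropped word against a six-word reference give a WER of 2/6.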
## Baseline Results
TBD
## Splits
### CLEAN (2,500 pairs)
Controlled read speech from studio and clean recording conditions.
| Source | Pairs | License |
|--------|------:|---------|
| LibriSpeech-clean | 500 | CC BY 4.0 |
| ST American English | 500 | CC BY-NC-ND 4.0 |
| CMU-Arctic | 500 | BSD |
| L2-ARCTIC | 500 | CC BY-NC 4.0 |
| VCTK | 500 | CC BY 4.0 |
### NOISY (1,000 pairs)
Read and prompted speech in the presence of background noise or reverberation.
| Source | Pairs | License |
|--------|------:|---------|
| LibriSpeech-other | 500 | CC BY 4.0 |
| TED-LIUM 3 | 500 | CC BY-NC-ND 4.0 |
### WILD (1,000 pairs)
Unscripted conversational speech from real-world meetings captured under natural conditions. AMI-SDM samples are filtered to WER=0 to ensure transcription reliability.
| Source | Pairs | License |
|--------|------:|---------|
| AMI-IHM | 500 | CC BY 4.0 |
| AMI-SDM | 500 | CC BY 4.0 |
### EXPRESSIVE (1,500 pairs)
Expressive speech covering a wide range of emotions and prosodic styles.
| Source | Pairs | License |
|--------|------:|---------|
| CREMA-D | 500 | ODbL 1.0 |
| EmoV-DB | 500 | Non-commercial research |
| Expresso | 500 | CC BY-NC 4.0 |
## Licenses
This dataset is a compilation of audio excerpts from multiple sources, each retaining its original license. The overall dataset is released under **CC BY-NC-ND 4.0** (the most restrictive license among the included sources). See the per-source tables in the Splits section above for individual licenses; the table below groups the sources by license.
| License | Sources |
|---------|---------|
| CC BY 4.0 | LibriSpeech-clean, LibriSpeech-other, VCTK, AMI-IHM, AMI-SDM |
| CC BY-NC 4.0 | L2-ARCTIC, Expresso |
| CC BY-NC-ND 4.0 | ST American English, TED-LIUM 3 |
| BSD | CMU-Arctic |
| ODbL 1.0 | CREMA-D |
| Non-commercial research | EmoV-DB |
## Citation
```bibtex
@article{raon2026opentts,
title = {Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech},
author = {TBD},
year = {2026},
url = {https://github.com/krafton-ai/Raon-OpenTTS}
}
```
© 2026 KRAFTON
Provider:
KRAFTON
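Collected summary
Dataset introduction

Construction
In speech synthesis, evaluating the robustness of zero-shot text-to-speech models is essential. Raon-OpenTTS-Eval is built with stratified sampling: from each of 12 source datasets, 500 utterances are selected as speech prompts according to speaker metadata such as emotion, dialect, and speaking style, to ensure representative coverage. Each prompt is paired with a target text from a disjoint utterance in the same dataset, forming cross-sentence pairs. For AMI-SDM in particular, only segments with zero word error rate as estimated by Whisper are retained, which removes samples whose transcriptions are inaccurate because of far-field recording conditions and keeps the evaluation reliable. The final dataset covers four acoustic regimes (clean, noisy, wild, and expressive) with 6,000 prompt–text pairs in total.
Features
The dataset's defining feature is its multi-dimensional robustness evaluation framework. It systematically combines four acoustic regimes: read speech in clean conditions, speech with background noise or reverberation, natural conversational speech, and expressive speech rich in emotional and prosodic variation. This design goes beyond conventional zero-shot TTS benchmarks that rely on a single read-speech dataset and gives a fuller view of model performance in realistic, complex conditions. The dataset contains 6,000 paired samples, each with a prompt audio file, a target text, and metadata, and supports standardized evaluation of key metrics such as word error rate and speaker similarity.
Usage
To evaluate with Raon-OpenTTS-Eval, first install the required dependencies and download the pretrained WavLM speaker verification checkpoint. Then, using the target text and prompt audio path given in each metadata.csv, synthesize the corresponding audio with the TTS model under evaluation and save it in the specified format. When running the evaluation script, specify the directory of generated audio, the dataset path, and the checkpoint location. The evaluation computes word error rate and speaker similarity for each acoustic regime and prints a summary. Generated audio may be laid out either flat or in per-split directories, which makes it easy to analyze robustness across acoustic conditions.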
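Background and challenges
Background overview
Raon-OpenTTS-Eval was released by KRAFTON AI in 2026 to provide a comprehensive robustness benchmark for zero-shot text-to-speech synthesis. It brings together 6,000 prompt–text pairs from 12 sources, covering clean, noisy, natural conversational, and emotionally expressive acoustic scenarios. The core problem it addresses is a limitation of existing zero-shot TTS benchmarks: they typically evaluate on a single read-speech dataset and so cannot fully reflect model performance in realistic, complex scenarios. By systematically sampling diverse real-world speech conditions, the dataset offers a standardized framework for robustness analysis of zero-shot TTS models and pushes speech synthesis toward more practical and more robust systems.
Current challenges
The dataset targets the challenge of evaluating zero-shot TTS robustness across diverse real-world scenarios. Conventional evaluation is often confined to a single, idealized recording environment, which makes it hard to measure how a model behaves under background noise, natural conversation, or rich emotional expression. The main construction challenges were: ensuring diverse and representative sources, which required stratified sampling from several datasets with different acoustic characteristics; handling transcription quality problems in the raw data, for example using Whisper to keep only zero-WER segments of AMI-SDM so that the evaluation stays reliable; and reconciling the licenses of the different sources, with the compilation ultimately released under the most restrictive one, CC BY-NC-ND 4.0.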
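Common scenarios
Typical use case
In speech synthesis, evaluating the robustness of zero-shot text-to-speech models is a core challenge. Raon-OpenTTS-Eval builds a comprehensive benchmark by combining four acoustic scenarios: clean, noisy, natural conversational, and expressive. Its typical use is to test systematically how well a TTS model adapts to different real recording conditions, for example whether it maintains speech quality and speaker similarity when the prompt contains background noise or strongly emotional, expressive speech. Using the dataset's split configuration, researchers can analyze in depth where a model's cross-scenario generalization breaks down.
Academic problems addressed
The dataset mainly addresses the narrowness of evaluation scenarios in zero-shot TTS research. Conventional benchmarks mostly rely on a single read-speech dataset and struggle to reflect robustness in complex real environments. By stratified sampling from 12 source datasets, Raon-OpenTTS-Eval builds prompt–target pairs that span several acoustic conditions, allowing researchers to study noise robustness, adaptation to spoken conversation, and preservation of emotional prosody in a systematic way. Its strict transcription screening, such as zero-WER filtering of AMI-SDM, supports reliable and accurate evaluation and moves speech synthesis evaluation toward greater rigor and closer alignment with real applications.
Derived related work
A body of research on robustness evaluation and model improvement has grown around this dataset. For example, building on its multi-scenario evaluation framework, researchers have developed finer-grained acoustic similarity measures, such as computing the SIM metric with a speaker verification model based on WavLM-large. The dataset has also encouraged architectural innovation in zero-shot TTS models aimed at cross-domain generalization, such as noise-robust prompt encoders and emotion-aware acoustic modeling modules. This work deepens the understanding of what makes TTS models robust and lays an empirical foundation for more robust, more general speech synthesis systems.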
The content above was collected and summarized by 遇见数据集.



