zhaochenyang20/seed-tts-eval
Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/zhaochenyang20/seed-tts-eval
Description:
---
license: cc-by-4.0
task_categories:
- text-to-speech
language:
- en
- zh
tags:
- tts
- speech-synthesis
- voice-cloning
- seed-tts-eval
- sglang
pretty_name: seed-tts-eval
size_categories:
- 1K<n<10K
---
# seed-tts-eval
A preprocessed copy of the [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval) test set, used by [SGLang Omni](https://github.com/sgl-project/sglang-omni) for TTS benchmarking (WER and speed evaluation).
We thank the researchers of ByteDance for releasing the original evaluation data and methodology. This dataset simply reorganizes their test sets into a single Hugging Face repository for convenience.
## Evaluation Sets
This dataset contains 5 evaluation sets across English and Chinese:
| # | File | Language | Samples | Columns | Difficulty | Description |
|---|---|---|---|---|---|---|
| 1 | `en/meta.lst` | English | 1,088 | 4 | Standard | Same-speaker voice cloning (CommonVoice) |
| 2 | `zh/meta.lst` | Chinese | 2,020 | 4 | Standard | Same-speaker voice cloning (DiDiSpeech-2) |
| 3 | `en/non_para_reconstruct_meta.lst` | English | 1,086 | 5 | Hard | Cross-speaker voice cloning |
| 4 | `zh/non_para_reconstruct_meta.lst` | Chinese | 2,018 | 5 | Hard | Cross-speaker voice cloning |
| 5 | `zh/hardcase.lst` | Chinese | 400 | 4 | Hard | Tongue twisters and repetition patterns |
Sets 1 and 2 (`en/meta.lst` and `zh/meta.lst`) are the standard evaluation sets used by SGLang Omni benchmarks.
Note: Hugging Face may display ~5K samples on this page. That number comes from the auto-detected `audiofolder` format counting every `.wav` file (both prompt wavs and target wavs) individually. The actual evaluation sample counts are listed in the table above.
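To confirm the counts in the table rather than the number shown on the page, one option is to count the lines in each `.lst` file. The sketch below assumes one sample per non-empty line and that the dataset has been downloaded to a local directory (`seedtts_testset` is just the directory name used in the Usage section). The function names are illustrative and not part of any official tooling.

```python
from pathlib import Path

# Sample counts copied from the table above.
EXPECTED_COUNTS = {
    "en/meta.lst": 1088,
    "zh/meta.lst": 2020,
    "en/non_para_reconstruct_meta.lst": 1086,
    "zh/non_para_reconstruct_meta.lst": 2018,
    "zh/hardcase.lst": 400,
}


def count_samples(lst_path):
    """Count non-empty lines in a .lst file (assumes one sample per line)."""
    with open(lst_path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def check_counts(root):
    """Return {relative_path: (found, expected)} for files whose count differs.

    Files that are not present under `root` are skipped.
    """
    mismatches = {}
    for rel_path, expected in EXPECTED_COUNTS.items():
        path = Path(root) / rel_path
        if not path.exists():
            continue  # not downloaded, nothing to compare
        found = count_samples(path)
        if found != expected:
            mismatches[rel_path] = (found, expected)
    return mismatches


if __name__ == "__main__":
    # An empty dict means every file that was found matches the table.
    print(check_counts("seedtts_testset"))
```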
## File Format
### Standard sets (4 columns)
In `en/meta.lst`, `zh/meta.lst`, and `zh/hardcase.lst`, each line contains the following columns:
```
utterance_id | prompt_text | prompt_wav_path | target_text
```
| Column | Description |
|---|---|
| `utterance_id` | Unique sample identifier |
| `prompt_text` | Transcript of the prompt (reference) audio |
| `prompt_wav_path` | Relative path to the prompt audio file (e.g., `prompt-wavs/xxx.wav`) |
| `target_text` | Text to be synthesized by the TTS model |
```
common_voice_en_10119832-common_voice_en_10119840|We asked over twenty different people, and they all said it was his.|prompt-wavs/common_voice_en_10119832.wav|Get the trust fund to the bank early.
```
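As a minimal sketch of reading this format, the example line above can be split on `|` into the four named columns. This assumes that the text fields never contain a literal `|` character, which the source does not state explicitly. `StandardSample` and `parse_standard_line` are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class StandardSample(NamedTuple):
    utterance_id: str
    prompt_text: str
    prompt_wav_path: str
    target_text: str


def parse_standard_line(line: str) -> StandardSample:
    """Split one 4-column line on '|' (assumes no '|' inside the text fields)."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    return StandardSample(*(field.strip() for field in fields))


EXAMPLE = (
    "common_voice_en_10119832-common_voice_en_10119840"
    "|We asked over twenty different people, and they all said it was his."
    "|prompt-wavs/common_voice_en_10119832.wav"
    "|Get the trust fund to the bank early."
)

sample = parse_standard_line(EXAMPLE)
print(sample.prompt_wav_path)  # prompt-wavs/common_voice_en_10119832.wav
print(sample.target_text)      # Get the trust fund to the bank early.
```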
### Cross-speaker sets (5 columns)
In `en/non_para_reconstruct_meta.lst` and `zh/non_para_reconstruct_meta.lst`, each line contains the following columns:
```
utterance_id | prompt_text | prompt_wav_path | target_text | target_wav_path
```
These files keep the four columns described above and add a fifth:
| Column | Description |
|---|---|
| `target_wav_path` | Relative path to the ground-truth target audio (for reconstruction-based evaluation) |
In cross-speaker sets, the prompt speaker and the target speaker are different people, making voice cloning significantly harder.
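Because the two layouts differ only in the optional fifth column, a single loader can plausibly handle both. The following is a sketch under the same assumption that fields are separated by `|` and contain no `|` themselves. `Sample` and `load_meta` are hypothetical names, not an API provided by the dataset or by SGLang Omni.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    utterance_id: str
    prompt_text: str
    prompt_wav_path: str
    target_text: str
    # Only present in the 5-column cross-speaker files.
    target_wav_path: Optional[str] = None


def load_meta(lst_path) -> List[Sample]:
    """Load a 4-column or 5-column .lst file into a list of Sample records."""
    samples = []
    with open(lst_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            fields = line.split("|")
            if len(fields) not in (4, 5):
                raise ValueError(
                    f"{lst_path}:{line_no}: expected 4 or 5 fields, got {len(fields)}"
                )
            samples.append(Sample(*fields))
    return samples
```

A record loaded from a standard set has `target_wav_path` set to `None`, so callers can branch on that field instead of tracking which file a sample came from.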
## Set Details
### English Standard (`en/meta.lst`)
1,088 samples from [CommonVoice](https://commonvoice.mozilla.org/). The prompt audio and the target text come from the same speaker, testing parallel (same-speaker) voice cloning.
### Chinese Standard (`zh/meta.lst`)
2,020 samples from [DiDiSpeech-2](https://arxiv.org/abs/2010.14956). Same-speaker voice cloning, analogous to the English set.
### English Cross-Speaker (`en/non_para_reconstruct_meta.lst`)
1,086 samples. The prompt and target come from different speakers: the model must synthesize the target text in the prompt speaker's voice, without having heard that speaker say anything similar. This set shares its target texts with set 1.
### Chinese Cross-Speaker (`zh/non_para_reconstruct_meta.lst`)
2,018 samples. Cross-speaker Chinese evaluation, analogous to set 3. Shares the same target texts as set 2.
### Chinese Hard Cases (`zh/hardcase.lst`)
400 samples split into two categories:
- Tongue twisters (绕口令, `raokouling-*`): 200 samples with phonetically challenging sentences designed to stress-test pronunciation accuracy.
- Repetition patterns: 200 samples with repetitive or stutter-prone text patterns.
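To report results separately for the two categories, samples can be grouped by identifier. The sketch below assumes that the `raokouling-*` pattern refers to the `utterance_id` prefix and that every other sample in `zh/hardcase.lst` belongs to the repetition category. The source does not give the identifier pattern for repetition samples, so the second assumption is a guess, and the identifiers in the usage lines are invented for illustration.

```python
from collections import Counter
from typing import Iterable


def hardcase_category(utterance_id: str) -> str:
    """Classify a hard-case sample by its utterance_id prefix.

    Assumption: anything not starting with 'raokouling-' is a repetition sample.
    """
    if utterance_id.startswith("raokouling-"):
        return "tongue_twister"
    return "repetition"


def count_categories(utterance_ids: Iterable[str]) -> Counter:
    """Count how many identifiers fall into each category."""
    return Counter(hardcase_category(uid) for uid in utterance_ids)


# Invented identifiers, for illustration only.
counts = count_categories(["raokouling-0001", "raokouling-0002", "other-0001"])
print(counts["tongue_twister"], counts["repetition"])  # 2 1
```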
## Usage
```bash
# Download the full dataset
huggingface-cli download zhaochenyang20/seed-tts-eval \
    --repo-type dataset --local-dir seedtts_testset
```
For CI testing, a minimal subset is available at [`zhaochenyang20/seed-tts-eval-mini`](https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini).
## Directory Structure
```
seed-tts-eval/
├── en/
│   ├── meta.lst                        # Standard English eval (1,088 samples)
│   ├── non_para_reconstruct_meta.lst   # Cross-speaker English eval (1,086 samples)
│   ├── prompt-wavs/                    # Reference audio clips (1,007 files)
│   └── wavs/                           # Ground-truth target audio (1,092 files)
└── zh/
    ├── meta.lst                        # Standard Chinese eval (2,020 samples)
    ├── non_para_reconstruct_meta.lst   # Cross-speaker Chinese eval (2,018 samples)
    ├── hardcase.lst                    # Tongue twisters + repetition (400 samples)
    ├── prompt-wavs/                    # Reference audio clips (1,010 files)
    └── wavs/                           # Ground-truth target audio (2,020 files)
```
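Given this layout, the relative paths in each `.lst` file appear to resolve against the language directory that contains the file (for example, `en/prompt-wavs/...` for entries in `en/meta.lst`). That is an inference from the directory tree, not something the source states. Under that assumption, a quick integrity check after downloading might look like the following sketch, where `missing_prompt_wavs` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List


def missing_prompt_wavs(lst_path) -> List[str]:
    """Return prompt wav paths listed in a .lst file that do not exist on disk.

    Assumption: paths are relative to the directory containing the .lst file.
    """
    lst_path = Path(lst_path)
    base_dir = lst_path.parent
    missing = []
    with open(lst_path, encoding="utf-8") as f:
        for line in f:
            fields = line.strip().split("|")
            if len(fields) < 4:
                continue  # blank or malformed line, ignored in this sketch
            prompt_wav = fields[2]  # third column: prompt_wav_path
            if not (base_dir / prompt_wav).is_file():
                missing.append(prompt_wav)
    return missing
```

An empty list for `missing_prompt_wavs("seedtts_testset/en/meta.lst")` would suggest that the download is complete for that set, assuming the local directory name from the Usage section.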
## Citation
If you use this dataset, please cite the original seed-tts-eval work:
```bibtex
@article{anastassiou2024seed,
title={Seed-TTS: A Family of High-Quality Versatile Speech Generation Models},
author={Anastassiou, Philip and others},
journal={arXiv preprint arXiv:2406.02430},
year={2024}
}
```
Provider:
zhaochenyang20