zhaochenyang20/seed-tts-eval

Name: zhaochenyang20/seed-tts-eval
Creator: zhaochenyang20
Published: 2026-03-26 21:23:57
License: no description provided (the dataset card's front matter declares cc-by-4.0)

Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/zhaochenyang20/seed-tts-eval


Resource description:

---
license: cc-by-4.0
task_categories:
- text-to-speech
language:
- en
- zh
tags:
- tts
- speech-synthesis
- voice-cloning
- seed-tts-eval
- sglang
pretty_name: seed-tts-eval
size_categories:
- 1K<n<10K
---

# seed-tts-eval

A preprocessed copy of the [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval) test set, used by [SGLang Omni](https://github.com/sgl-project/sglang-omni) for TTS benchmarking (WER and speed evaluation).

We thank the researchers of ByteDance for releasing the original evaluation data and methodology. This dataset simply reorganizes their test sets into a single Hugging Face repository for convenience.

## Evaluation Sets

This dataset contains 5 evaluation sets across English and Chinese:

| # | File | Language | Samples | Columns | Difficulty | Description |
|---|---|---|---|---|---|---|
| 1 | `en/meta.lst` | English | 1,088 | 4 | Standard | Same-speaker voice cloning (CommonVoice) |
| 2 | `zh/meta.lst` | Chinese | 2,020 | 4 | Standard | Same-speaker voice cloning (DiDiSpeech-2) |
| 3 | `en/non_para_reconstruct_meta.lst` | English | 1,086 | 5 | Hard | Cross-speaker voice cloning |
| 4 | `zh/non_para_reconstruct_meta.lst` | Chinese | 2,018 | 5 | Hard | Cross-speaker voice cloning |
| 5 | `zh/hardcase.lst` | Chinese | 400 | 4 | Hard | Tongue twisters and repetition patterns |

Sets 1 and 2 (`en/meta.lst` and `zh/meta.lst`) are the standard evaluation sets used by SGLang Omni benchmarks.

Note: Hugging Face may display ~5K samples on this page. That number comes from the auto-detected `audiofolder` format counting every `.wav` file (both prompt wavs and target wavs) individually. The actual evaluation sample counts are listed in the table above.
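
The sample counts in the table above can be checked locally by counting the non-empty lines of each list file. The following is a minimal sketch, not part of the original dataset tooling; it assumes one sample per line and that the dataset was downloaded to a local directory named `seedtts_testset`. The helper name `count_samples` is hypothetical.

```python
from pathlib import Path


def count_samples(list_path):
    """Count non-empty lines in a .lst file (assumed: one sample per line)."""
    with open(list_path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


# Assumed local download directory; adjust to wherever the dataset lives.
root = Path("seedtts_testset")
list_files = [
    "en/meta.lst",
    "zh/meta.lst",
    "en/non_para_reconstruct_meta.lst",
    "zh/non_para_reconstruct_meta.lst",
    "zh/hardcase.lst",
]
for rel in list_files:
    path = root / rel
    if path.exists():  # skip silently when the dataset is not present
        print(f"{rel}: {count_samples(path)} samples")
```

If the counts differ from the table, the most likely cause is an incomplete download rather than the `audiofolder` figure shown on the dataset page, which counts `.wav` files instead of list entries.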

## File Format

### Standard sets (4 columns)

In `en/meta.lst`, `zh/meta.lst`, and `zh/hardcase.lst`, each line contains the following columns:

```
utterance_id | prompt_text | prompt_wav_path | target_text
```

| Column | Description |
|---|---|
| `utterance_id` | Unique sample identifier |
| `prompt_text` | Transcript of the prompt (reference) audio |
| `prompt_wav_path` | Relative path to the prompt audio file (e.g., `prompt-wavs/xxx.wav`) |
| `target_text` | Text to be synthesized by the TTS model |

Example line:

```
common_voice_en_10119832-common_voice_en_10119840|We asked over twenty different people, and they all said it was his.|prompt-wavs/common_voice_en_10119832.wav|Get the trust fund to the bank early.
```

### Cross-speaker sets (5 columns)

In `en/non_para_reconstruct_meta.lst` and `zh/non_para_reconstruct_meta.lst`, each line contains the following columns:

```
utterance_id | prompt_text | prompt_wav_path | target_text | target_wav_path
```

These files contain the same 4 columns as the standard sets, plus a 5th column:

| Column | Description |
|---|---|
| `target_wav_path` | Relative path to the ground-truth target audio (for reconstruction-based evaluation) |

In cross-speaker sets, the prompt speaker and the target speaker are different people, making voice cloning significantly harder.

## Set Details

### English Standard (`en/meta.lst`)

1,088 samples from [CommonVoice](https://commonvoice.mozilla.org/). The prompt audio and the target text come from the same speaker, testing parallel (same-speaker) voice cloning.

### Chinese Standard (`zh/meta.lst`)

2,020 samples from [DiDiSpeech-2](https://arxiv.org/abs/2010.14956). Same-speaker voice cloning, analogous to the English set.

### English Cross-Speaker (`en/non_para_reconstruct_meta.lst`)

1,086 samples. The prompt and target are from different speakers -- the model must synthesize the target text in the prompt speaker's voice, without having heard that speaker say anything similar. Shares the same target texts as set 1.
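
The 4-column and 5-column line formats described under File Format can be read with a single parser. This is a minimal sketch rather than the loader SGLang Omni uses; the names `Sample`, `parse_line`, and `load_list` are hypothetical, and it assumes that no text field contains a literal `|` character.

```python
from typing import List, NamedTuple, Optional


class Sample(NamedTuple):
    utterance_id: str
    prompt_text: str
    prompt_wav_path: str
    target_text: str
    target_wav_path: Optional[str] = None  # only set for 5-column files


def parse_line(line: str) -> Sample:
    """Split one pipe-delimited line into a Sample (4 or 5 fields)."""
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 fields, got {len(fields)}")
    return Sample(*fields)


def load_list(path) -> List[Sample]:
    """Parse every non-empty line of a .lst file."""
    with open(path, encoding="utf-8") as f:
        return [parse_line(line) for line in f if line.strip()]


# The example line from the File Format section.
example = (
    "common_voice_en_10119832-common_voice_en_10119840"
    "|We asked over twenty different people, and they all said it was his."
    "|prompt-wavs/common_voice_en_10119832.wav"
    "|Get the trust fund to the bank early."
)
sample = parse_line(example)
print(sample.prompt_wav_path)  # prompt-wavs/common_voice_en_10119832.wav
print(sample.target_wav_path)  # None, because this is a 4-column line
```

With both lists loaded, the statement that a cross-speaker set shares its target texts with the corresponding standard set could be checked by comparing the sets of `target_text` values from each file.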

### Chinese Cross-Speaker (`zh/non_para_reconstruct_meta.lst`)

2,018 samples. Cross-speaker Chinese evaluation, analogous to set 3. Shares the same target texts as set 2.

### Chinese Hard Cases (`zh/hardcase.lst`)

400 samples split into two categories:

- Tongue twisters (绕口令, `raokouling-*`): 200 samples with phonetically challenging sentences designed to stress-test pronunciation accuracy.
- Repetition patterns: 200 samples with repetitive or stutter-prone text patterns.

## Usage

```bash
# Download the full dataset
huggingface-cli download zhaochenyang20/seed-tts-eval \
  --repo-type dataset --local-dir seedtts_testset
```

For CI testing, a minimal subset is available at [`zhaochenyang20/seed-tts-eval-mini`](https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini).

## Directory Structure

```
seed-tts-eval/
├── en/
│   ├── meta.lst                        # Standard English eval (1,088 samples)
│   ├── non_para_reconstruct_meta.lst   # Cross-speaker English eval (1,086 samples)
│   ├── prompt-wavs/                    # Reference audio clips (1,007 files)
│   └── wavs/                           # Ground-truth target audio (1,092 files)
└── zh/
    ├── meta.lst                        # Standard Chinese eval (2,020 samples)
    ├── non_para_reconstruct_meta.lst   # Cross-speaker Chinese eval (2,018 samples)
    ├── hardcase.lst                    # Tongue twisters + repetition (400 samples)
    ├── prompt-wavs/                    # Reference audio clips (1,010 files)
    └── wavs/                           # Ground-truth target audio (2,020 files)
```

## Citation

If you use this dataset, please cite the original seed-tts-eval work:

```bibtex
@article{anastassiou2024seed,
  title={Seed-TTS: A Family of High-Quality Versatile Speech Generation Models},
  author={Anastassiou, Philip and others},
  journal={arXiv preprint arXiv:2406.02430},
  year={2024}
}
```
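
## Checking a Download

After downloading with the command in the Usage section, it can be useful to confirm that every prompt audio file referenced by a list file is present. The sketch below is an illustration only: the function name `missing_prompt_wavs` is hypothetical, and it assumes that `prompt_wav_path` is relative to the directory containing the list file, as the Directory Structure tree suggests (for example, `en/meta.lst` next to `en/prompt-wavs/`).

```python
from pathlib import Path


def missing_prompt_wavs(list_path):
    """Return prompt wav paths referenced in a .lst file that do not exist.

    Assumption: paths in the third column are relative to the directory
    that contains the list file.
    """
    list_path = Path(list_path)
    base = list_path.parent
    missing = []
    with open(list_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            fields = line.rstrip("\n").split("|")
            wav = base / fields[2].strip()  # third column: prompt_wav_path
            if not wav.is_file():
                missing.append(str(wav))
    return missing


# Assumed local download directory from the Usage section.
meta = Path("seedtts_testset") / "en" / "meta.lst"
if meta.exists():
    print(f"{len(missing_prompt_wavs(meta))} prompt wavs missing")
```

An empty result means every referenced prompt clip was found; it does not validate the audio content itself.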
