CapSpeech

Name: CapSpeech
Creator: maas
Published: 2025-10-04 16:46:39
License: No description provided

ModelScope community · Updated 2025-10-04 · Indexed 2025-08-30

Download link:

https://modelscope.cn/datasets/OpenSound/CapSpeech

Resource description:

# CapSpeech

Dataset used for the paper: ***CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech***

Please refer to the [CapSpeech](https://github.com/WangHelin1997/CapSpeech) repo for more details.

## Overview

🔥 CapSpeech is a new benchmark designed for style-captioned TTS (**CapTTS**) tasks, including style-captioned text-to-speech synthesis with sound effects (**CapTTS-SE**), accent-captioned TTS (**AccCapTTS**), emotion-captioned TTS (**EmoCapTTS**) and text-to-speech synthesis for chat agents (**AgentTTS**).
CapSpeech comprises over **10 million machine-annotated** audio-caption pairs and nearly **0.36 million human-annotated** audio-caption pairs. **3 new speech datasets** are specifically designed for the CapTTS-SE and AgentTTS tasks to enhance the benchmark's coverage of real-world scenarios.

![Overview](https://raw.githubusercontent.com/WangHelin1997/CapSpeech-demo/main/static/images/present.jpg)

## License

⚠️ All resources are under the [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license.

## Usage

You can use the dataset as follows:

```py
from datasets import load_dataset

# Load the entire dataset
dataset = load_dataset("OpenSound/CapSpeech")

# Load specific splits of the dataset, e.g.
train_pt = load_dataset("OpenSound/CapSpeech", split="train_PT")
test_agentdb = load_dataset("OpenSound/CapSpeech", split="test_AgentDB")

# View a single example
example = train_pt[0]
print(example)
```

## Dataset Structure

The dataset contains the following columns:

| Column | Type | Description |
|---------|------|-------------|
| source | string | Source dataset (e.g., gigaspeech, commonvoice, libritts-r) |
| audio_path | string | Relative audio path to identify the specific audio file |
| text | string | Transcription of the audio file |
| caption | string | Style caption of the audio file |
| speech_duration | float | Duration of the audio file |

The *audio_path* field contains relative paths. Please ensure they are correctly mapped to absolute paths in your environment.

### Dataset Descriptions

The dataset covers both pretraining (PT) and supervised fine-tuning (SFT) stages, as well as downstream tasks including CapTTS, CapTTS-SE, AccCapTTS, EmoCapTTS, and AgentTTS.
We also provide detailed annotations in the following links.

| Split | Description | Audio Source | Annotation Link |
|-------|-------------------|------------------------------------------|---------------------------|
| train_PT | Training data for *CapTTS* and *CapTTS-SE* used in the **PT** stage | [Emilia-EN](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07e8572f5f3be1dbd02ed3172a4d298f152), [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech), [CommonVoice](https://commonvoice.mozilla.org/en/datasets), [MLS-English](https://openslr.org/94/), [CapSpeech-PT-SEDB](https://huggingface.co/datasets/OpenSound/CapSpeech-PT-SEDB-Audio) | [CapSpeech-PT](https://huggingface.co/datasets/OpenSound/CapSpeech-PT) |
| validation_PT | Validation data for *CapTTS* and *CapTTS-SE* used in the **PT** stage | [Emilia-EN](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07e8572f5f3be1dbd02ed3172a4d298f152), [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech), [CommonVoice](https://commonvoice.mozilla.org/en/datasets), [MLS-English](https://openslr.org/94/), [CapSpeech-PT-SEDB](https://huggingface.co/datasets/OpenSound/CapSpeech-PT-SEDB-Audio) | [CapSpeech-PT](https://huggingface.co/datasets/OpenSound/CapSpeech-PT) |
| test_PT | Test data for *CapTTS* and *CapTTS-SE* used in the **PT** stage | [Emilia-EN](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07e8572f5f3be1dbd02ed3172a4d298f152), [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech), [CommonVoice](https://commonvoice.mozilla.org/en/datasets), [MLS-English](https://openslr.org/94/), [CapSpeech-PT-SEDB](https://huggingface.co/datasets/OpenSound/CapSpeech-PT-SEDB-Audio) | [CapSpeech-PT](https://huggingface.co/datasets/OpenSound/CapSpeech-PT) |
| train_PT_CapTTS | Training data for *CapTTS* used in the **PT** stage | [Emilia-EN](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07e8572f5f3be1dbd02ed3172a4d298f152), [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech), [CommonVoice](https://commonvoice.mozilla.org/en/datasets), [MLS-English](https://openslr.org/94/) | [CapSpeech-PT](https://huggingface.co/datasets/OpenSound/CapSpeech-PT) |
| validation_PT_CapTTS | Validation data for *CapTTS* used in the **PT** stage | [Emilia-EN](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07e8572f5f3be1dbd02ed3172a4d298f152), [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech), [CommonVoice](https://commonvoice.mozilla.org/en/datasets), [MLS-English](https://openslr.org/94/) | [CapSpeech-PT](https://huggingface.co/datasets/OpenSound/CapSpeech-PT) |
| test_PT_CapTTS | Test data for *CapTTS* used in the **PT** stage | [Emilia-EN](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07e8572f5f3be1dbd02ed3172a4d298f152), [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech), [CommonVoice](https://commonvoice.mozilla.org/en/datasets), [MLS-English](https://openslr.org/94/) | [CapSpeech-PT](https://huggingface.co/datasets/OpenSound/CapSpeech-PT) |
| train_PT_SEDB | Training data for *CapTTS-SE* used in the **PT** stage | [CapSpeech-PT-SEDB](https://huggingface.co/datasets/OpenSound/CapSpeech-PT-SEDB-Audio) | [CapSpeech-PT](https://huggingface.co/datasets/OpenSound/CapSpeech-PT) |
| validation_PT_SEDB | Validation data for *CapTTS-SE* used in the **PT** stage | [CapSpeech-PT-SEDB](https://huggingface.co/datasets/OpenSound/CapSpeech-PT-SEDB-Audio) | [CapSpeech-PT](https://huggingface.co/datasets/OpenSound/CapSpeech-PT) |
| test_PT_SEDB | Test data for *CapTTS-SE* used in the **PT** stage | [CapSpeech-PT-SEDB](https://huggingface.co/datasets/OpenSound/CapSpeech-PT-SEDB-Audio) | [CapSpeech-PT](https://huggingface.co/datasets/OpenSound/CapSpeech-PT) |
| train_PT_SEDB_HQ | High-quality training data for *CapTTS-SE* used in the **PT** stage | [CapSpeech-PT-SEDB-Audio](https://huggingface.co/datasets/OpenSound/CapSpeech-PT-SEDB-Audio) | [CapSpeech-PT-SEDB-HQ](https://huggingface.co/datasets/OpenSound/CapSpeech-PT-SEDB-HQ) |
| validation_PT_SEDB_HQ | High-quality validation data for *CapTTS-SE* used in the **PT** stage | [CapSpeech-PT-SEDB-Audio](https://huggingface.co/datasets/OpenSound/CapSpeech-PT-SEDB-Audio) | [CapSpeech-PT-SEDB-HQ](https://huggingface.co/datasets/OpenSound/CapSpeech-PT-SEDB-HQ) |
| test_PT_SEDB_HQ | High-quality test data for *CapTTS-SE* used in the **PT** stage | [CapSpeech-PT-SEDB-Audio](https://huggingface.co/datasets/OpenSound/CapSpeech-PT-SEDB-Audio) | [CapSpeech-PT-SEDB-HQ](https://huggingface.co/datasets/OpenSound/CapSpeech-PT-SEDB-HQ) |
| train_SFT_CapTTS | Training data for *CapTTS* used in the **SFT** stage | [LibriTTS-R](https://www.openslr.org/141/), [VoxCeleb and VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/), [EARS](https://github.com/facebookresearch/ears_dataset/blob/main/download_ears.py), [Expresso](https://github.com/facebookresearch/textlesslib/tree/main/examples/expresso/dataset), [VCTK](https://datashare.ed.ac.uk/handle/10283/2950) | [CapTTS-SFT](https://huggingface.co/datasets/OpenSound/CapTTS-SFT) |
| validation_SFT_CapTTS | Validation data for *CapTTS* used in the **SFT** stage | [LibriTTS-R](https://www.openslr.org/141/), [VoxCeleb and VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/), [EARS](https://github.com/facebookresearch/ears_dataset/blob/main/download_ears.py), [Expresso](https://github.com/facebookresearch/textlesslib/tree/main/examples/expresso/dataset), [VCTK](https://datashare.ed.ac.uk/handle/10283/2950) | [CapTTS-SFT](https://huggingface.co/datasets/OpenSound/CapTTS-SFT) |
| test_SFT_CapTTS | Test data for *CapTTS* used in the **SFT** stage | [LibriTTS-R](https://www.openslr.org/141/), [VoxCeleb and VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/), [EARS](https://github.com/facebookresearch/ears_dataset/blob/main/download_ears.py), [Expresso](https://github.com/facebookresearch/textlesslib/tree/main/examples/expresso/dataset), [VCTK](https://datashare.ed.ac.uk/handle/10283/2950) | [CapTTS-SFT](https://huggingface.co/datasets/OpenSound/CapTTS-SFT) |
| train_SFT_EmoCapTTS | Training data for *EmoCapTTS* used in the **SFT** stage | [EARS](https://github.com/facebookresearch/ears_dataset/blob/main/download_ears.py), [Expresso](https://github.com/facebookresearch/textlesslib/tree/main/examples/expresso/dataset) | [CapTTS-SFT](https://huggingface.co/datasets/OpenSound/CapTTS-SFT) |
| validation_SFT_EmoCapTTS | Validation data for *EmoCapTTS* used in the **SFT** stage | [EARS](https://github.com/facebookresearch/ears_dataset/blob/main/download_ears.py), [Expresso](https://github.com/facebookresearch/textlesslib/tree/main/examples/expresso/dataset) | [CapTTS-SFT](https://huggingface.co/datasets/OpenSound/CapTTS-SFT) |
| test_SFT_EmoCapTTS | Test data for *EmoCapTTS* used in the **SFT** stage | [EARS](https://github.com/facebookresearch/ears_dataset/blob/main/download_ears.py), [Expresso](https://github.com/facebookresearch/textlesslib/tree/main/examples/expresso/dataset) | [CapTTS-SFT](https://huggingface.co/datasets/OpenSound/CapTTS-SFT) |
| train_SFT_AccCapTTS | Training data for *AccCapTTS* used in the **SFT** stage | [VoxCeleb and VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/), [VCTK](https://datashare.ed.ac.uk/handle/10283/2950) | [CapTTS-SFT](https://huggingface.co/datasets/OpenSound/CapTTS-SFT) |
| validation_SFT_AccCapTTS | Validation data for *AccCapTTS* used in the **SFT** stage | [VoxCeleb and VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/), [VCTK](https://datashare.ed.ac.uk/handle/10283/2950) | [CapTTS-SFT](https://huggingface.co/datasets/OpenSound/CapTTS-SFT) |
| test_SFT_AccCapTTS | Test data for *AccCapTTS* used in the **SFT** stage | [VoxCeleb and VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/), [VCTK](https://datashare.ed.ac.uk/handle/10283/2950) | [CapTTS-SFT](https://huggingface.co/datasets/OpenSound/CapTTS-SFT) |
| train_AgentDB | Training data for *AgentTTS* used in the **SFT** stage | [CapSpeech-AgentDB](https://huggingface.co/datasets/OpenSound/CapSpeech-AgentDB-Audio) | [CapSpeech-AgentDB](https://huggingface.co/datasets/OpenSound/CapSpeech-AgentDB) |
| test_AgentDB | Test data for *AgentTTS* used in the **SFT** stage | [CapSpeech-AgentDB](https://huggingface.co/datasets/OpenSound/CapSpeech-AgentDB-Audio) | [CapSpeech-AgentDB](https://huggingface.co/datasets/OpenSound/CapSpeech-AgentDB) |
| train_SEDB | Training data for *CapTTS-SE* used in the **SFT** stage | [CapSpeech-SEDB](https://huggingface.co/datasets/OpenSound/CapSpeech-SEDB-Audio) | [CapSpeech-SEDB](https://huggingface.co/datasets/OpenSound/CapSpeech-SEDB) |
| test_SEDB | Test data for *CapTTS-SE* used in the **SFT** stage | [CapSpeech-SEDB](https://huggingface.co/datasets/OpenSound/CapSpeech-SEDB-Audio) | [CapSpeech-SEDB](https://huggingface.co/datasets/OpenSound/CapSpeech-SEDB) |

## Citation

If you use this dataset, the models or the repository, please cite our work as follows:

```bibtex
@misc{wang2025capspeechenablingdownstreamapplications,
      title={CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech},
      author={Helin Wang and Jiarui Hai and Dading Chong and Karan Thakkar and Tiantian Feng and Dongchao Yang and Junhyeok Lee and Laureano Moro Velazquez and Jesus Villalba and Zengyi Qin and Shrikanth Narayanan and Mounya Elhiali and Najim Dehak},
      year={2025},
      eprint={2506.02863},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2506.02863},
}
```
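
## Example: Resolving `audio_path`

The Dataset Structure section notes that `audio_path` is relative and has to be mapped to an absolute path in your environment. The snippet below is a minimal sketch of one way to do that, not the authors' method. It assumes each `source` value corresponds to one local root directory and that `audio_path` is relative to that root. The `AUDIO_ROOTS` mapping, its directory names, the `resolve_audio_path` helper, and the `audio_abs_path` field are all hypothetical and should be adapted to where you actually stored the audio.

```python
from pathlib import Path

# Hypothetical local layout: one root directory per `source` value.
# These directories are assumptions, not something the dataset defines.
AUDIO_ROOTS = {
    "gigaspeech": Path("/data/gigaspeech"),
    "libritts-r": Path("/data/libritts_r"),
}


def resolve_audio_path(example: dict, roots: dict = AUDIO_ROOTS) -> dict:
    """Return a copy of the row with an added absolute `audio_abs_path`."""
    root = roots.get(example["source"])
    if root is None:
        # Fail loudly instead of guessing where an unknown source lives.
        raise KeyError(f"no local root configured for source {example['source']!r}")
    resolved = dict(example)
    resolved["audio_abs_path"] = str(root / example["audio_path"])
    return resolved


# A made-up row with the five columns listed in the Dataset Structure table.
row = {
    "source": "gigaspeech",
    "audio_path": "audio/example.wav",
    "text": "hello world",
    "caption": "A calm voice.",
    "speech_duration": 1.5,
}
print(resolve_audio_path(row)["audio_abs_path"])
```

Because the helper takes and returns a plain dict, it should also work as the function passed to `Dataset.map` on a loaded split, provided every `source` value in that split has an entry in the mapping.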


Provider: maas

Created: 2025-08-26
