Download link:

https://modelscope.cn/datasets/OpenSound/CapSpeech-PT

Resource overview:

# CapSpeech-PT

Pretraining dataset used for the paper: ***CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech***

This dataset is used for CapTTS and CapTTS-SE tasks.

Please refer to [CapSpeech](https://huggingface.co/datasets/OpenSound/CapSpeech) for the whole dataset.

## Dataset Fields

| Field Name | Type | Description |
|--------------------|-------------|-----------------------------------------------------------------------------|
| `audio_path` | `string` | File path to the audio sample. The actual audio is hosted separately. |
| `text` | `string` | The transcript corresponding to the audio sample. |
| `source` | `string` | The original dataset or corpus the audio is sourced from. |
| `speech_duration` | `float32` | Duration of the speech in seconds. |
| `pitch` | `string` | Descriptive label of pitch (e.g., "high", "low"). |
| `age` | `string` | Age group of the speaker (e.g., "child", "middle-aged"). |
| `gender` | `string` | Gender of the speaker (e.g., "male", "female"). |
| `speaking_rate` | `string` | Speaking speed (e.g., "slow", "fast"). |
| `speech_monotony` | `string` | Monotony or expressiveness of speech (e.g., "monotone", "expressive"). |
| `caption` | `string` | A natural language caption describing the style and traits of the speech. |
| `intrinsic_tags` | `list[str]` | Tags tied to a speaker's identity (e.g., shrill, guttural) (null if non-existent). |
| `situational_tags` | `list[str]` | Tags that characterize individual utterances (e.g., happy, whispered) (null if non-existent). |
| `basic_tags` | `list[str]` | Basic tags (pitch, speed, gender, noise conditions). |
| `all_tags` | `list[str]` | Combination of all tag types. |
| `accent` | `string` | Descriptive label for accent (e.g., "American", "Indian", "British"). |
| `noise` | `string` | Description of background noise. |

## Overview

🔥 CapSpeech is a new benchmark designed for style-captioned TTS (**CapTTS**) tasks, including style-captioned text-to-speech synthesis with sound effects (**CapTTS-SE**), accent-captioned TTS (**AccCapTTS**), emotion-captioned TTS (**EmoCapTTS**) and text-to-speech synthesis for chat agent (**AgentTTS**).

CapSpeech comprises over **10 million machine-annotated** audio-caption pairs and nearly **0.36 million human-annotated** audio-caption pairs. **3 new speech datasets** are specifically designed for the CapTTS-SE and AgentTTS tasks to enhance the benchmark’s coverage of real-world scenarios.

![Overview](https://raw.githubusercontent.com/WangHelin1997/CapSpeech-demo/main/static/images/present.jpg)

## License

⚠️ All resources are under the [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license.

## Citation

If you use this dataset, the models or the repository, please cite our work as follows:

```bibtex
@misc{wang2025capspeechenablingdownstreamapplications,
  title={CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech},
  author={Helin Wang and Jiarui Hai and Dading Chong and Karan Thakkar and Tiantian Feng and Dongchao Yang and Junhyeok Lee and Laureano Moro Velazquez and Jesus Villalba and Zengyi Qin and Shrikanth Narayanan and Mounya Elhiali and Najim Dehak},
  year={2025},
  eprint={2506.02863},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2506.02863},
}
```
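
## Example: working with the dataset fields

As a rough illustration of how the fields in the Dataset Fields table might be used, the sketch below filters metadata records by gender, duration and tag. It is not part of the official CapSpeech tooling: the sample records, the file paths and the helper names `gather_tags` and `select` are all hypothetical, and the only thing taken from the dataset description is the field names and the fact that `intrinsic_tags` and `situational_tags` may be null.

```python
from __future__ import annotations

from typing import Iterable, Optional

# Hypothetical records shaped like the field table. Every value here is
# invented for illustration; none of it comes from the real dataset.
SAMPLE_RECORDS = [
    {
        "audio_path": "example/a.wav",
        "text": "Hello there.",
        "speech_duration": 2.5,
        "gender": "female",
        "intrinsic_tags": ["shrill"],
        "situational_tags": None,  # null when no tag exists
        "basic_tags": ["high", "fast", "female"],
    },
    {
        "audio_path": "example/b.wav",
        "text": "Please keep your voice down.",
        "speech_duration": 7.0,
        "gender": "male",
        "intrinsic_tags": None,  # null when no tag exists
        "situational_tags": ["whispered"],
        "basic_tags": ["low", "slow", "male"],
    },
]


def gather_tags(record: dict) -> list[str]:
    """Collect intrinsic, situational and basic tags, treating null as empty."""
    tags: list[str] = []
    for key in ("intrinsic_tags", "situational_tags", "basic_tags"):
        tags.extend(record.get(key) or [])
    return tags


def select(
    records: Iterable[dict],
    gender: Optional[str] = None,
    max_duration: Optional[float] = None,
    required_tag: Optional[str] = None,
) -> list[dict]:
    """Return the records that satisfy every filter that was supplied."""
    kept = []
    for record in records:
        if gender is not None and record.get("gender") != gender:
            continue
        if max_duration is not None and record["speech_duration"] > max_duration:
            continue
        if required_tag is not None and required_tag not in gather_tags(record):
            continue
        kept.append(record)
    return kept


if __name__ == "__main__":
    for rec in select(SAMPLE_RECORDS, max_duration=5.0):
        print(rec["audio_path"], gather_tags(rec))
```

Guarding the tag lists with `or []` keeps the filters from failing on records where a tag field is null. The same predicates could be passed to a dataset library's filter method once the real metadata is loaded.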

Application scenarios: