NonverbalTTS

Name: NonverbalTTS
Creator: maas
Published: 2025-12-05 16:44:14
License: No description available

ModelScope community. Updated 2025-12-05; indexed 2025-08-02.

Download link:

https://modelscope.cn/datasets/deepvk/NonverbalTTS

Description:

# NonverbalTTS Dataset 🎵🗣️

[![interspeech](https://img.shields.io/badge/isca_archive-borisov25_ssw-red.svg?style=plastic)](https://www.isca-archive.org/ssw_2025/borisov25_ssw.html) [![arxiv](https://img.shields.io/badge/arXiv-2507.13155-b31b1b.svg?style=plastic)](https://arxiv.org/abs/2507.13155) [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/deepvk/NonverbalTTS)

**NonverbalTTS** is a 17-hour open-access English speech corpus with aligned text annotations for **nonverbal vocalizations (NVs)** and **emotional categories**, designed to advance expressive text-to-speech (TTS) research.

## Key Features ✨

- **17 hours** of high-quality speech data
- **10 NV types**: Breathing, laughter, sighing, sneezing, coughing, throat clearing, groaning, grunting, snoring, sniffing
- **8 emotion categories**: Angry, disgusted, fearful, happy, neutral, sad, surprised, other
- **Diverse speakers**: 2296 speakers (60% male, 40% female)
- **Multi-source**: Derived from [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) and [Expresso](https://speechbot.github.io/expresso/) corpora
- **Rich metadata**: Emotion labels, NV annotations, speaker IDs, audio quality metrics
- **Sampling rate**: 16 kHz for audio from VoxCeleb, 48 kHz for audio from Expresso

## Loading the Dataset 💻

```python
from datasets import load_dataset

dataset = load_dataset("deepvk/NonverbalTTS")
```

## Annotation Pipeline 🔧

1. **Automatic Detection**
   - NV detection using [BEATs](https://arxiv.org/abs/2409.09546)
   - Emotion classification with [emotion2vec+](https://huggingface.co/emotion2vec/emotion2vec_plus_large)
   - ASR transcription via Canary model
2. **Human Validation**
   - 3 annotators per sample
   - Filtered non-English/multi-speaker clips
   - NV/emotion validation and refinement
3. **Fusion Algorithm**
   - Majority voting for final transcriptions
   - Pyalign-based sequence alignment
   - Multi-annotator hypothesis merging

## Benchmark Results 📊

Fine-tuning CosyVoice-300M on NonverbalTTS achieves parity with state-of-the-art proprietary systems:

| Metric | NVTTS | CosyVoice2 |
| ------- | ------- | ------- |
| Speaker Similarity | 0.89 | 0.85 |
| NV Jaccard | 0.8 | 0.78 |
| Human Preference | 33.4% | 35.4% |

## Use Cases 💡

- Training expressive TTS models
- Zero-shot NV synthesis
- Emotion-aware speech generation
- Prosody modeling research

## License 📜

- Annotations: CC BY-NC-SA 4.0
- Audio: Adheres to original source licenses (VoxCeleb, Expresso)

## Citation 📝

```
@inproceedings{borisov25_ssw,
  title = {{NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech}},
  author = {Maksim Borisov and Egor Spirin and Daria Diatlova},
  year = {2025},
  booktitle = {{13th edition of the Speech Synthesis Workshop}},
  pages = {104--109},
  doi = {10.21437/SSW.2025-16},
}
```
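
The fusion step of the annotation pipeline lists majority voting over the annotators' transcriptions. As a rough illustration only, the sketch below votes position by position over transcriptions that are assumed to be already aligned token by token; the Pyalign-based alignment used by the authors is not reproduced here. The function name `majority_vote`, the `<gap>` placeholder, and the `[laughter]` tag format are hypothetical and not taken from the dataset.

```python
from collections import Counter

GAP = "<gap>"  # hypothetical placeholder meaning "no token at this position"


def majority_vote(aligned):
    """Fuse equally long, pre-aligned token lists by per-position majority."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(tokens) != length for tokens in aligned):
        raise ValueError("all transcriptions must be aligned to the same length")
    fused = []
    for column in zip(*aligned):
        # most_common(1) yields the most frequent token in this column;
        # on a tie, the token seen first in the column wins.
        token, _ = Counter(column).most_common(1)[0]
        if token != GAP:
            fused.append(token)
    return fused


# Three hypothetical annotators; the third one missed the laughter tag.
annotations = [
    ["well", "[laughter]", "that", "was", "fun"],
    ["well", "[laughter]", "that", "was", "fun"],
    ["well", GAP, "that", "was", "fun"],
]
print(" ".join(majority_vote(annotations)))
```

With three annotators per sample, a token kept by two of them survives the vote, while a token proposed by only one is dropped.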

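
The NV Jaccard row in the benchmark table compares sets of nonverbal vocalizations. The Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|. The sketch below applies that standard definition to sets of NV labels. How the paper extracts labels from reference and synthesized speech, and how it scores a pair of empty sets, is not stated on this page, so the function name `nv_jaccard` and the empty-set convention are assumptions.

```python
def nv_jaccard(reference, predicted):
    """Jaccard index between two collections of NV labels."""
    ref, pred = set(reference), set(predicted)
    union = ref | pred
    if not union:
        # Assumption: two empty label sets count as a perfect match.
        return 1.0
    return len(ref & pred) / len(union)


# One shared label out of three distinct labels overall.
score = nv_jaccard({"laughter", "breathing"}, {"laughter", "sighing"})
print(round(score, 2))
```

A score of 1.0 means the synthesized audio contains exactly the NV types requested; lower scores reflect missing or spurious NV types.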

Provider:

maas

Created:

2025-08-01
