
NonverbalTTS

ModelScope community · updated 2025-12-05 · indexed 2025-08-02
Download link:
https://modelscope.cn/datasets/deepvk/NonverbalTTS
Description:
# NonverbalTTS Dataset 🎵🗣️

[![interspeech](https://img.shields.io/badge/isca_archive-borisov25_ssw-red.svg?style=plastic)](https://www.isca-archive.org/ssw_2025/borisov25_ssw.html)
[![arxiv](https://img.shields.io/badge/arXiv-2507.13155-b31b1b.svg?style=plastic)](https://arxiv.org/abs/2507.13155)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/deepvk/NonverbalTTS)

**NonverbalTTS** is a 17-hour open-access English speech corpus with aligned text annotations for **nonverbal vocalizations (NVs)** and **emotional categories**, designed to advance expressive text-to-speech (TTS) research.

## Key Features ✨

- **17 hours** of high-quality speech data
- **10 NV types**: Breathing, laughter, sighing, sneezing, coughing, throat clearing, groaning, grunting, snoring, sniffing
- **8 emotion categories**: Angry, disgusted, fearful, happy, neutral, sad, surprised, other
- **Diverse speakers**: 2296 speakers (60% male, 40% female)
- **Multi-source**: Derived from [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) and [Expresso](https://speechbot.github.io/expresso/) corpora
- **Rich metadata**: Emotion labels, NV annotations, speaker IDs, audio quality metrics
- **Sampling rate**: 16kHz for audio from VoxCeleb, 48kHz for audio from Expresso

<!-- ## Dataset Structure 📂

NonverbalTTS/
├── wavs/                  # Audio files (16-48kHz WAV format)
│   ├── ex01_sad_00265.wav
│   └── ...
├── .gitattributes
├── README.md
└── metadata.csv           # Metadata annotations
-->

<!-- ## Metadata Schema (`metadata.csv`) 📋

| Column | Description | Example |
|--------|-------------|---------|
| `index` | Unique sample ID | `ex01_sad_00265` |
| `file_name` | Audio file path | `wavs/ex01_sad_00265.wav` |
| `Emotion` | Emotion label | `sad` |
| `Initial text` | Raw transcription | `"So, Mom, 🌬️ how've you been?"` |
| `Annotator response {1,2,3}` | Refined transcriptions | `"So, Mom, how've you been?"` |
| `Result` | Final fused transcription | `"So, Mom, 🌬️ how've you been?"` |
| `dnsmos` | Audio quality score (1-5) | `3.936982` |
| `duration` | Audio length (seconds) | `3.6338125` |
| `speaker_id` | Speaker identifier | `ex01` |
| `data_name` | Source corpus | `Expresso` |
| `gender` | Speaker gender | `m` |
-->

<!-- **NV Symbols**: 🌬️=Breath, 😂=Laughter, etc. (See [Annotation Guidelines](https://zenodo.org/records/15274617)) -->

## Loading the Dataset 💻

```python
from datasets import load_dataset

dataset = load_dataset("deepvk/NonverbalTTS")
```

<!-- # Access train split
```print(dataset["train"][0])```
# Output: {'index': 'ex01_sad_00265', 'file_name': 'wavs/ex01_sad_00265.wav', ...}
-->

## Annotation Pipeline 🔧

1. **Automatic Detection**
   - NV detection using [BEATs](https://arxiv.org/abs/2409.09546)
   - Emotion classification with [emotion2vec+](https://huggingface.co/emotion2vec/emotion2vec_plus_large)
   - ASR transcription via Canary model
2. **Human Validation**
   - 3 annotators per sample
   - Filtered non-English/multi-speaker clips
   - NV/emotion validation and refinement
3. **Fusion Algorithm**
   - Majority voting for final transcriptions
   - Pyalign-based sequence alignment
   - Multi-annotator hypothesis merging

## Benchmark Results 📊

Fine-tuning CosyVoice-300M on NonverbalTTS achieves parity with state-of-the-art proprietary systems:

| Metric | NVTTS | CosyVoice2 |
| ------- | ------- | ------- |
| Speaker Similarity | 0.89 | 0.85 |
| NV Jaccard | 0.8 | 0.78 |
| Human Preference | 33.4% | 35.4% |

## Use Cases 💡

- Training expressive TTS models
- Zero-shot NV synthesis
- Emotion-aware speech generation
- Prosody modeling research

## License 📜

- Annotations: CC BY-NC-SA 4.0
- Audio: Adheres to original source licenses (VoxCeleb, Expresso)

## Citation 📝

```
@inproceedings{borisov25_ssw,
  title     = {{NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech}},
  author    = {Maksim Borisov and Egor Spirin and Daria Diatlova},
  year      = {2025},
  booktitle = {{13th edition of the Speech Synthesis Workshop}},
  pages     = {104--109},
  doi       = {10.21437/SSW.2025-16},
}
```
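
## Example: NV Voting and Jaccard Sketch 🧪

The majority-voting step of the annotation pipeline and the NV Jaccard row of the benchmark table can be illustrated with a small sketch. This is not the authors' implementation. It assumes that NVs are marked inline with emoji symbols (only breath and laughter are named on this card), that voting is done on the presence of each symbol rather than on its aligned position (the real pipeline uses pyalign-based alignment), and that NV Jaccard is the set-based Jaccard index over the NV types in two transcriptions. All function and constant names are hypothetical.

```python
from collections import Counter

# Assumed symbols: the card names only breath and laughter; the full mapping
# is in the annotation guidelines. Base code points are used so that an
# optional emoji variation selector after the symbol does not affect matching.
BREATH = "\U0001F32C"    # breath symbol
LAUGHTER = "\U0001F602"  # laughter symbol
NV_SYMBOLS = {BREATH, LAUGHTER}


def extract_nvs(text, symbols=NV_SYMBOLS):
    """Return the set of NV symbols that occur anywhere in a transcription."""
    return {s for s in symbols if s in text}


def majority_vote_nvs(responses):
    """Keep each NV symbol marked by a strict majority of annotators.

    Simplification: votes on presence only and ignores where in the
    sentence the symbol was placed.
    """
    counts = Counter(s for r in responses for s in extract_nvs(r))
    return {s for s, c in counts.items() if c > len(responses) / 2}


def nv_jaccard(reference, hypothesis):
    """Set-based Jaccard index over the NV types in two transcriptions."""
    ref, hyp = extract_nvs(reference), extract_nvs(hypothesis)
    if not ref and not hyp:
        return 1.0  # assumption: two NV-free texts count as a perfect match
    return len(ref & hyp) / len(ref | hyp)


if __name__ == "__main__":
    responses = [
        f"So, Mom, {BREATH} how've you been?",
        "So, Mom, how've you been?",
        f"So, Mom, {BREATH} how've you been?",
    ]
    print(majority_vote_nvs(responses) == {BREATH})
    print(nv_jaccard(f"Hi {BREATH} there", f"Hi {BREATH} {LAUGHTER} there"))
```

With three annotators, a symbol survives the vote only if at least two of them marked it. A position-aware version would first align the three responses token by token and then vote per aligned slot.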

Provider:
maas
Created:
2025-08-01