NonverbalTTS
Source: ModelScope community. Updated: 2025-12-05. Indexed: 2025-08-02.
Download link:
https://modelscope.cn/datasets/deepvk/NonverbalTTS
Description:
# NonverbalTTS Dataset 🎵🗣️
[Paper (ISCA Archive, SSW 2025)](https://www.isca-archive.org/ssw_2025/borisov25_ssw.html)
[arXiv:2507.13155](https://arxiv.org/abs/2507.13155)
[Hugging Face dataset](https://huggingface.co/datasets/deepvk/NonverbalTTS)
**NonverbalTTS** is a 17-hour open-access English speech corpus with aligned text annotations for **nonverbal vocalizations (NVs)** and **emotional categories**, designed to advance expressive text-to-speech (TTS) research.
## Key Features ✨
- **17 hours** of high-quality speech data
- **10 NV types**: Breathing, laughter, sighing, sneezing, coughing, throat clearing, groaning, grunting, snoring, sniffing
- **8 emotion categories**: Angry, disgusted, fearful, happy, neutral, sad, surprised, other
- **Diverse speakers**: 2296 speakers (60% male, 40% female)
- **Multi-source**: Derived from [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) and [Expresso](https://speechbot.github.io/expresso/) corpora
- **Rich metadata**: Emotion labels, NV annotations, speaker IDs, audio quality metrics
- **Sampling rate**: 16 kHz for audio from VoxCeleb, 48 kHz for audio from Expresso
<!-- ## Dataset Structure 📂
NonverbalTTS/
├── wavs/ # Audio files (16-48kHz WAV format)
│ ├── ex01_sad_00265.wav
│ └── ...
├── .gitattributes
├── README.md
└── metadata.csv # Metadata annotations -->
<!-- ## Metadata Schema (`metadata.csv`) 📋
| Column | Description | Example |
|--------|-------------|---------|
| `index` | Unique sample ID | `ex01_sad_00265` |
| `file_name` | Audio file path | `wavs/ex01_sad_00265.wav` |
| `Emotion` | Emotion label | `sad` |
| `Initial text` | Raw transcription | `"So, Mom, 🌬️ how've you been?"` |
| `Annotator response {1,2,3}` | Refined transcriptions | `"So, Mom, how've you been?"` |
| `Result` | Final fused transcription | `"So, Mom, 🌬️ how've you been?"` |
| `dnsmos` | Audio quality score (1-5) | `3.936982` |
| `duration` | Audio length (seconds) | `3.6338125` |
| `speaker_id` | Speaker identifier | `ex01` |
| `data_name` | Source corpus | `Expresso` |
| `gender` | Speaker gender | `m` | -->
<!-- **NV Symbols**: 🌬️=Breath, 😂=Laughter, etc. (See [Annotation Guidelines](https://zenodo.org/records/15274617)) -->
## Loading the Dataset 💻
```python
from datasets import load_dataset
dataset = load_dataset("deepvk/NonverbalTTS")
```
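
Once loaded, the transcriptions can be scanned for NV markers. The following is a minimal sketch under assumptions: it assumes each row exposes the final transcription in a `Result` field and that NVs appear as inline emoji, as in the sample text `"So, Mom, 🌬️ how've you been?"`. Only the 🌬️ (breath) and 😂 (laughter) symbols are named in this card; the helper name `count_nv_markers` is hypothetical, and the rows below are hand-written stand-ins rather than real dataset entries. Check `dataset["train"].column_names` before relying on any field name.

```python
from collections import Counter

# Only these two symbols are named in the dataset card; the full mapping
# lives in the annotation guidelines and is not reproduced here.
NV_SYMBOLS = {
    "\U0001F32C": "breath",    # 🌬️ (matches with or without the variation selector)
    "\U0001F602": "laughter",  # 😂
}


def count_nv_markers(rows, text_field="Result"):
    """Count occurrences of known NV symbols across an iterable of row dicts."""
    counts = Counter()
    for row in rows:
        text = row.get(text_field) or ""  # tolerate missing or empty transcriptions
        for symbol, label in NV_SYMBOLS.items():
            counts[label] += text.count(symbol)
    return counts


# Hand-written rows standing in for dataset["train"]; the field name is an assumption.
sample_rows = [
    {"Result": "So, Mom, \U0001F32C\uFE0F how've you been?"},
    {"Result": "That is hilarious \U0001F602 \U0001F602"},
    {"Result": None},
]
nv_counts = count_nv_markers(sample_rows)
print(dict(nv_counts))
```

With the real dataset, the same helper could be called as `count_nv_markers(dataset["train"])`, provided the column name matches.
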
<!-- # Access train split
```print(dataset["train"][0])```
# Output: {'index': 'ex01_sad_00265', 'file_name': 'wavs/ex01_sad_00265.wav', ...}
-->
## Annotation Pipeline 🔧
1. **Automatic Detection**
   - NV detection using [BEATs](https://arxiv.org/abs/2409.09546)
   - Emotion classification with [emotion2vec+](https://huggingface.co/emotion2vec/emotion2vec_plus_large)
   - ASR transcription via Canary model
2. **Human Validation**
   - 3 annotators per sample
   - Filtered non-English/multi-speaker clips
   - NV/emotion validation and refinement
3. **Fusion Algorithm**
   - Majority voting for final transcriptions
   - Pyalign-based sequence alignment
   - Multi-annotator hypothesis merging
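
The fusion step can be illustrated with a small token-level majority vote. This is a sketch under assumptions, not the authors' algorithm: it uses Python's standard `difflib.SequenceMatcher` as a stand-in for Pyalign, aligns every annotator response to the first one (the pivot), and keeps a token only when a majority of annotators agree on it. The function name `fuse_transcriptions` and the example responses are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

BREATH = "\U0001F32C\uFE0F"  # 🌬️, the breath marker used in the card's sample text


def fuse_transcriptions(hypotheses):
    """Fuse whitespace-tokenized annotator responses by majority vote against a pivot."""
    token_lists = [h.split() for h in hypotheses]
    pivot, others = token_lists[0], token_lists[1:]
    majority = len(token_lists) // 2 + 1

    keep_votes = [1] * len(pivot)  # the pivot votes for each of its own tokens
    insert_votes = Counter()       # (pivot position, token) -> votes from other annotators

    for other in others:
        matcher = SequenceMatcher(a=pivot, b=other, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                for i in range(i1, i2):
                    keep_votes[i] += 1
            elif tag in ("insert", "replace"):
                # Tokens the other annotator has where the pivot has none or differs.
                for token in other[j1:j2]:
                    insert_votes[(i1, token)] += 1

    fused = []
    for i in range(len(pivot) + 1):
        # Tokens that enough other annotators placed before pivot position i.
        fused.extend(tok for (pos, tok), n in insert_votes.items()
                     if pos == i and n >= majority)
        if i < len(pivot) and keep_votes[i] >= majority:
            fused.append(pivot[i])
    return " ".join(fused)


responses = [
    "So, Mom, how've you been?",
    f"So, Mom, {BREATH} how've you been?",
    f"So, Mom, {BREATH} how've you been?",
]
fused = fuse_transcriptions(responses)
print(fused)
```

Here two of three annotators marked a breath, so the marker survives the vote; a marker added by only one annotator would be dropped.
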
## Benchmark Results 📊
Fine-tuning CosyVoice-300M on NonverbalTTS (reported as NVTTS below) achieves parity with state-of-the-art proprietary systems:
| Metric | NVTTS | CosyVoice2 |
| ------- | ------- | ------- |
| Speaker Similarity | 0.89 | 0.85 |
| NV Jaccard | 0.8 | 0.78 |
| Human Preference | 33.4% | 35.4% |
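
For reference, a Jaccard index over NV labels can be computed as below. This is a sketch under the assumption that "NV Jaccard" compares the set of NV types expected in the reference text with the set detected in the synthesized audio; the card does not spell out the exact protocol, and the function name `nv_jaccard` and the example labels are hypothetical.

```python
def nv_jaccard(reference_nvs, predicted_nvs):
    """Jaccard index |A ∩ B| / |A ∪ B| between two collections of NV labels."""
    ref, pred = set(reference_nvs), set(predicted_nvs)
    if not ref and not pred:
        return 1.0  # convention chosen here: two empty sets count as a perfect match
    return len(ref & pred) / len(ref | pred)


# One shared label out of three distinct labels overall.
score = nv_jaccard({"laughter", "breathing"}, {"laughter", "sighing"})
print(f"{score:.2f}")
```

A corpus-level figure would then typically be the mean of the per-utterance scores, though that aggregation is also an assumption.
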
## Use Cases 💡
- Training expressive TTS models
- Zero-shot NV synthesis
- Emotion-aware speech generation
- Prosody modeling research
## License 📜
- Annotations: CC BY-NC-SA 4.0
- Audio: Adheres to original source licenses (VoxCeleb, Expresso)
## Citation 📝
```
@inproceedings{borisov25_ssw,
title = {{NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech}},
author = {Maksim Borisov and Egor Spirin and Daria Diatlova},
year = {2025},
booktitle = {{13th edition of the Speech Synthesis Workshop}},
pages = {104--109},
doi = {10.21437/SSW.2025-16},
}
```
Provider: maas
Created: 2025-08-01



