TTS-AGI/vocal-burst-annotation-asr-tuning-dataset
Source: Hugging Face | Updated: 2026-04-11 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/TTS-AGI/vocal-burst-annotation-asr-tuning-dataset
Resource description:
---
license: cc-by-4.0
language:
- en
- de
- fr
- ja
- zh
task_categories:
- automatic-speech-recognition
tags:
- vocal-bursts
- speaker-diarization
- timestamps
- augmented
- multi-speaker
- multilingual
pretty_name: Vocal Burst Annotation ASR Tuning Dataset
size_categories:
- 100K<n<1M
---
# Vocal Burst Annotation ASR Tuning Dataset
A synthetic **500,000-sample** multilingual dataset for training ASR models with **inline vocal burst captioning**, **speaker diarization**, and **sentence-level timestamps**. Each sample is approximately 1 minute of audio containing speech segments interleaved with vocal bursts (laughs, sighs, coughs, etc.), annotated with precise timing information.
## Example Transcript
```
[nasalized, affirmative hum, steady pitch, moderate intensity] <Speaker_1> Hello world, this is a test.
[breathy, staccato, high-pitched laugh, moderate intensity] <Speaker_2> How are you doing today?
<Speaker_1> I'm fine, thank you very much. [Trembling Whimper faint cry indicating fear or pain]
<Speaker_2> That sounds great, let me tell you about my day.
```
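The transcript format above can be split back into speaker turns and vocal-burst captions with a small parser. This is a minimal sketch, not part of the dataset tooling: the function name `parse_transcript` is hypothetical, and it assumes that captions never contain `]` and that spoken text never contains `[` or `<`.

```python
import re

# Either a bracketed vocal-burst caption, or a speaker tag followed by its text.
TOKEN_RE = re.compile(
    r"\[(?P<burst>[^\]]+)\]"
    r"|<(?P<speaker>Speaker_\d+)>\s*(?P<text>[^\[<]*)"
)

def parse_transcript(transcript):
    """Return a list of ("burst", caption) and ("speech", speaker, text) events."""
    events = []
    for match in TOKEN_RE.finditer(transcript):
        if match.group("burst") is not None:
            events.append(("burst", match.group("burst").strip()))
        else:
            events.append(
                ("speech", match.group("speaker"), match.group("text").strip())
            )
    return events
```

Events come back in transcript order, so a burst that precedes a speaker tag stays ahead of that speaker's text.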
## Dataset Construction
### Source Data
1. **Speech segments**: Drawn from multiple multilingual speech datasets:
   - [TTS-AGI/emolia-hq](https://huggingface.co/datasets/TTS-AGI/emolia-hq): English, Chinese, Japanese (Emilia HQ)
   - [laion/Emolia](https://huggingface.co/datasets/laion/Emolia): German, French
2. **Vocal bursts**: 8,200 samples from 82 categories, sourced from:
   - [TTS-AGI/vocal-bursts-taxonomy-DACVAE](https://huggingface.co/datasets/TTS-AGI/vocal-bursts-taxonomy-DACVAE), with Gemini Flash Lite verification and dense captions
3. **Background music**: [laion/laion-tunes-rpg-music](https://huggingface.co/datasets/laion/laion-tunes-rpg-music), instrumental RPG music used for background augmentation
### Construction Pipeline
Each ~60-second sample is built by:
1. **Speaker selection**: 1–4 speakers chosen per sample (50% single-speaker, 50% multi-speaker)
2. **Speech concatenation**: Speech snippets from chosen speakers are concatenated sequentially
3. **Vocal burst insertion**: Between speech segments, vocal bursts are inserted with a 33% probability. There is also a 33% chance of a vocal burst at the very beginning.
4. **Vocal burst augmentation**: Each inserted vocal burst undergoes a random speed change of ±10%
5. **Global augmentations** (mutually exclusive):
   - 20%: Telephone effect (downsample to 8 kHz, upsample back)
   - 20%: Noise injection (light Gaussian noise)
   - 20%: Background music overlay (10–25% of speech volume)
   - 40%: Clean (no augmentation)
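The random decisions in the pipeline above can be sketched as follows. This only illustrates the stated probabilities and is not the authors' code. The function `plan_sample`, the uniform split over 2 to 4 speakers, the uniform speed factor, and the labels `"noise"` and `"music"` are all assumptions. No audio is processed here.

```python
import random

AUGMENTATIONS = ["telephone", "noise", "music", "clean"]
AUG_WEIGHTS = [0.20, 0.20, 0.20, 0.40]  # mutually exclusive options
BURST_PROB = 0.33

def plan_sample(num_segments, rng):
    """Draw the random decisions for one sample (no audio processing)."""
    if num_segments < 1:
        raise ValueError("need at least one speech segment")
    # 50% single-speaker; the uniform 2-4 split is an assumption.
    num_speakers = 1 if rng.random() < 0.5 else rng.randint(2, 4)
    # Slot 0 is before the first segment; slot i (i >= 1) is the gap
    # between segment i-1 and segment i.
    burst_slots = [i for i in range(num_segments) if rng.random() < BURST_PROB]
    # Speed factor within +/-10%; uniform sampling is an assumption.
    speed_factors = {i: rng.uniform(0.9, 1.1) for i in burst_slots}
    choice = rng.choices(AUGMENTATIONS, weights=AUG_WEIGHTS, k=1)[0]
    return {
        "num_speakers": num_speakers,
        "burst_slots": burst_slots,
        "speed_factors": speed_factors,
        "augmentations": [] if choice == "clean" else [choice],
    }
```

Passing a seeded `random.Random` instance as `rng` makes a plan reproducible, which helps when checking the sampled proportions against the statistics reported for the dataset.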
### Vocal Burst Labeling
Labels are chosen based on Gemini Flash Lite verification scores:
- **Score 0 or 1** (poor/slight match): Always use the Gemini dense caption (e.g., `nasalized, affirmative hum, steady pitch, moderate intensity`)
- **Score 2** (well matched): 50% chance Gemini caption, 50% chance original taxonomy prompt (e.g., `Affirmative Grunt short sound indicating agreement`)
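The labeling rule can be written as a short function. This is a sketch under the rule as stated. The name `choose_label` is hypothetical, and rejecting scores outside 0 to 2 is an assumption, since the card only describes those three values.

```python
import random

def choose_label(match_score, gemini_caption, taxonomy_prompt, rng):
    """Pick the text that goes inside the brackets for one vocal burst."""
    if match_score not in (0, 1, 2):
        raise ValueError(f"unexpected match score: {match_score}")
    if match_score < 2:
        # Poor or slight match: the taxonomy prompt is not trusted.
        return gemini_caption
    # Well matched: either description is acceptable, so pick one at random.
    return gemini_caption if rng.random() < 0.5 else taxonomy_prompt
```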
## Output Format
- **Audio**: 24 kHz mono MP3, 64 kbps
- **Packaging**: WebDataset tar shards (1,000 samples per shard)
- Each sample consists of `{key}.mp3` + `{key}.json`
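A shard can be read with the Python standard library alone, since a WebDataset shard is an ordinary tar archive. The sketch below pairs each `{key}.mp3` with its `{key}.json`. The function name `iter_shard` and the shard filename in the usage note are hypothetical, and decoding the MP3 bytes is left to whichever audio library you use.

```python
import json
import tarfile

def iter_shard(tar):
    """Yield (key, mp3_bytes, metadata) for each complete sample in an open TarFile."""
    pending = {}
    for member in tar:
        if not member.isfile():
            continue
        key, _, ext = member.name.rpartition(".")
        if ext not in ("mp3", "json"):
            continue
        parts = pending.setdefault(key, {})
        parts[ext] = tar.extractfile(member).read()
        if "mp3" in parts and "json" in parts:
            # Both halves of the sample have been seen; emit it.
            del pending[key]
            yield key, parts["mp3"], json.loads(parts["json"])
```

Usage, assuming a locally downloaded shard named `shard-000000.tar` (the real shard names may differ): open it with `tarfile.open("shard-000000.tar")` and iterate over `iter_shard(tar)` inside the `with` block.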
### JSON Metadata Structure
```json
{
"transcript": "<Speaker_1> Hello world. [breathy laugh] <Speaker_2> How are you?",
"segments": [
{
"type": "speech",
"speaker": "EN_B00045_S00003",
"speaker_label": "Speaker_1",
"text": "Hello world.",
"language": "en",
"source_key": "EN_B00045_S00003_W000012",
"start": 0.0,
"end": 3.5,
"duration": 3.5
},
{
"type": "vocal_burst",
"label_used": "breathy, staccato, high-pitched laugh, moderate intensity",
"prompt": "Cackle loud raucous laugh often with a sharp edge",
"gemini_caption": "breathy, staccato, high-pitched laugh, moderate intensity",
"gemini_match_score": 2,
"category": "Cackle",
"key": "female/sample000123",
"gender": "female",
"speed_factor": 1.05,
"start": 3.5,
"end": 11.2,
"duration": 7.7
},
...
],
"augmentations": ["telephone"],
"speakers": ["EN_B00045_S00003", "EN_B00045_S00013"],
"language": "en",
"duration": 63.5,
"num_speakers": 2
}
```
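The `segments` list carries enough information to sanity-check a sample's timeline and to print a timestamped view of it. The following is a minimal sketch using only the fields shown above. The helper names are hypothetical, and it assumes segments are listed in time order and do not overlap, which the card does not state explicitly.

```python
def timeline_problems(segments, tol=1e-3):
    """Return human-readable problems found in a sample's segment timeline."""
    problems = []
    prev_end = 0.0
    for i, seg in enumerate(segments):
        if abs((seg["end"] - seg["start"]) - seg["duration"]) > tol:
            problems.append(f"segment {i}: duration does not match end - start")
        if seg["start"] < prev_end - tol:
            problems.append(f"segment {i}: overlaps the previous segment")
        prev_end = seg["end"]
    return problems

def timestamped_lines(segments):
    """Format each segment as 'start-end body' with times in seconds."""
    lines = []
    for seg in segments:
        if seg["type"] == "vocal_burst":
            body = f"[{seg['label_used']}]"
        else:
            body = f"<{seg['speaker_label']}> {seg['text']}"
        lines.append(f"{seg['start']:.2f}-{seg['end']:.2f} {body}")
    return lines
```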
## Statistics (computed on a 20,000-sample subset)
### Vocal Burst Distribution
| Vocal Bursts per Sample | Percentage |
|---|---|
| 0 (no bursts) | 6.0% |
| 1 | 26.0% |
| 2 | 37.8% |
| 3 | 23.6% |
| 4 | 6.1% |
| 5+ | 0.5% |
**94.0% of samples contain at least one vocal burst.**
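As a quick check of the table, the mean number of bursts per sample can be estimated from the reported shares. The sketch below counts the "5+" bucket as exactly 5, so the result is a lower bound, and the variable names are illustrative only.

```python
# Share of samples by number of vocal bursts, copied from the table above.
burst_share = {0: 0.060, 1: 0.260, 2: 0.378, 3: 0.236, 4: 0.061, 5: 0.005}

# Counting the "5+" bucket as exactly 5 makes this a lower bound.
mean_bursts = sum(count * share for count, share in burst_share.items())
share_with_burst = 1.0 - burst_share[0]

print(f"mean bursts per sample (lower bound): {mean_bursts:.2f}")
print(f"samples with at least one burst: {share_with_burst:.1%}")
```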
### Language Distribution
| Language | Percentage |
|---|---|
| English (en) | 36.9% |
| French (fr) | 20.2% |
| German (de) | 17.9% |
| Japanese (ja) | 16.6% |
| Chinese (zh) | 8.4% |
### Speaker Count Distribution
| Speakers | Percentage |
|---|---|
| 1 speaker | 49.5% |
| 2 speakers | 17.1% |
| 3 speakers | 16.4% |
| 4 speakers | 17.0% |
### Augmentation Distribution
| Augmentation | Percentage |
|---|---|
| Clean (none) | 40.2% |
| Telephone effect | 20.3% |
| Background music | 19.8% |
| Noise injection | 19.7% |
### Duration
- **Min**: 18.5s
- **Max**: 79.8s
- **Average**: 64.1s
- **Total audio**: ~8,900 hours
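The total follows from the sample count and the average duration. A short check, assuming the 64.1 s average from the sampled subset holds for all 500,000 samples:

```python
NUM_SAMPLES = 500_000
MEAN_DURATION_S = 64.1  # average from the sampled subset

total_hours = NUM_SAMPLES * MEAN_DURATION_S / 3600
print(f"estimated total audio: {total_hours:,.0f} hours")
```

This gives roughly 8,900 hours, in line with the figure above.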
### Vocal Burst Categories
All **82 categories** from the DACVAE taxonomy are represented, with balanced usage across categories. Each burst is used approximately 62 times across the dataset.
**Label source**: 53.7% Gemini captions, 46.3% original taxonomy prompts.
## Intended Use
This dataset is designed for training and fine-tuning ASR models that need to:
- **Transcribe speech with inline vocal burst annotations** (e.g., `[laughs]`, `[sighs]`)
- **Perform speaker diarization** (identify who is speaking)
- **Generate sentence-level timestamps** (precise start/end times for each segment)
- **Handle noisy/degraded audio** (telephone, background noise, music)
- **Support multilingual transcription** (EN, DE, FR, JA, ZH)
## License
CC-BY-4.0 — Attribution required.
### Attribution
- Speech data: [amphion/Emilia-Dataset](https://huggingface.co/datasets/amphion/Emilia-Dataset), [laion/Emolia](https://huggingface.co/datasets/laion/Emolia)
- Vocal bursts: [TTS-AGI/vocal-bursts-taxonomy-DACVAE](https://huggingface.co/datasets/TTS-AGI/vocal-bursts-taxonomy-DACVAE), derived from [krishnakalyan3/vocal_bursts_taxonomy_100_clean_wds](https://huggingface.co/datasets/krishnakalyan3/vocal_bursts_taxonomy_100_clean_wds)
- Background music: [laion/laion-tunes-rpg-music](https://huggingface.co/datasets/laion/laion-tunes-rpg-music)
Provider:
TTS-AGI



