
TTS-AGI/vocal-burst-annotation-asr-tuning-dataset

Source: Hugging Face | Updated: 2026-04-11 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/TTS-AGI/vocal-burst-annotation-asr-tuning-dataset
Resource description:
---
license: cc-by-4.0
language:
- en
- de
- fr
- ja
- zh
task_categories:
- automatic-speech-recognition
tags:
- vocal-bursts
- speaker-diarization
- timestamps
- augmented
- multi-speaker
- multilingual
pretty_name: Vocal Burst Annotation ASR Tuning Dataset
size_categories:
- 100K<n<1M
---

# Vocal Burst Annotation ASR Tuning Dataset

A synthetic **500,000-sample** multilingual dataset for training ASR models with **inline vocal burst captioning**, **speaker diarization**, and **sentence-level timestamps**.

Each sample is approximately 1 minute of audio containing speech segments interleaved with vocal bursts (laughs, sighs, coughs, etc.), annotated with precise timing information.

## Example Transcript

```
[nasalized, affirmative hum, steady pitch, moderate intensity] <Speaker_1> Hello world, this is a test. [breathy, staccato, high-pitched laugh, moderate intensity] <Speaker_2> How are you doing today? <Speaker_1> I'm fine, thank you very much. [Trembling Whimper faint cry indicating fear or pain] <Speaker_2> That sounds great, let me tell you about my day.
```

## Dataset Construction

### Source Data

1. **Speech segments**: Drawn from multiple multilingual speech datasets:
   - [TTS-AGI/emolia-hq](https://huggingface.co/datasets/TTS-AGI/emolia-hq): English, Chinese, Japanese (Emilia HQ)
   - [laion/Emolia](https://huggingface.co/datasets/laion/Emolia): German, French
2. **Vocal bursts**: 8,200 samples from 82 categories, sourced from:
   - [TTS-AGI/vocal-bursts-taxonomy-DACVAE](https://huggingface.co/datasets/TTS-AGI/vocal-bursts-taxonomy-DACVAE), with Gemini Flash Lite verification and dense captions
3. **Background music**: [laion/laion-tunes-rpg-music](https://huggingface.co/datasets/laion/laion-tunes-rpg-music), instrumental RPG music for background augmentation

### Construction Pipeline

Each ~60-second sample is built by:

1. **Speaker selection**: 1–4 speakers chosen per sample (50% single-speaker, 50% multi-speaker)
2. **Speech concatenation**: Speech snippets from chosen speakers are concatenated sequentially
3. **Vocal burst insertion**: Between speech segments, vocal bursts are inserted with a 33% probability. There is also a 33% chance of a vocal burst at the very beginning.
4. **Vocal burst augmentation**: Each inserted vocal burst undergoes a random speed change of ±10%
5. **Global augmentations** (mutually exclusive):
   - 20%: Telephone effect (downsample to 8 kHz, upsample back)
   - 20%: Noise injection (light Gaussian noise)
   - 20%: Background music overlay (10–25% of speech volume)
   - 40%: Clean (no augmentation)

### Vocal Burst Labeling

Labels are chosen based on Gemini Flash Lite verification scores:

- **Score 0 or 1** (poor/slight match): Always use the Gemini dense caption (e.g., `nasalized, affirmative hum, steady pitch, moderate intensity`)
- **Score 2** (well matched): 50% chance Gemini caption, 50% chance original taxonomy prompt (e.g., `Affirmative Grunt short sound indicating agreement`)

## Output Format

- **Audio**: 24 kHz mono MP3, 64 kbps
- **Packaging**: WebDataset tar shards (1,000 samples per shard)
- Each sample consists of `{key}.mp3` + `{key}.json`

### JSON Metadata Structure

```json
{
  "transcript": "<Speaker_1> Hello world. [breathy laugh] <Speaker_2> How are you?",
  "segments": [
    {
      "type": "speech",
      "speaker": "EN_B00045_S00003",
      "speaker_label": "Speaker_1",
      "text": "Hello world.",
      "language": "en",
      "source_key": "EN_B00045_S00003_W000012",
      "start": 0.0,
      "end": 3.5,
      "duration": 3.5
    },
    {
      "type": "vocal_burst",
      "label_used": "breathy, staccato, high-pitched laugh, moderate intensity",
      "prompt": "Cackle loud raucous laugh often with a sharp edge",
      "gemini_caption": "breathy, staccato, high-pitched laugh, moderate intensity",
      "gemini_match_score": 2,
      "category": "Cackle",
      "key": "female/sample000123",
      "gender": "female",
      "speed_factor": 1.05,
      "start": 3.5,
      "end": 11.2,
      "duration": 7.7
    },
    ...
  ],
  "augmentations": ["telephone"],
  "speakers": ["EN_B00045_S00003", "EN_B00045_S00013"],
  "language": "en",
  "duration": 63.5,
  "num_speakers": 2
}
```

## Statistics (sampled from 20,000 samples)

### Vocal Burst Distribution

| Vocal Bursts per Sample | Percentage |
|---|---|
| 0 (no bursts) | 6.0% |
| 1 | 26.0% |
| 2 | 37.8% |
| 3 | 23.6% |
| 4 | 6.1% |
| 5+ | 0.5% |

**94.0% of samples contain at least one vocal burst.**

### Language Distribution

| Language | Percentage |
|---|---|
| English (en) | 36.9% |
| French (fr) | 20.2% |
| German (de) | 17.9% |
| Japanese (ja) | 16.6% |
| Chinese (zh) | 8.4% |

### Speaker Count Distribution

| Speakers | Percentage |
|---|---|
| 1 speaker | 49.5% |
| 2 speakers | 17.1% |
| 3 speakers | 16.4% |
| 4 speakers | 17.0% |

### Augmentation Distribution

| Augmentation | Percentage |
|---|---|
| Clean (none) | 40.2% |
| Telephone effect | 20.3% |
| Background music | 19.8% |
| Noise injection | 19.7% |

### Duration

- **Min**: 18.5s
- **Max**: 79.8s
- **Average**: 64.1s
- **Total audio**: ~8,900 hours

### Vocal Burst Categories

All **82 categories** from the DACVAE taxonomy are represented, with balanced usage across categories. Each burst is used approximately 62 times across the dataset.

**Label source**: 53.7% Gemini captions, 46.3% original taxonomy prompts.

## Intended Use

This dataset is designed for training and fine-tuning ASR models that need to:

- **Transcribe speech with inline vocal burst annotations** (e.g., `[laughs]`, `[sighs]`)
- **Perform speaker diarization** (identify who is speaking)
- **Generate sentence-level timestamps** (precise start/end times for each segment)
- **Handle noisy/degraded audio** (telephone, background noise, music)
- **Support multilingual transcription** (EN, DE, FR, JA, ZH)

## License

CC-BY-4.0. Attribution required.

### Attribution

- Speech data: [amphion/Emilia-Dataset](https://huggingface.co/datasets/amphion/Emilia-Dataset), [laion/Emolia](https://huggingface.co/datasets/laion/Emolia)
- Vocal bursts: [TTS-AGI/vocal-bursts-taxonomy-DACVAE](https://huggingface.co/datasets/TTS-AGI/vocal-bursts-taxonomy-DACVAE), derived from [krishnakalyan3/vocal_bursts_taxonomy_100_clean_wds](https://huggingface.co/datasets/krishnakalyan3/vocal_bursts_taxonomy_100_clean_wds)
- Background music: [laion/laion-tunes-rpg-music](https://huggingface.co/datasets/laion/laion-tunes-rpg-music)
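
## Sketch: Construction Sampling Rules

The probabilities listed under Construction Pipeline and Vocal Burst Labeling can be illustrated with a small sampler. This is a minimal sketch, not the generation code used for the dataset: the function names are hypothetical, the identifiers `noise`, `music`, and `clean` are assumed (only `telephone` appears in the JSON example), and a uniform draw for the ±10% speed change is an assumption because the card does not state the distribution.

```python
import random

# Assumed identifiers; only "telephone" appears in the card's JSON example.
AUGMENTATION_WEIGHTS = {"telephone": 0.20, "noise": 0.20, "music": 0.20, "clean": 0.40}
BURST_PROBABILITY = 0.33  # per slot, as stated in the card


def pick_augmentation(rng: random.Random) -> str:
    """Draw exactly one global augmentation (the options are mutually exclusive)."""
    names = list(AUGMENTATION_WEIGHTS)
    weights = list(AUGMENTATION_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]


def plan_burst_slots(num_speech_segments: int, rng: random.Random) -> list:
    """Return the slots that receive a vocal burst.

    Slot 0 is the position before the first speech segment; slot i (i >= 1)
    is the gap after speech segment i. Each slot is drawn independently.
    """
    return [slot for slot in range(num_speech_segments)
            if rng.random() < BURST_PROBABILITY]


def burst_speed_factor(rng: random.Random) -> float:
    """Speed factor within +/-10%; the uniform distribution is an assumption."""
    return rng.uniform(0.90, 1.10)


def choose_burst_label(gemini_caption: str, taxonomy_prompt: str,
                       match_score: int, rng: random.Random) -> str:
    """Apply the labeling rule: scores 0 and 1 always use the Gemini caption,
    score 2 uses the caption or the taxonomy prompt with equal probability."""
    if match_score not in (0, 1, 2):
        raise ValueError(f"unexpected match score: {match_score}")
    if match_score < 2:
        return gemini_caption
    return gemini_caption if rng.random() < 0.5 else taxonomy_prompt
```

Passing an explicit `random.Random` instance keeps the draws reproducible, which makes it easier to compare simulated proportions against the Augmentation Distribution table.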
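
## Sketch: Reading a Shard

Because each sample is stored as `{key}.mp3` plus `{key}.json` inside a tar shard, a shard can be read with the Python standard library alone. The sketch below is one possible reader under stated assumptions: `iter_shard` is a hypothetical helper, the key is taken as everything before the last dot of the member name, and the shard file name in the usage note is a placeholder since the card does not give the naming scheme.

```python
import json
import tarfile
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_shard(fileobj: BinaryIO) -> Iterator[Tuple[str, bytes, dict]]:
    """Yield (key, mp3_bytes, metadata) for every complete sample in a shard.

    `fileobj` is a seekable binary file object opened on one tar shard.
    Members other than .mp3 and .json files are ignored.
    """
    pending: Dict[str, dict] = {}  # key -> parts collected so far
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.rpartition(".")
            if ext not in ("mp3", "json"):
                continue
            data = tar.extractfile(member).read()
            entry = pending.setdefault(key, {})
            entry[ext] = data if ext == "mp3" else json.loads(data)
            # Emit the sample once both halves have been seen.
            if "mp3" in entry and "json" in entry:
                del pending[key]
                yield key, entry["mp3"], entry["json"]
```

A typical call would be `with open("shard.tar", "rb") as f: for key, audio, meta in iter_shard(f): ...`, where `shard.tar` stands in for a real shard path. The audio is returned as raw MP3 bytes, so decoding is left to whichever audio library the training pipeline already uses.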
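
## Sketch: Checking JSON Metadata

The fields shown under JSON Metadata Structure are enough to rebuild a transcript string and to sanity-check timestamps. The following is a minimal sketch with hypothetical helper names. It assumes that every speech segment is rendered as `<speaker_label> text`, that every vocal burst is rendered as `[label_used]`, and that segments are listed in time order without overlap. The card's examples are consistent with these assumptions but do not state them outright.

```python
from typing import Dict, List


def rebuild_transcript(segments: List[Dict]) -> str:
    """Join segments into the inline-annotated transcript format."""
    parts = []
    for seg in segments:
        if seg["type"] == "speech":
            parts.append(f"<{seg['speaker_label']}> {seg['text']}")
        elif seg["type"] == "vocal_burst":
            parts.append(f"[{seg['label_used']}]")
    return " ".join(parts)


def timing_problems(segments: List[Dict], tolerance: float = 0.01) -> List[str]:
    """Return human-readable descriptions of timing inconsistencies."""
    problems = []
    previous_end = 0.0
    for index, seg in enumerate(segments):
        # duration should equal end - start, up to rounding in the JSON.
        if abs((seg["end"] - seg["start"]) - seg["duration"]) > tolerance:
            problems.append(f"segment {index}: duration does not match end - start")
        # Assumed invariant: a segment does not start before the previous one ends.
        if seg["start"] < previous_end - tolerance:
            problems.append(f"segment {index}: overlaps the previous segment")
        previous_end = seg["end"]
    return problems


# The two segments from the card's JSON example, reduced to the fields used here.
EXAMPLE_SEGMENTS = [
    {"type": "speech", "speaker_label": "Speaker_1", "text": "Hello world.",
     "start": 0.0, "end": 3.5, "duration": 3.5},
    {"type": "vocal_burst",
     "label_used": "breathy, staccato, high-pitched laugh, moderate intensity",
     "start": 3.5, "end": 11.2, "duration": 7.7},
]
```

The tolerance exists because the timestamps are decimal values stored as floats, so `11.2 - 3.5` does not compare exactly equal to `7.7`.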
Provider:
TTS-AGI