
surindersinghssj/gurbani-asr-whisper-aligned

Hugging Face · Updated 2026-03-26 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/surindersinghssj/gurbani-asr-whisper-aligned
Description:
---
license: cc-by-4.0
language: pa
tags:
- punjabi
- gurbani
- speech-recognition
- forced-alignment
- kirtan
- speech-to-text
- whisper
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: segment_id
    dtype: string
  - name: recording_id
    dtype: string
  - name: sentence
    dtype: string
  - name: whisper_text
    dtype: string
  - name: tuk_index
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  - name: duration
    dtype: float32
  - name: match_score
    dtype: float32
  - name: avg_confidence
    dtype: float32
  - name: repetition
    dtype: int32
  - name: partition_type
    dtype: string
  - name: pipeline
    dtype: string
  - name: ang
    dtype: int32
  - name: style_bucket
    dtype: string
  - name: artist_name
    dtype: string
  - name: shabad_id
    dtype: int64
  - name: raag
    dtype: string
  - name: writer
    dtype: string
  splits:
  - name: train
    num_bytes: 84051968
    num_examples: 291
  config_name: default
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
size_categories:
- n<1K
task_categories:
- automatic-speech-recognition
---

# Gurbani ASR Whisper-Aligned Dataset

A forced-aligned corpus of Gurbani kirtan audio segments paired with canonical ground-truth Gurmukhi text from the SikhiToTheMax (STTM) database. Built for training low-resource Gurbani speech recognition models.

Whisper large-v2 provides word-level timestamps; canonical STTM text provides ground-truth labels. Whisper is never used as a transcriber, only as a timestamp oracle.

## Dataset Summary

| | |
|---|---|
| **Language** | Punjabi (Gurmukhi script) |
| **Domain** | Gurbani kirtan (devotional singing) |
| **Audio format** | FLAC, 16 kHz, mono |
| **Segment duration** | 1–30 seconds |
| **Source** | SikhNet kirtan tracks with `shabadId` linking to STTM |
| **Ground truth** | STTM database: exact canonical Unicode Gurmukhi |
| **Alignment** | Whisper large-v2 word timestamps + matra-normalised F1 matching |

## How the Data Was Created

### 1. Source Selection

Kirtan tracks were scraped from SikhNet's artist APIs.
Only tracks with a `shabadId` field (linking to the STTM database) were included, filtered by:

- Duration: 2–90 minutes
- Gurmukhi content ratio: ≥80%
- Exclusion of katha, akhand path, and non-kirtan content

### 2. Canonical Text Lookup

Each track's `shabadId` maps to exact canonical Gurbani text from the local STTM `database.sqlite`. This provides the ground-truth training labels: shabad lines (tuks), ang (page number), raag, and writer attribution.

### 3. Forced Alignment

**Stage 1: Whisper timestamp extraction**

- Model: Whisper large-v2 with `word_timestamps=True`, `language="pa"`, `beam_size=3`
- Output: word-level timestamps `{word, start, end, probability}`

**Stage 2: Canonical text matching**

1. **Matra stripping**: Gurmukhi vowel signs (ਾਿੀੁੂੇੈੋੌੰੱ) are removed to compare consonant skeletons
2. **Vishram splitting**: Lines with primary vishram (`;`) are expanded into three match targets (full line, first half, second half) to handle kirtanis singing each half separately
3. **Forward-scan F1**: A sliding window over Whisper words computes F1 overlap with each canonical tuk partition
4. **Thresholds**: Match score ≥ 0.5, segment duration 1–30s, avg word confidence ≥ 0.3
5. **Repetition tracking**: Same tuk can match multiple times (captures kirtan repetitions)

### 4. Segmentation

Audio is extracted at Whisper-determined word boundaries and saved as FLAC (16 kHz, mono). The training label is always the canonical STTM text.
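
The Stage 2 matching steps can be illustrated with a minimal sketch. This is not the pipeline's actual code: the function names, the bag-of-words F1 definition, and the ±1-word window tolerance are all assumptions made for illustration. Only the matra list, the `;` vishram marker, and the three partition names come from the description above.

```python
from collections import Counter

# Signs listed in the dataset card; removing them leaves the consonant skeleton.
MATRAS = set("ਾਿੀੁੂੇੈੋੌੰੱ")


def strip_matras(word: str) -> str:
    """Drop the listed Gurmukhi signs from a single word."""
    return "".join(ch for ch in word if ch not in MATRAS)


def skeletons(text: str) -> list[str]:
    """Whitespace-tokenise, strip matras, and drop tokens that become empty."""
    return [s for s in (strip_matras(w) for w in text.split()) if s]


def vishram_targets(line: str) -> dict[str, str]:
    """Expand a line with a primary vishram (';') into three match targets."""
    if ";" not in line:
        return {"full": line}
    first, second = line.split(";", 1)
    return {
        "full": line.replace(";", " "),
        "first_half": first.strip(),
        "second_half": second.strip(),
    }


def f1(pred: list[str], gold: list[str]) -> float:
    """Bag-of-words F1 between two token lists (assumed definition)."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_window(whisper_words: list[str], target: str, start: int = 0):
    """Scan forward from `start`; return (score, begin, end) of the best window."""
    gold = skeletons(target)
    pred = [strip_matras(w) for w in whisper_words]
    best = (0.0, start, start)
    # Assumed tolerance: windows one word shorter or longer than the target.
    for size in range(max(1, len(gold) - 1), len(gold) + 2):
        for i in range(start, len(pred) - size + 1):
            score = f1(pred[i:i + size], gold)
            if score > best[0]:
                best = (score, i, i + size)
    return best


# Whisper output with an extra leading word and slightly different vowel signs.
words = ["ਵਾਹਿਗੁਰੂ", "ਸਤ", "ਨਾਮ", "ਕਰਤਾ", "ਪੁਰਖ"]
score, begin, end = best_window(words, "ਸਤਿ ਨਾਮੁ ਕਰਤਾ ਪੁਰਖੁ")
```

In this sketch a window would be kept only if `score` meets the 0.5 threshold. The `begin` and `end` indices would then select the Whisper word timestamps that bound the audio segment.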
## Features

| Feature | Type | Description |
|---------|------|-------------|
| `audio` | Audio | FLAC-encoded kirtan segment (16 kHz, mono) |
| `segment_id` | string | `{recording_id}_{sequence:04d}` |
| `recording_id` | string | MD5 hash of source URL (first 16 chars) |
| `sentence` | string | **Ground truth**: canonical Gurmukhi from STTM |
| `whisper_text` | string | Raw Whisper transcription (reference only) |
| `tuk_index` | int32 | Index of canonical line in source shabad |
| `start` | float32 | Segment start time (seconds) |
| `end` | float32 | Segment end time (seconds) |
| `duration` | float32 | Segment duration (seconds) |
| `match_score` | float32 | F1 score of Whisper words vs canonical text (0.5–1.0) |
| `avg_confidence` | float32 | Mean Whisper word confidence (0.0–1.0) |
| `repetition` | int32 | How many times this tuk appeared so far in the recording |
| `partition_type` | string | `"full"`, `"first_half"`, or `"second_half"` |
| `pipeline` | string | `"whisper_v2_gpu"` |
| `ang` | int32 | Page number in Guru Granth Sahib |
| `style_bucket` | string | Recording style (hazoori, puratan, akj, taksali, live, mixed) |
| `artist_name` | string | Kirtani name |
| `shabad_id` | int64 | STTM database shabad ID |
| `raag` | string | Raag (melodic framework) |
| `writer` | string | Composer/saint credited with the verse |

## Quality Metrics

**Match Score (F1):** Overlap between Whisper-detected words and canonical STTM text after matra stripping.

- ≥ 0.8: high confidence match
- 0.6–0.8: good match
- 0.5–0.6: acceptable but noisy

**Avg Confidence:** Mean Whisper word-level probability.

- ≥ 0.7: Whisper confident
- 0.5–0.7: moderate
- < 0.5: Whisper uncertain (noisy audio)

## Why Ground-Truth STTM Text?

Using Whisper's own transcription as training labels creates a closed loop where errors propagate. Instead, this dataset uses:

1. **STTM database** as the single source of truth for Gurbani text
2. **Whisper timestamps only** to locate where each word occurs in the audio
3. **Matra-normalised fuzzy matching** to handle Whisper's imperfect transcription

Even if Whisper misidentifies words, the canonical label remains correct.

## Intended Use

- Fine-tuning Whisper or similar ASR models on authentic kirtan audio
- Curriculum learning (clean studio recordings → noisy live gurdwara audio)
- Vocabulary-constrained Gurbani decoding (STTM as closed vocabulary)
- Shabad search: live audio → matched shabad via BM25 index

## Companion Dataset

A CTC-refined variant using Meta's MMS forced aligner is available at [`surindersinghssj/gurbani-asr-ctc-aligned`](https://huggingface.co/datasets/surindersinghssj/gurbani-asr-ctc-aligned), with additional columns for CTC alignment scores and boundary shift metrics.

## Attribution

- **Audio**: [SikhNet](https://play.sikhnet.com); original artists retain copyright
- **Text labels**: [SikhiToTheMax](https://www.sikhitothemaax.org/) database (public domain)
- **Alignment pipeline**: Surt (Gurbani ASR v3)

## License

CC-BY-4.0. Audio sourced from SikhNet for non-commercial research and education.
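
## Example: Selecting Rows by Quality

The bands described under Quality Metrics can be turned into a simple row filter, for example to pick a cleaner subset for early fine-tuning. The sketch below works on plain dictionaries carrying the `match_score` and `avg_confidence` columns. The helper names, the band labels, and the default cut-offs in `keep` are illustrative assumptions, not part of the dataset or its pipeline.

```python
def match_band(match_score: float) -> str:
    """Map an F1 match score onto the bands listed under Quality Metrics."""
    if match_score >= 0.8:
        return "high"
    if match_score >= 0.6:
        return "good"
    if match_score >= 0.5:
        return "acceptable"
    return "below_threshold"  # should not occur: the pipeline keeps scores >= 0.5


def confidence_band(avg_confidence: float) -> str:
    """Map mean Whisper word probability onto the documented bands."""
    if avg_confidence >= 0.7:
        return "confident"
    if avg_confidence >= 0.5:
        return "moderate"
    return "uncertain"


def keep(row: dict, min_match: float = 0.8, min_conf: float = 0.7) -> bool:
    """Return True for rows that meet both (assumed) cut-offs."""
    return row["match_score"] >= min_match and row["avg_confidence"] >= min_conf


# Hypothetical rows using the dataset's column names.
rows = [
    {"segment_id": "a_0001", "match_score": 0.92, "avg_confidence": 0.81},
    {"segment_id": "a_0002", "match_score": 0.55, "avg_confidence": 0.44},
    {"segment_id": "a_0003", "match_score": 0.85, "avg_confidence": 0.62},
]
clean = [r["segment_id"] for r in rows if keep(r)]
```

The same predicate could be passed to `Dataset.filter` from the Hugging Face `datasets` library after loading the `train` split.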
Provider:
surindersinghssj