surindersinghssj/gurbani-asr-ctc-aligned

Name: surindersinghssj/gurbani-asr-ctc-aligned
Creator: surindersinghssj
Published: 2026-03-26 15:18:06
License: No description available

Hugging Face | Updated: 2026-03-26 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/surindersinghssj/gurbani-asr-ctc-aligned

Resource description:

---
license: cc-by-4.0
language: pa
tags:
- punjabi
- gurbani
- speech-recognition
- forced-alignment
- kirtan
- speech-to-text
- ctc
- mms
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: segment_id
    dtype: string
  - name: recording_id
    dtype: string
  - name: sentence
    dtype: string
  - name: whisper_text
    dtype: string
  - name: mms_text
    dtype: string
  - name: tuk_index
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  - name: duration
    dtype: float32
  - name: match_score
    dtype: float32
  - name: avg_confidence
    dtype: float32
  - name: ctc_alignment_score
    dtype: float32
  - name: boundary_shift_ms
    dtype: float32
  - name: repetition
    dtype: int32
  - name: partition_type
    dtype: string
  - name: pipeline
    dtype: string
  - name: ang
    dtype: int32
  - name: style_bucket
    dtype: string
  - name: artist_name
    dtype: string
  - name: shabad_id
    dtype: int64
  - name: raag
    dtype: string
  - name: writer
    dtype: string
  splits:
  - name: train
    num_bytes: 84051968
    num_examples: 291
  config_name: default
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
size_categories:
- n<1K
task_categories:
- automatic-speech-recognition
---

# Gurbani ASR CTC-Aligned Dataset

A forced-aligned corpus of Gurbani kirtan audio segments with **CTC-refined boundaries** using Meta's MMS (Massively Multilingual Speech) forced aligner.

This is the CTC-refined companion to [`gurbani-asr-whisper-aligned`](https://huggingface.co/datasets/surindersinghssj/gurbani-asr-whisper-aligned). Both datasets share identical segments and ground-truth labels. The difference is in boundary precision: this dataset refines Whisper's attention-based word boundaries using CTC acoustic alignment, producing tighter audio cuts.

## Dataset Summary

| | |
|---|---|
| **Language** | Punjabi (Gurmukhi script) |
| **Domain** | Gurbani kirtan (devotional singing) |
| **Audio format** | FLAC, 16 kHz, mono |
| **Segment duration** | 1–30 seconds |
| **Source** | SikhNet kirtan tracks with `shabadId` linking to STTM |
| **Ground truth** | STTM database (exact canonical Unicode Gurmukhi) |
| **Alignment** | Whisper large-v2 timestamps → MMS CTC forced alignment refinement |

## How CTC Refinement Works

This dataset starts from the same Whisper timestamp alignment as the companion whisper-aligned dataset, then adds a CTC refinement step:

1. **Whisper alignment** (same as companion dataset): Whisper large-v2 provides word-level timestamps, matched to canonical STTM tuks via matra-normalised F1 scoring
2. **CTC refinement**: For each Whisper-matched segment:
   - Extract audio with ±0.5 s padding around Whisper boundaries
   - Run MMS forced aligner (CTC-based) on the padded audio
   - Compute token-level frame alignments against the canonical text
   - Adjust start/end times based on acoustic CTC boundaries
   - Calculate alignment confidence score and boundary shift
3. **Fallback**: If CTC alignment fails for a segment, Whisper boundaries are kept (pipeline marked as `whisper_v2_gpu+mms_ctc_fallback`)

### Why CTC Refinement?

Whisper's attention-based timestamps can have ±100–200 ms jitter at word boundaries. CTC forced alignment uses frame-level acoustic evidence to snap boundaries to actual speech onset/offset, producing cleaner training segments with less silence padding and fewer cut-off phonemes.

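
The refinement and fallback steps described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual pipeline code: the function and class names are hypothetical, the 20 ms frame duration is an assumption about the aligner's output stride, and storing a shift of 0 for fallback segments is likewise an assumption. The aligner itself is left out; the sketch only shows how per-token frame spans inside the padded window could be mapped back to absolute timestamps and turned into a boundary shift.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

PAD_SEC = 0.5     # padding around Whisper boundaries, as stated in the card
FRAME_SEC = 0.02  # ASSUMPTION: 20 ms per CTC frame; check the aligner's real stride


@dataclass
class RefinedSegment:
    start: float              # seconds, absolute position in the recording
    end: float                # seconds, absolute position in the recording
    boundary_shift_ms: float  # mean of |start shift| and |end shift|
    pipeline: str


def padded_window(w_start: float, w_end: float, audio_len: float) -> Tuple[float, float]:
    """Widen the Whisper segment by PAD_SEC on each side, clamped to the recording."""
    return max(0.0, w_start - PAD_SEC), min(audio_len, w_end + PAD_SEC)


def refine(
    w_start: float,
    w_end: float,
    audio_len: float,
    token_frames: Optional[List[Tuple[int, int]]],
) -> RefinedSegment:
    """Map CTC token frame spans (relative to the padded window) to refined times.

    token_frames holds one (first_frame, last_frame) pair per aligned token,
    or None / empty when the aligner failed on this segment.
    """
    if not token_frames:
        # Fallback: keep Whisper boundaries. A shift of 0.0 is an assumption here.
        return RefinedSegment(w_start, w_end, 0.0, "whisper_v2_gpu+mms_ctc_fallback")

    win_start, win_end = padded_window(w_start, w_end, audio_len)
    first_frame = token_frames[0][0]
    last_frame = token_frames[-1][1]

    start = win_start + first_frame * FRAME_SEC
    end = min(win_end, win_start + (last_frame + 1) * FRAME_SEC)

    shift_ms = 1000.0 * (abs(start - w_start) + abs(end - w_end)) / 2.0
    return RefinedSegment(start, end, shift_ms, "whisper_v2_gpu+mms_ctc")
```

For example, with Whisper boundaries of 10.0 s and 12.0 s and aligned tokens spanning frames 30 to 120 of the padded window, `refine(10.0, 12.0, 60.0, [(30, 40), (41, 120)])` moves the cut to roughly 10.1–11.92 s, a mean shift of about 90 ms.
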
## Features

All features from the [whisper-aligned dataset](https://huggingface.co/datasets/surindersinghssj/gurbani-asr-whisper-aligned), plus:

| Feature | Type | Description |
|---------|------|-------------|
| `audio` | Audio | FLAC segment with CTC-refined boundaries (16 kHz, mono) |
| `segment_id` | string | `{recording_id}_{sequence:04d}` |
| `recording_id` | string | MD5 hash of source URL (first 16 chars) |
| `sentence` | string | **Ground truth**: canonical Gurmukhi from STTM |
| `whisper_text` | string | Raw Whisper transcription (reference only) |
| `mms_text` | string | Decoded text from MMS alignment (approximate for Gurmukhi) |
| `tuk_index` | int32 | Index of canonical line in source shabad |
| `start` | float32 | CTC-refined segment start time (seconds) |
| `end` | float32 | CTC-refined segment end time (seconds) |
| `duration` | float32 | Segment duration (seconds) |
| `match_score` | float32 | F1 score of Whisper words vs canonical text (0.5–1.0) |
| `avg_confidence` | float32 | Mean Whisper word confidence (0.0–1.0) |
| `ctc_alignment_score` | float32 | Token alignment confidence from MMS CTC |
| `boundary_shift_ms` | float32 | Mean absolute shift from Whisper boundaries (ms) |
| `repetition` | int32 | How many times this tuk appeared so far in the recording |
| `partition_type` | string | `"full"`, `"first_half"`, or `"second_half"` |
| `pipeline` | string | `"whisper_v2_gpu+mms_ctc"` or `"whisper_v2_gpu+mms_ctc_fallback"` |
| `ang` | int32 | Page number in Guru Granth Sahib |
| `style_bucket` | string | Recording style (hazoori, puratan, akj, taksali, live, mixed) |
| `artist_name` | string | Kirtani name |
| `shabad_id` | int64 | STTM database shabad ID |
| `raag` | string | Raag (melodic framework) |
| `writer` | string | Composer/saint credited with the verse |

## CTC-Specific Quality Metrics

**CTC Alignment Score:** Token-level alignment confidence from MMS forced aligner.
- Higher values indicate stronger acoustic evidence for the boundary placement

**Boundary Shift (ms):** Mean absolute difference between the Whisper and CTC boundaries, averaged over the start shift and the end shift.
- < 100 ms: CTC confirms Whisper boundaries (good agreement)
- 100–300 ms: Moderate correction (typical for word-boundary refinement)
- > 300 ms: Large correction (may indicate a Whisper boundary error or a CTC quirk; inspect manually)

**MMS Text:** Decoded text from MMS CTC alignment. Since MMS was primarily trained on Latin scripts, its Gurmukhi decoding is approximate. Use `sentence` (STTM canonical) as the true label; `mms_text` is for diagnostics only.

## Companion Dataset

The Whisper-aligned variant (identical segments, Whisper attention boundaries instead of CTC-refined) is at [`surindersinghssj/gurbani-asr-whisper-aligned`](https://huggingface.co/datasets/surindersinghssj/gurbani-asr-whisper-aligned).

## Intended Use

- Fine-tuning Whisper or similar ASR models with tighter acoustic boundaries
- Comparing CTC vs attention-based alignment for Gurbani ASR training
- Curriculum learning (clean studio → noisy live gurdwara audio)
- Vocabulary-constrained Gurbani decoding (STTM as closed vocabulary)

## Attribution

- **Audio**: [SikhNet](https://play.sikhnet.com) (original artists retain copyright)
- **Text labels**: [SikhiToTheMax](https://www.sikhitothemax.org/) database (public domain)
- **MMS model**: Meta AI (Massively Multilingual Speech)
- **Alignment pipeline**: Surt (Gurbani ASR v3)

## License

CC-BY-4.0. Audio sourced from SikhNet for non-commercial research and education.
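
As an illustration of the boundary-shift bands listed under CTC-Specific Quality Metrics, the sketch below classifies a row by its `boundary_shift_ms` and decides whether to keep it for training. The band names, the helper names, and the policy of dropping large shifts and fallback segments are assumptions made for this example, not recommendations from the dataset authors; only the 100 ms and 300 ms thresholds and the `pipeline` values come from the card.

```python
from typing import Any, Dict


def shift_band(boundary_shift_ms: float) -> str:
    """Classify a boundary shift using the thresholds from the dataset card."""
    if boundary_shift_ms < 100.0:
        return "agreement"  # CTC confirms the Whisper boundaries
    if boundary_shift_ms <= 300.0:
        return "moderate"   # typical word-boundary correction
    return "large"          # worth inspecting manually


def keep_for_training(row: Dict[str, Any], allow_fallback: bool = False) -> bool:
    """Hypothetical filter: drop large shifts and, optionally, fallback segments."""
    if not allow_fallback and row["pipeline"].endswith("_fallback"):
        return False
    return shift_band(row["boundary_shift_ms"]) != "large"


# Made-up rows for demonstration only; these are not values from the dataset.
example_rows = [
    {"boundary_shift_ms": 42.0, "pipeline": "whisper_v2_gpu+mms_ctc"},
    {"boundary_shift_ms": 180.0, "pipeline": "whisper_v2_gpu+mms_ctc"},
    {"boundary_shift_ms": 410.0, "pipeline": "whisper_v2_gpu+mms_ctc"},
    {"boundary_shift_ms": 0.0, "pipeline": "whisper_v2_gpu+mms_ctc_fallback"},
]
kept = [row for row in example_rows if keep_for_training(row)]
```

With the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter` after loading the train split, for instance `ds.filter(keep_for_training)`.
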

Provided by:

surindersinghssj
