surindersinghssj/gurbani-asr-ctc-aligned
Source: Hugging Face | Updated: 2026-03-26 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/surindersinghssj/gurbani-asr-ctc-aligned
Description:
---
license: cc-by-4.0
language: pa
tags:
- punjabi
- gurbani
- speech-recognition
- forced-alignment
- kirtan
- speech-to-text
- ctc
- mms
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: segment_id
    dtype: string
  - name: recording_id
    dtype: string
  - name: sentence
    dtype: string
  - name: whisper_text
    dtype: string
  - name: mms_text
    dtype: string
  - name: tuk_index
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  - name: duration
    dtype: float32
  - name: match_score
    dtype: float32
  - name: avg_confidence
    dtype: float32
  - name: ctc_alignment_score
    dtype: float32
  - name: boundary_shift_ms
    dtype: float32
  - name: repetition
    dtype: int32
  - name: partition_type
    dtype: string
  - name: pipeline
    dtype: string
  - name: ang
    dtype: int32
  - name: style_bucket
    dtype: string
  - name: artist_name
    dtype: string
  - name: shabad_id
    dtype: int64
  - name: raag
    dtype: string
  - name: writer
    dtype: string
  splits:
  - name: train
    num_bytes: 84051968
    num_examples: 291
  config_name: default
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
size_categories:
- n<1K
task_categories:
- automatic-speech-recognition
---
# Gurbani ASR CTC-Aligned Dataset
A forced-aligned corpus of Gurbani kirtan audio segments with **CTC-refined boundaries**, produced with Meta's MMS (Massively Multilingual Speech) forced aligner. This is the CTC-refined companion to [`gurbani-asr-whisper-aligned`](https://huggingface.co/datasets/surindersinghssj/gurbani-asr-whisper-aligned).
Both datasets share identical segments and ground-truth labels. The difference is in boundary precision: this dataset refines Whisper's attention-based word boundaries using CTC acoustic alignment, producing tighter audio cuts.
## Dataset Summary
| | |
|---|---|
| **Language** | Punjabi (Gurmukhi script) |
| **Domain** | Gurbani kirtan (devotional singing) |
| **Audio format** | FLAC, 16 kHz, mono |
| **Segment duration** | 1–30 seconds |
| **Source** | SikhNet kirtan tracks with `shabadId` linking to STTM |
| **Ground truth** | STTM database — exact canonical Unicode Gurmukhi |
| **Alignment** | Whisper large-v2 timestamps → MMS CTC forced alignment refinement |
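A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable. The helper name `load_train_split` is illustrative and not part of the dataset; the repository ID and the `train` split come from the card metadata.

```python
REPO_ID = "surindersinghssj/gurbani-asr-ctc-aligned"


def load_train_split(streaming: bool = True):
    """Return the single `train` split declared in the card metadata.

    `datasets` is imported lazily so this module can be imported
    even when the library is not installed.
    """
    from datasets import load_dataset

    # streaming=True avoids downloading all audio before iterating
    return load_dataset(REPO_ID, split="train", streaming=streaming)
```

Iterating over the returned split yields dictionaries keyed by the feature names, for example `row["sentence"]` for the canonical label and `row["audio"]` for the decoded segment.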
## How CTC Refinement Works
This dataset starts from the same Whisper timestamp alignment as the companion whisper-aligned dataset, then adds a CTC refinement step:
1. **Whisper alignment** (same as companion dataset): Whisper large-v2 provides word-level timestamps, matched to canonical STTM tuks via matra-normalised F1 scoring
2. **CTC refinement**: For each Whisper-matched segment:
   - Extract audio with ±0.5 s padding around Whisper boundaries
   - Run MMS forced aligner (CTC-based) on the padded audio
   - Compute token-level frame alignments against the canonical text
   - Adjust start/end times based on acoustic CTC boundaries
   - Calculate alignment confidence score and boundary shift
3. **Fallback**: If CTC alignment fails for a segment, Whisper boundaries are kept (pipeline marked as `whisper_v2_gpu+mms_ctc_fallback`)
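The refinement and fallback steps above can be sketched roughly as follows. This is not the pipeline's actual code: the `align` callable is a hypothetical stand-in for the MMS forced aligner, assumed to return per-token `(start, end, score)` spans relative to the padded clip, and the values stored for the score and shift on fallback are placeholders because the card does not document them.

```python
from statistics import mean
from typing import Callable, Optional, Sequence, Tuple

PAD_S = 0.5  # padding around the Whisper boundaries, as described above

# Hypothetical aligner interface (an assumption, not the MMS API):
# (clip_start_s, clip_end_s, text) -> [(token_start_s, token_end_s, score), ...]
# with times relative to the clip start, or None when alignment fails.
TokenSpan = Tuple[float, float, float]
Aligner = Callable[[float, float, str], Optional[Sequence[TokenSpan]]]


def refine_segment(w_start: float, w_end: float, text: str,
                   align: Aligner, recording_len_s: float) -> dict:
    """Refine one Whisper-matched segment, keeping Whisper times on failure."""
    # Pad the Whisper window, clamped to the recording
    clip_start = max(0.0, w_start - PAD_S)
    clip_end = min(recording_len_s, w_end + PAD_S)

    spans = align(clip_start, clip_end, text)
    if not spans:
        # Fallback: keep Whisper boundaries. None and 0.0 are placeholders;
        # the values the dataset stores in this case are not documented.
        return {"start": w_start, "end": w_end,
                "ctc_alignment_score": None, "boundary_shift_ms": 0.0,
                "pipeline": "whisper_v2_gpu+mms_ctc_fallback"}

    start = clip_start + spans[0][0]   # onset of the first aligned token
    end = clip_start + spans[-1][1]    # offset of the last aligned token
    shift_ms = 1000.0 * (abs(start - w_start) + abs(end - w_end)) / 2.0
    return {"start": start, "end": end,
            "ctc_alignment_score": mean(s for _, _, s in spans),
            "boundary_shift_ms": shift_ms,
            "pipeline": "whisper_v2_gpu+mms_ctc"}
```

Passing the aligner in as a callable keeps the boundary arithmetic testable without loading a model.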
### Why CTC Refinement?
Whisper's attention-based timestamps can have ±100–200 ms of jitter at word boundaries. CTC forced alignment uses frame-level acoustic evidence to snap boundaries to the actual speech onset and offset, producing cleaner training segments with less silence padding and fewer cut-off phonemes.
## Features
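This dataset includes every feature from the [whisper-aligned dataset](https://huggingface.co/datasets/surindersinghssj/gurbani-asr-whisper-aligned) plus the CTC-specific fields `mms_text`, `ctc_alignment_score`, and `boundary_shift_ms`. The full schema: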
| Feature | Type | Description |
|---------|------|-------------|
| `audio` | Audio | FLAC segment with CTC-refined boundaries (16 kHz, mono) |
| `segment_id` | string | `{recording_id}_{sequence:04d}` |
| `recording_id` | string | MD5 hash of source URL (first 16 chars) |
| `sentence` | string | **Ground truth** — canonical Gurmukhi from STTM |
| `whisper_text` | string | Raw Whisper transcription (reference only) |
| `mms_text` | string | Decoded text from MMS alignment (approximate for Gurmukhi) |
| `tuk_index` | int32 | Index of canonical line in source shabad |
| `start` | float32 | CTC-refined segment start time (seconds) |
| `end` | float32 | CTC-refined segment end time (seconds) |
| `duration` | float32 | Segment duration (seconds) |
| `match_score` | float32 | F1 score of Whisper words vs canonical text (0.5–1.0) |
| `avg_confidence` | float32 | Mean Whisper word confidence (0.0–1.0) |
| `ctc_alignment_score` | float32 | Token alignment confidence from MMS CTC |
| `boundary_shift_ms` | float32 | Mean absolute shift from Whisper boundaries (ms) |
| `repetition` | int32 | Number of times this tuk has appeared so far in the recording |
| `partition_type` | string | `"full"`, `"first_half"`, or `"second_half"` |
| `pipeline` | string | `"whisper_v2_gpu+mms_ctc"` or `"whisper_v2_gpu+mms_ctc_fallback"` |
| `ang` | int32 | Page number in Guru Granth Sahib |
| `style_bucket` | string | Recording style (hazoori, puratan, akj, taksali, live, mixed) |
| `artist_name` | string | Kirtani name |
| `shabad_id` | int64 | STTM database shabad ID |
| `raag` | string | Raag (melodic framework) |
| `writer` | string | Composer/saint credited with the verse |
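The two identifier formats in the table can be illustrated as below. This is a sketch under assumptions the card does not state: the URL is hashed as UTF-8, "first 16 chars" means the first 16 characters of the hexadecimal digest, and the starting value of the sequence counter is unknown. The function names and the example URL are hypothetical.

```python
import hashlib
from typing import Tuple


def make_recording_id(source_url: str) -> str:
    """First 16 hex characters of the MD5 digest of the source URL."""
    return hashlib.md5(source_url.encode("utf-8")).hexdigest()[:16]


def make_segment_id(recording_id: str, sequence: int) -> str:
    """Format `{recording_id}_{sequence:04d}` as listed in the table."""
    return f"{recording_id}_{sequence:04d}"


def split_segment_id(segment_id: str) -> Tuple[str, int]:
    """Recover the recording ID and sequence number from a segment ID."""
    recording_id, sequence = segment_id.rsplit("_", 1)
    return recording_id, int(sequence)


# Hypothetical URL, for illustration only
rec_id = make_recording_id("https://example.org/kirtan/track.mp3")
seg_id = make_segment_id(rec_id, 7)
```

Splitting on the last underscore groups segments by recording, which helps keep all segments of one recording on the same side of a train/validation split.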
## CTC-Specific Quality Metrics
**CTC Alignment Score:** Token-level alignment confidence from MMS forced aligner.
- Higher values indicate stronger acoustic evidence for the boundary placement
**Boundary Shift (ms):** Mean absolute difference between the Whisper and CTC boundaries, i.e. the average of the absolute start shift and the absolute end shift.
- < 100 ms: CTC confirms Whisper boundaries (good agreement)
- 100–300 ms: Moderate correction (typical for word-boundary refinement)
- > 300 ms: Large correction (may indicate Whisper boundary error or CTC quirk — inspect manually)
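As a worked illustration of the shift definition and the bands above, here is a small sketch. The function names and band labels are made up for this example, and treating exactly 100 ms and 300 ms as "moderate" follows the ranges as written in the list.

```python
def boundary_shift_ms(whisper_start: float, whisper_end: float,
                      ctc_start: float, ctc_end: float) -> float:
    """Average of the absolute start and end shifts, in milliseconds.

    All inputs are times in seconds.
    """
    start_shift = abs(ctc_start - whisper_start)
    end_shift = abs(ctc_end - whisper_end)
    return 1000.0 * (start_shift + end_shift) / 2.0


def shift_band(shift_ms: float) -> str:
    """Map a shift to the three interpretation bands listed above."""
    if shift_ms < 100.0:
        return "agreement"   # CTC confirms the Whisper boundaries
    if shift_ms <= 300.0:
        return "moderate"    # typical word-boundary refinement
    return "large"           # worth inspecting manually


# Start moved by 50 ms and end by 100 ms: the mean shift is about 75 ms
example = boundary_shift_ms(10.0, 12.0, 10.05, 11.9)
```

Here `shift_band(example)` falls in the agreement band, since the mean shift is below 100 ms.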
**MMS Text:** Decoded text from MMS CTC alignment. Since MMS was primarily trained on Latin scripts, its Gurmukhi decoding is approximate. Use `sentence` (STTM canonical) as the true label — `mms_text` is for diagnostics only.
## Companion Dataset
The Whisper-aligned variant (identical segments, Whisper attention boundaries instead of CTC-refined) is at [`surindersinghssj/gurbani-asr-whisper-aligned`](https://huggingface.co/datasets/surindersinghssj/gurbani-asr-whisper-aligned).
## Intended Use
- Fine-tuning Whisper or similar ASR models with tighter acoustic boundaries
- Comparing CTC vs attention-based alignment for Gurbani ASR training
- Curriculum learning (clean studio → noisy live gurdwara audio)
- Vocabulary-constrained Gurbani decoding (STTM as closed vocabulary)
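For the fine-tuning use case, one possible way to select segments from the quality fields is sketched below. The thresholds (`min_match`, `max_shift_ms`) and the choice to drop fallback rows are illustrative assumptions, not recommendations from the dataset authors.

```python
CTC_PIPELINE = "whisper_v2_gpu+mms_ctc"


def keep_for_training(row: dict, min_match: float = 0.8,
                      max_shift_ms: float = 300.0,
                      require_ctc: bool = True) -> bool:
    """Return True if a row passes the (hypothetical) quality thresholds."""
    if require_ctc and row["pipeline"] != CTC_PIPELINE:
        return False  # CTC failed for this row; boundaries are Whisper's
    return (row["match_score"] >= min_match
            and row["boundary_shift_ms"] <= max_shift_ms)


# Toy rows with only the fields the filter reads
rows = [
    {"pipeline": CTC_PIPELINE, "match_score": 0.95, "boundary_shift_ms": 80.0},
    {"pipeline": CTC_PIPELINE, "match_score": 0.60, "boundary_shift_ms": 80.0},
    {"pipeline": CTC_PIPELINE, "match_score": 0.95, "boundary_shift_ms": 450.0},
    {"pipeline": "whisper_v2_gpu+mms_ctc_fallback",
     "match_score": 0.95, "boundary_shift_ms": 0.0},
]
selected = [r for r in rows if keep_for_training(r)]
```

With a dataset loaded through the `datasets` library, the same predicate can be passed to `Dataset.filter`.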
## Attribution
- **Audio**: [SikhNet](https://play.sikhnet.com) — original artists retain copyright
- **Text labels**: [SikhiToTheMax](https://www.sikhitothemaax.org/) database (public domain)
- **MMS model**: Meta AI (Massively Multilingual Speech)
- **Alignment pipeline**: Surt (Gurbani ASR v3)
## License
CC-BY-4.0. Audio sourced from SikhNet for non-commercial research and education.
Provided by:
surindersinghssj



