surindersinghssj/gurbani-asr-whisper-aligned
Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/surindersinghssj/gurbani-asr-whisper-aligned
Description:
---
license: cc-by-4.0
language: pa
tags:
- punjabi
- gurbani
- speech-recognition
- forced-alignment
- kirtan
- speech-to-text
- whisper
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: segment_id
    dtype: string
  - name: recording_id
    dtype: string
  - name: sentence
    dtype: string
  - name: whisper_text
    dtype: string
  - name: tuk_index
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  - name: duration
    dtype: float32
  - name: match_score
    dtype: float32
  - name: avg_confidence
    dtype: float32
  - name: repetition
    dtype: int32
  - name: partition_type
    dtype: string
  - name: pipeline
    dtype: string
  - name: ang
    dtype: int32
  - name: style_bucket
    dtype: string
  - name: artist_name
    dtype: string
  - name: shabad_id
    dtype: int64
  - name: raag
    dtype: string
  - name: writer
    dtype: string
  splits:
  - name: train
    num_bytes: 84051968
    num_examples: 291
  config_name: default
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
size_categories:
- n<1K
task_categories:
- automatic-speech-recognition
---
# Gurbani ASR Whisper-Aligned Dataset
A forced-aligned corpus of Gurbani kirtan audio segments paired with canonical ground-truth Gurmukhi text from the SikhiToTheMax (STTM) database. Built for training low-resource Gurbani speech recognition models.
Whisper large-v2 provides word-level timestamps; canonical STTM text provides the ground-truth labels. Whisper's own transcription is never used as a training label; it serves only as a timestamp oracle (the raw output is kept in `whisper_text` for reference).
## Dataset Summary
| | |
|---|---|
| **Language** | Punjabi (Gurmukhi script) |
| **Domain** | Gurbani kirtan (devotional singing) |
| **Audio format** | FLAC, 16 kHz, mono |
| **Segment duration** | 1–30 seconds |
| **Source** | SikhNet kirtan tracks with `shabadId` linking to STTM |
| **Ground truth** | STTM database — exact canonical Unicode Gurmukhi |
| **Alignment** | Whisper large-v2 word timestamps + matra-normalised F1 matching |
## How the Data Was Created
### 1. Source Selection
Kirtan tracks were scraped from SikhNet's artist APIs. Only tracks with a `shabadId` field (linking to the STTM database) were included, filtered by:
- Duration: 2–90 minutes
- Gurmukhi content ratio: ≥80%
- Exclusion of katha, akhand path, and non-kirtan content
### 2. Canonical Text Lookup
Each track's `shabadId` maps to exact canonical Gurbani text from the local STTM `database.sqlite`. This provides the ground-truth training labels — shabad lines (tuks), ang (page number), raag, and writer attribution.
### 3. Forced Alignment
**Stage 1 — Whisper timestamp extraction:**
- Model: Whisper large-v2 with `word_timestamps=True`, `language="pa"`, `beam_size=3`
- Output: word-level timestamps `{word, start, end, probability}`
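A minimal sketch of Stage 1, assuming the reference `openai-whisper` package (the card does not say which Whisper implementation was used). `flatten_words` and `transcribe_words` are hypothetical helper names; only the three decoding options come from the description above.

```python
from typing import Any, Dict, List


def flatten_words(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect {word, start, end, probability} entries from a Whisper-style result."""
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append({
                "word": w["word"].strip(),  # Whisper words carry leading spaces
                "start": float(w["start"]),
                "end": float(w["end"]),
                "probability": float(w["probability"]),
            })
    return words


def transcribe_words(audio_path: str) -> List[Dict[str, Any]]:
    """Run Whisper large-v2 and return word timestamps.

    Assumption: the `openai-whisper` package is installed. It is imported
    lazily so that `flatten_words` stays usable without it.
    """
    import whisper

    model = whisper.load_model("large-v2")
    result = model.transcribe(
        audio_path, word_timestamps=True, language="pa", beam_size=3
    )
    return flatten_words(result)
```

Calling `transcribe_words("track.mp3")` on a local file would yield one flat list of word entries for the whole recording, which is the shape Stage 2 consumes.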
**Stage 2 — Canonical text matching:**
1. **Matra stripping**: Gurmukhi vowel signs and the tippi and addak marks (ਾਿੀੁੂੇੈੋੌੰੱ) are removed so that only consonant skeletons are compared
2. **Vishram splitting**: Lines with primary vishram (`;`) are expanded into three match targets — full line, first half, second half — to handle kirtanis singing each half separately
3. **Forward-scan F1**: A sliding window over Whisper words computes F1 overlap with each canonical tuk partition
4. **Thresholds**: Match score ≥ 0.5, segment duration 1–30s, avg word confidence ≥ 0.3
5. **Repetition tracking**: Same tuk can match multiple times (captures kirtan repetitions)
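The matching steps above can be sketched as follows. This is an illustration under assumptions, not the pipeline's actual code: the F1 here is a bag-of-words overlap, the window length equals the number of words in the target, and all function names are hypothetical. Repetition tracking is left out.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# The eleven signs listed in step 1, written as code points:
# nine vowel signs plus tippi (U+0A70) and addak (U+0A71).
MATRAS = set(
    "\u0a3e\u0a3f\u0a40\u0a41\u0a42\u0a47\u0a48\u0a4b\u0a4c\u0a70\u0a71"
)


def strip_matras(word: str) -> str:
    """Reduce a Gurmukhi word to its consonant skeleton."""
    return "".join(ch for ch in word if ch not in MATRAS)


def vishram_partitions(line: str) -> Dict[str, str]:
    """Expand a tuk into match targets keyed by partition type."""
    full = " ".join(line.replace(";", " ").split())
    if ";" not in line:
        return {"full": full}
    first, second = line.split(";", 1)
    return {
        "full": full,
        "first_half": " ".join(first.split()),
        "second_half": " ".join(second.replace(";", " ").split()),
    }


def f1_overlap(hyp: List[str], ref: List[str]) -> float:
    """Bag-of-words F1 between two token lists (an assumed definition)."""
    if not hyp or not ref:
        return 0.0
    common = sum((Counter(hyp) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(hyp)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def forward_scan(
    whisper_words: List[str],
    target: str,
    start_at: int = 0,
    threshold: float = 0.5,
) -> Optional[Tuple[float, int, int]]:
    """Slide a window over Whisper words; return (score, begin, end) or None.

    `begin` and `end` index into `whisper_words`, so they map back to the
    word timestamps from Stage 1.
    """
    ref = [strip_matras(w) for w in target.split()]
    hyp = [strip_matras(w) for w in whisper_words]
    n = len(ref)
    best = None
    for i in range(start_at, max(start_at, len(hyp) - n) + 1):
        score = f1_overlap(hyp[i:i + n], ref)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, i, min(i + n, len(hyp)))
    return best
```

For example, with an illustrative vishram placement, `vishram_partitions("ਸਤਿ ਨਾਮੁ; ਕਰਤਾ ਪੁਰਖੁ")` yields three targets, and scanning each against the Whisper word list gives one candidate span per partition type. Because matching runs on skeletons, a Whisper output of `ਨਾਮ` still matches the canonical `ਨਾਮੁ`.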
### 4. Segmentation
Audio is extracted at Whisper-determined word boundaries and saved as FLAC (16 kHz, mono). The training label is always the canonical STTM text.
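As a rough sketch of the extraction step, assuming the source track has already been decoded to a 16 kHz mono sample array: `cut_segment` converts the matched start and end times to sample indices and applies the 1–30 second duration limit. Both function names are hypothetical, and writing FLAC through the `soundfile` package is an assumption about tooling.

```python
from typing import Optional, Sequence

SAMPLE_RATE = 16_000  # Hz, per the dataset summary


def cut_segment(
    samples: Sequence,
    start: float,
    end: float,
    sample_rate: int = SAMPLE_RATE,
) -> Optional[Sequence]:
    """Slice samples between two word-boundary times given in seconds.

    Returns None when the duration falls outside the 1-30 second range.
    """
    duration = end - start
    if not 1.0 <= duration <= 30.0:
        return None
    lo = int(round(start * sample_rate))
    hi = int(round(end * sample_rate))
    return samples[lo:hi]


def save_flac(path: str, samples, sample_rate: int = SAMPLE_RATE) -> None:
    """Write a segment as FLAC.

    Assumption: `soundfile` is installed and `samples` is a NumPy float array.
    """
    import soundfile as sf

    sf.write(path, samples, sample_rate, format="FLAC")
```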
## Features
| Feature | Type | Description |
|---------|------|-------------|
| `audio` | Audio | FLAC-encoded kirtan segment (16 kHz, mono) |
| `segment_id` | string | `{recording_id}_{sequence:04d}` |
| `recording_id` | string | MD5 hash of source URL (first 16 chars) |
| `sentence` | string | **Ground truth** — canonical Gurmukhi from STTM |
| `whisper_text` | string | Raw Whisper transcription (reference only) |
| `tuk_index` | int32 | Index of canonical line in source shabad |
| `start` | float32 | Segment start time (seconds) |
| `end` | float32 | Segment end time (seconds) |
| `duration` | float32 | Segment duration (seconds) |
| `match_score` | float32 | F1 score of Whisper words vs canonical text (0.5–1.0) |
| `avg_confidence` | float32 | Mean Whisper word confidence (0.0–1.0 scale; segments below 0.3 are filtered out) |
| `repetition` | int32 | How many times this tuk appeared so far in the recording |
| `partition_type` | string | `"full"`, `"first_half"`, or `"second_half"` |
| `pipeline` | string | `"whisper_v2_gpu"` |
| `ang` | int32 | Page number in Guru Granth Sahib |
| `style_bucket` | string | Recording style (hazoori, puratan, akj, taksali, live, mixed) |
| `artist_name` | string | Kirtani name |
| `shabad_id` | int64 | STTM database shabad ID |
| `raag` | string | Raag (melodic framework) |
| `writer` | string | Composer/saint credited with the verse |
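The two identifier formats in the table can be reproduced with a short sketch. It assumes the source URL is UTF-8 encoded before hashing and that the hash is the hexadecimal digest; the card states neither, and the helper names are hypothetical.

```python
import hashlib


def make_recording_id(source_url: str) -> str:
    """First 16 hex characters of the MD5 digest of the source URL."""
    return hashlib.md5(source_url.encode("utf-8")).hexdigest()[:16]


def make_segment_id(recording_id: str, sequence: int) -> str:
    """Build `{recording_id}_{sequence:04d}`."""
    return f"{recording_id}_{sequence:04d}"
```

Since `segment_id` embeds `recording_id`, all segments of one recording can be grouped with `segment_id.rsplit("_", 1)[0]`.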
## Quality Metrics
**Match Score (F1):** Overlap between Whisper-detected words and canonical STTM text after matra stripping.
- ≥ 0.8: high confidence match
- 0.6–0.8: good match
- 0.5–0.6: acceptable but noisy
**Avg Confidence:** Mean Whisper word-level probability.
- ≥ 0.7: Whisper confident
- 0.5–0.7: moderate
- < 0.5: Whisper uncertain (noisy audio)
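A small sketch of how these bands might be applied when selecting training rows. The tier labels are hypothetical names, and a score sitting exactly on a boundary is assigned to the higher band, which is an assumption. `load_clean_subset` assumes the Hugging Face `datasets` package and network access.

```python
from typing import Any, Dict


def match_tier(match_score: float) -> str:
    """Map an F1 match score to the bands listed above."""
    if match_score >= 0.8:
        return "high"
    if match_score >= 0.6:
        return "good"
    if match_score >= 0.5:
        return "noisy"
    return "below_threshold"


def confidence_tier(avg_confidence: float) -> str:
    """Map mean Whisper word confidence to the bands listed above."""
    if avg_confidence >= 0.7:
        return "confident"
    if avg_confidence >= 0.5:
        return "moderate"
    return "uncertain"


def is_clean(row: Dict[str, Any]) -> bool:
    """Keep only rows in the top band of both metrics."""
    return (
        match_tier(row["match_score"]) == "high"
        and confidence_tier(row["avg_confidence"]) == "confident"
    )


def load_clean_subset():
    """Load the train split and keep the cleanest rows.

    Assumption: `datasets` is installed; it is imported lazily so the
    helpers above work without it.
    """
    from datasets import load_dataset

    ds = load_dataset(
        "surindersinghssj/gurbani-asr-whisper-aligned", split="train"
    )
    return ds.filter(is_clean)
```

Starting from the strictest subset and relaxing the tiers later is one way to stage training data, since the noisier bands remain available in the same split.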
## Why Ground-Truth STTM Text?
Using Whisper's own transcription as training labels creates a closed loop where errors propagate. Instead, this dataset uses:
1. **STTM database** as the single source of truth for Gurbani text
2. **Whisper timestamps only** to locate where each word occurs in the audio
3. **Matra-normalised fuzzy matching** to handle Whisper's imperfect transcription
Even if Whisper misidentifies words, the canonical label remains correct.
## Intended Use
- Fine-tuning Whisper or similar ASR models on authentic kirtan audio
- Curriculum learning (clean studio recordings → noisy live gurdwara audio)
- Vocabulary-constrained Gurbani decoding (STTM as closed vocabulary)
- Shabad search: live audio → matched shabad via BM25 index
## Companion Dataset
A CTC-refined variant using Meta's MMS forced aligner is available at [`surindersinghssj/gurbani-asr-ctc-aligned`](https://huggingface.co/datasets/surindersinghssj/gurbani-asr-ctc-aligned), with additional columns for CTC alignment scores and boundary shift metrics.
## Attribution
- **Audio**: [SikhNet](https://play.sikhnet.com) — original artists retain copyright
- **Text labels**: [SikhiToTheMax](https://www.sikhitothemax.org/) database (public domain)
- **Alignment pipeline**: Surt (Gurbani ASR v3)
## License
CC-BY-4.0. Audio sourced from SikhNet for non-commercial research and education.
Provided by: surindersinghssj



