surindersinghssj/gurbani-sehajpath-yt-captions-canonical
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/surindersinghssj/gurbani-sehajpath-yt-captions-canonical
Resource description:
---
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
language:
- pa
tags:
- gurbani
- sehaj-path
- gurmukhi
- punjabi
---
# Gurbani Sehajpath — Canonical-aligned ASR corpus
Output of the Stage-1 and Stage-2 canonical pipeline for **sehaj-path** (calm recitation of the Guru Granth Sahib). The corpus is built from publicly available audio recordings with aligned transcripts; clips are chunked by caption timing and aligned against the canonical Guru Granth Sahib Ji (SGGS) text.
## Columns
Schema is auto-inferred from the parquet shards. Primary columns:
- `audio` — 16 kHz mono waveform
- `final_text` — canonical Gurmukhi transcription (post Stage-2 alignment, recommended for training)
- `text` / `raw_text` — earlier-stage text fields for comparison / ablation
- `clip_id`, `video_id`, `start_s`, `end_s`, `duration_s` — provenance + timing
- `sggs_line`, `canonical_shabad_id`, `canonical_line_ids` — SGGS alignment targets
- `canonical_match_score`, `canonical_retrieval_margin`, `canonical_op_counts` — alignment quality signals
- `is_simran`, `decision` — segment classifications
- `caption_lang`, `caption_offset_s`, `n_cues`, `clip_mode` — caption-pipeline metadata
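As a rough illustration of how the alignment-quality and timing columns listed above might be used, the sketch below filters rows before training. It works on plain dictionaries so it runs without downloading anything. The thresholds, the reading of `canonical_match_score` as a float where higher is better, the treatment of `is_simran` as a boolean, and the choice to drop simran segments are all assumptions that this card does not confirm; `keep_row` and `filter_rows` are hypothetical helper names.

```python
from typing import Any, Dict, Iterable, Iterator

# Assumed values for illustration only; the card does not document score ranges.
MIN_MATCH_SCORE = 0.9
MAX_DURATION_S = 30.0


def keep_row(row: Dict[str, Any],
             min_score: float = MIN_MATCH_SCORE,
             max_duration_s: float = MAX_DURATION_S) -> bool:
    """Return True if a row passes the assumed quality and length checks."""
    text = row.get("final_text") or ""
    if not text.strip():
        return False  # no canonical transcription available
    if row.get("is_simran"):
        return False  # assumption: simran segments are excluded here
    score = row.get("canonical_match_score")
    if score is None or score < min_score:
        return False  # weak or missing alignment score
    duration = row.get("duration_s")
    if duration is None or not (0.0 < duration <= max_duration_s):
        return False  # missing, empty, or overly long clip
    return True


def filter_rows(rows: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Lazily yield only the rows accepted by keep_row."""
    return (row for row in rows if keep_row(row))


# Synthetic rows for demonstration; these are not real dataset records.
sample_rows = [
    {"clip_id": "clip-a", "final_text": "placeholder", "is_simran": False,
     "canonical_match_score": 0.97, "duration_s": 6.2},
    {"clip_id": "clip-b", "final_text": "placeholder", "is_simran": False,
     "canonical_match_score": 0.55, "duration_s": 5.0},
    {"clip_id": "clip-c", "final_text": "placeholder", "is_simran": True,
     "canonical_match_score": 0.99, "duration_s": 4.0},
    {"clip_id": "clip-d", "final_text": "placeholder", "is_simran": False,
     "canonical_match_score": 0.95, "duration_s": 45.0},
]

kept_ids = [row["clip_id"] for row in filter_rows(sample_rows)]
print(kept_ids)
```

With the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter`. Inspect the actual score distribution before settling on any threshold.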
## Intended use
Training and fine-tuning automatic speech recognition models that transcribe Gurbani sehaj-path audio into Gurmukhi text. This corpus is one of the primary training sources for [`surindersinghssj/surt-small-v3`](https://huggingface.co/surindersinghssj/surt-small-v3).
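A minimal sketch of turning one row into an (audio, text) training pair follows. It assumes the `audio` field decodes to a mapping with `array` and `sampling_rate` keys, as the Hugging Face `datasets` Audio feature commonly does; this card does not state the decoded layout. `to_training_pair` and the 0.05 s tolerance are hypothetical.

```python
from typing import Any, Dict, Sequence, Tuple

EXPECTED_SR = 16_000  # the card describes the audio as 16 kHz mono


def to_training_pair(row: Dict[str, Any],
                     tolerance_s: float = 0.05) -> Tuple[Sequence[float], str]:
    """Validate one row and return (waveform, canonical text)."""
    audio = row["audio"]  # assumed layout: {"array": ..., "sampling_rate": ...}
    sr = audio["sampling_rate"]
    if sr != EXPECTED_SR:
        raise ValueError(f"expected {EXPECTED_SR} Hz audio, got {sr} Hz")

    text = (row.get("final_text") or "").strip()
    if not text:
        raise ValueError("row has no final_text")

    samples = audio["array"]
    measured_s = len(samples) / sr
    declared_s = row.get("duration_s")
    # Optional sanity check: waveform length should match the declared duration.
    if declared_s is not None and abs(measured_s - declared_s) > tolerance_s:
        raise ValueError(
            f"duration mismatch: measured {measured_s:.3f}s, declared {declared_s:.3f}s"
        )
    return samples, text


# Synthetic one-second silent clip; not a real dataset record.
demo_row = {
    "audio": {"array": [0.0] * EXPECTED_SR, "sampling_rate": EXPECTED_SR},
    "final_text": "  placeholder  ",
    "duration_s": 1.0,
}

waveform, transcript = to_training_pair(demo_row)
print(len(waveform), repr(transcript))
```

Rows could presumably be streamed with `load_dataset("surindersinghssj/gurbani-sehajpath-yt-captions-canonical", split="train", streaming=True)`; the split name `train` is an assumption, so check the repository for the splits it actually provides.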
## Related
- Eval split (held-out): [`gurbani-sehajpath-yt-captions-eval-canonical`](https://huggingface.co/datasets/surindersinghssj/gurbani-sehajpath-yt-captions-eval-canonical)
- Companion kirtan corpus: [`gurbani-kirtan-yt-captions-300h-canonical`](https://huggingface.co/datasets/surindersinghssj/gurbani-kirtan-yt-captions-300h-canonical)
- Older studio sehaj corpus: [`gurbani-sehajpath`](https://huggingface.co/datasets/surindersinghssj/gurbani-sehajpath)
## License
CC BY 4.0.
Provider:
surindersinghssj



