surindersinghssj/gurbani-sehajpath-yt-captions
Source: Hugging Face. Updated 2026-04-18. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/surindersinghssj/gurbani-sehajpath-yt-captions
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/*.parquet"
---
# Gurbani YouTube caption dataset
Built incrementally by `scripts/bulk_yt_caption_pipeline.py` using the
native-webm + UI-transcript-line caption-mode pipeline (see
`feedback_yt_caption_pipeline.md` in the agent memory for the method).
Each parquet shard in `data/` is an **additive batch**; new shards are
uploaded without rewriting previous ones. `datasets.load_dataset(repo)`
returns the union of all shards as a single `train` split thanks to the
glob in the YAML config above.
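As a minimal sketch of the loading behaviour described above, the helper below wraps `datasets.load_dataset` for this repository. The helper name `load_captions` and its `streaming` switch are illustrative assumptions, not part of the dataset's own tooling; only the repo id and the single `train` split come from the card.

```python
# Repo id taken from the dataset card title.
REPO_ID = "surindersinghssj/gurbani-sehajpath-yt-captions"


def load_captions(repo_id: str = REPO_ID, streaming: bool = False):
    """Return the `train` split, i.e. the union of all shards under data/.

    `load_captions` is a hypothetical convenience wrapper. It needs the
    `datasets` package and network access when called.
    """
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    # The YAML config maps data/*.parquet to `train`, so newly uploaded
    # shards are picked up without any change on the caller's side.
    return load_dataset(repo_id, split="train", streaming=streaming)


# Example call (requires network access):
#   ds = load_captions()
#   print(ds.column_names)  # the card mentions a `video_id` column
```

Passing `streaming=True` avoids downloading every shard up front, which may be convenient while the dataset is still growing.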
- Sources: YouTube auto-captions (`pa-orig`) from per-video provenance
  in the `video_id` column
- Pipeline rules: native `webm/opus` download on IPv4; no resample at
  master stage; clip boundaries = YouTube transcript-panel UI lines
  (pair consecutive content events); `caption_offset_s = 0`
- Last snapshot: 2026-04-18 23:52:32
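
The clip-boundary rule in the list above ("pair consecutive content events") can be read as: each transcript line starts a clip that ends where the next line starts. The sketch below illustrates that reading only; it is not the code in `scripts/bulk_yt_caption_pipeline.py`. The names `Clip`, `pair_events`, and `final_end_s` are hypothetical, and treating "content events" as lines with non-blank text is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Clip:
    start_s: float
    end_s: float
    text: str


def pair_events(
    events: List[Tuple[float, str]],
    final_end_s: Optional[float] = None,
    caption_offset_s: float = 0.0,
) -> List[Clip]:
    """Turn (start_seconds, text) transcript lines into clips.

    Assumption: a "content event" is a line whose text is not blank.
    Each clip runs from its own start to the next content event's start.
    The last event yields a clip only if `final_end_s` is supplied.
    """
    content = sorted(
        ((start, text) for start, text in events if text.strip()),
        key=lambda e: e[0],
    )
    clips: List[Clip] = []
    # Pair each content event with its successor to get the clip end.
    for (start, text), (next_start, _) in zip(content, content[1:]):
        if next_start > start:  # skip zero-length or duplicate timestamps
            clips.append(
                Clip(start + caption_offset_s, next_start + caption_offset_s, text)
            )
    if content and final_end_s is not None:
        last_start, last_text = content[-1]
        if final_end_s > last_start:
            clips.append(
                Clip(last_start + caption_offset_s,
                     final_end_s + caption_offset_s,
                     last_text)
            )
    return clips
```

With `caption_offset_s = 0`, as the card states, clip times equal the transcript-panel times unchanged; the parameter is kept only to make that choice explicit.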
Provider:
surindersinghssj



