frovolts/Lipi-Ghor-bn-882-SSTT

Name: frovolts/Lipi-Ghor-bn-882-SSTT
Creator: frovolts
Published: 2026-04-11 08:48:41
License: CC BY 4.0

Source: Hugging Face. Updated 2026-04-11. Indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/frovolts/Lipi-Ghor-bn-882-SSTT


Resource overview:

---
license: cc-by-4.0
language:
- bn
size_categories:
- 1K<n<10K
task_categories:
- automatic-speech-recognition
- audio-classification
tags:
- bengali
- bangla
- speech
- diarization
- asr
- low-resource
- sstt
- dl-sprint-4
pretty_name: Lipi-Ghor — Bengali Speech Dataset (bn-882-SSTT)
---

# 🗣️ Lipi-Ghor | লিপিঘর — Bengali Speech Dataset (bn-882-SSTT)

[![Language](https://img.shields.io/badge/Language-Bengali%20(bn)-green)](https://huggingface.co/datasets/Sanjidh090/Lipi-Ghor-bn-882-SSTT)
[![Hours](https://img.shields.io/badge/Audio-882%20hrs-blue)](https://huggingface.co/datasets/Sanjidh090/Lipi-Ghor-bn-882-SSTT)
[![License](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey)](https://creativecommons.org/licenses/by/4.0/)
[![DL Sprint](https://img.shields.io/badge/DL%20Sprint-4.0-orange)](https://github.com/Sanjidx090)

**Lipi-Ghor** (লিপিঘর, meaning *"House of Scripts"*) is a large-scale Bengali speech dataset designed for automatic speech recognition (ASR), speaker diarization, and spoken language research. It is one of the largest open Bengali speech corpora with aligned speaker, transcription, and timestamp annotations.

Built by **Team_Villagers** as part of **DL Sprint 4.0**.

---

## Dataset Details

### Dataset Description

Lipi-Ghor is a large-scale, multi-domain Bengali speech corpus covering ~882 hours of audio sourced from 1,019 YouTube videos across 596 unique channels. Each video has been processed through speaker diarization (pyannote-audio) and aligned with Bengali caption transcripts to produce structured SSTT (Speaker, Speech, Transcription, Timestamp) annotations.

The dataset covers a wide range of spoken Bengali domains, registers, and regional dialects, making it one of the most diverse open Bengali speech resources available.

- **Curated by:** Team_Villagers (Sanjid Hasan, A H M Fuad, Risalat Labib, Bayazid Hasan)
- **Competition:** DL Sprint 4.0
- **Language(s):** Bengali / Bangla (`bn`)
- **License:** CC BY 4.0

### Dataset Sources

- **Repository:** [Sanjidh090/Lipi-Ghor-bn-882-SSTT](https://huggingface.co/datasets/Sanjidh090/Lipi-Ghor-bn-882-SSTT)
- **Source data:** YouTube (public videos with Bengali caption tracks)

---

## Dataset at a Glance

| Field                   | Value                                            |
|-------------------------|--------------------------------------------------|
| **Total hours sourced** | ~882 hours                                       |
| **Fully annotated**     | ~856 hours (diarization + transcription)         |
| **Pending upload**      | ~194 hours (~321 videos)                         |
| **Total videos**        | 1,019                                            |
| **Unique channels**     | 596                                              |
| **Language**            | Bengali / Bangla (`bn`)                          |
| **Annotation format**   | SSTT: Speaker, Speech, Transcription, Timestamp  |
| **Audio format**        | MP3 (pyannote-segmented)                         |
| **Diarization**         | pyannote-audio (SOTA)                            |
| **License**             | CC BY 4.0                                        |

---

## Uses

### Direct Use

This dataset is intended for:

- **Bengali ASR model training:** fine-tuning Whisper, wav2vec2, MMS, and similar models
- **Speaker diarization research:** "who spoke when" tasks in Bengali
- **Bengali TTS:** speaker-labeled segments can inform voice synthesis pipelines
- **Dialect identification:** the dataset covers Standard Dhaka Bengali, Chittagonian, Sylheti, Rangpuri, and Barishal variants
- **Multilingual NLP benchmarking:** Bengali is consistently under-represented in multilingual benchmarks

### Out-of-Scope Use

- **Surveillance or speaker re-identification:** speaker labels (`SPEAKER_00`, `SPEAKER_01`, etc.) are local to each video and do not track identity across videos
- **High-stakes production ASR without filtering:** the majority of transcripts are sourced from auto-generated YouTube captions and may contain recognition errors; human verification is recommended before deployment in critical applications

---

## Access this dataset

```python
from datasets import load_dataset

# Load a sample of Lipi-Ghor.
# streaming=True iterates over records without downloading the full split first.
dataset = load_dataset("Sanjidh090/Lipi-Ghor-bn-882-SSTT", split="test", streaming=True)

# Take the first record and print its speaker label and transcript.
sample = next(iter(dataset))
print(f"Speaker: {sample['speaker']}")
print(f"Text: {sample['text']}")
```

## Dataset Structure

```
Lipi-Ghor-bn-882-SSTT/
├── data/                                    # Audio segments (.mp3, pyannote-segmented)
├── diarization_results/                     # Per-video diarization output (*_output.json)
├── diarization_results_with_transcription/  # Diarization + transcript aligned (*_unified.json)
├── diarization_transcription_final/         # Cleaned final outputs (*_unified.json)
└── test/                                    # Test samples (.wav)
```

### File Naming Convention

All annotation files use the YouTube **video ID** as the base filename:

- `{video_id}_output.json`: raw diarization output
- `{video_id}_unified.json`: diarization + transcription merged

### Annotation Format (SSTT)

Each `_unified.json` contains an array of segments:

```json
[
  {
    "speaker": "SPEAKER_00",
    "start": 12.34,
    "end": 18.72,
    "text": "আমরা আজকে এই বিষয়টি নিয়ে কথা বলব।"
  }
]
```

| Field     | Type   | Description                         |
|-----------|--------|-------------------------------------|
| `speaker` | string | Speaker label from diarization      |
| `start`   | float  | Segment start time (seconds)        |
| `end`     | float  | Segment end time (seconds)          |
| `text`    | string | Bengali transcript for this segment |

---

## Content & Categories

| Category              | Videos | Hours  |
|-----------------------|--------|--------|
| Talk-show             | 357    | 240.0  |
| Audio-book            | 248    | 218.3  |
| Movie                 | 31     | 67.3   |
| Podcast               | 37     | 45.4   |
| Cartoon               | 56     | 36.3   |
| Audiobook (variant)   | 17     | 28.7   |
| Natok (drama)         | 21     | 21.4   |
| Bangla Cinema         | 14     | 20.0   |
| Drama                 | 20     | 19.9   |
| Kirton                | 14     | 16.4   |
| Waz / Islamic Sermon  | 20     | 16.2   |
| Kolkata Bangla Movie  | 8      | 16.0   |
| + 150 more categories | ...    | ...    |

Dialectal coverage includes: Standard Dhaka Bengali, Chittagonian, Sylheti, Rangpuri, and Barishal variants.

---

## Top Channels by Hours

| Channel                        | Videos | Hours  |
|--------------------------------|--------|--------|
| My AudioBook                   | 229    | 202.4  |
| Roy Parrett                    | 132    | 113.7  |
| BanglaVision NEWS              | 144    | 97.3   |
| Abhijit Story Zone             | 92     | 89.9   |
| Audio Book Bangla by Faheem    | 71     | 87.0   |
| ATN Bangla Talk Show           | 105    | 86.6   |
| GTV News                       | 70     | 73.9   |
| Eso Galpo Shuni                | 81     | 63.9   |
| Golpo Toru                     | 63     | 48.9   |
| AudioKothon with RAJIA         | 52     | 47.2   |

---

## Dataset Creation

### Curation Rationale

Bengali is spoken by over 230 million people yet remains severely under-resourced in ASR and spoken language research. Existing open Bengali ASR datasets are typically small (5–40 hours), limited to read speech, and lack speaker annotations. Lipi-Ghor was created to address this gap with a large-scale, multi-domain, diarized corpus that reflects the diversity of real spoken Bengali across dialects, topics, and recording conditions.

### Source Data

#### Data Collection and Processing

1. **Video Selection:** YouTube video IDs were collected across 596 Bengali channels covering diverse domains. Only videos with existing Bengali caption tracks (manual or community-contributed) were retained to ensure baseline transcription quality.
2. **Audio & Transcript Extraction:** [`yt-dlp`](https://github.com/yt-dlp/yt-dlp) was used to download audio (MP3) and pull Bengali caption/subtitle tracks (`bn` language code).
3. **Speaker Diarization:** [`pyannote-audio`](https://github.com/pyannote/pyannote-audio) was applied to each audio file to segment speech into speaker turns with precise timestamps. Outputs are stored as `*_output.json`.
4. **Alignment:** YouTube transcripts were aligned with pyannote speaker segments to produce SSTT-format `*_unified.json` files. A cleaned final version is stored in `diarization_transcription_final/`.

#### Who are the Source Data Producers?

The source audio and transcripts are derived from publicly available YouTube content created by Bengali-language content creators across Bangladesh and West Bengal. Content spans professional news channels, independent creators, audiobook narrators, and community contributors.

### Annotations

#### Annotation Process

Speaker diarization was performed automatically using pyannote-audio. Transcription was sourced from existing YouTube caption tracks: 86 videos have manually created captions; the remaining ones use auto-generated YouTube captions. Alignment between diarization segments and caption timestamps was performed programmatically.

#### Who are the Annotators?

- **Diarization:** pyannote-audio (automated)
- **Transcription:** YouTube caption system and original content creators
- **Post-processing and pipeline:** Team_Villagers (Sanjid Hasan, A H M Fuad, Risalat Labib, Bayazid Hasan)

#### Personal and Sensitive Information

The dataset contains publicly broadcast speech from YouTube. Speaker labels are anonymous (`SPEAKER_00`, etc.) and are not linked to real-world identities. No cross-video speaker identity tracking is performed. Content creators retain their original copyright; this dataset is intended for research and non-commercial use only.

> If you are a content creator and wish to have your content removed from this dataset, please open an issue or contact us directly.

---

## Bias, Risks, and Limitations

- **Transcript quality varies:** 86 videos have human-verified captions; 1,254 use auto-generated YouTube captions, which may contain recognition errors, especially for dialectal speech and code-switching.
- **Audio quality varies:** the audio is sourced from diverse YouTube content; some recordings contain background music, overlapping speakers, or artifacts.
- **~194 hours pending:** approximately 321 videos are sourced and diarized but not yet fully uploaded to this repository.
- **Speaker labels are local:** `SPEAKER_00`, `SPEAKER_01`, etc. are per-video labels only. Cross-video speaker identity is not tracked.
- **Code-switching:** some content contains Bengali-English mixing, which reflects real usage but may affect monolingual ASR models.
- **Geographic bias:** the majority of content originates from Dhaka-centric media channels; rural and minority dialects may be underrepresented relative to their speaker populations.

### Recommendations

Users training ASR models should consider filtering by transcript type (`manual` vs `auto`) and evaluating on a held-out human-verified subset before deployment. For dialect-robust training, stratified sampling across the category and channel distribution is recommended.

---

## Citation

If you use Lipi-Ghor in your research, please cite:

**BibTeX:**

```bibtex
@dataset{lipighor2026,
  title     = {Lipi-Ghor: A Large-Scale Bengali Speech Dataset with Speaker Diarization and Transcription},
  author    = {Hasan, Sanjid and Fuad, A. H. M. and Labib, Risalat and Hasan, Bayazid},
  year      = {2026},
  publisher = {Hugging Face},
  doi       = {10.57967/hf/7877},
  url       = {https://huggingface.co/datasets/Sanjidh090/Lipi-Ghor-bn-882-SSTT},
  note      = {DL Sprint 4.0, Team Villagers}
}
```

**APA:**

Hasan, S., Fuad, A. H. M., Labib, R., & Hasan, B. (2026). *Lipi-Ghor: A Large-Scale Bengali Speech Dataset with Speaker Diarization and Transcription* [Dataset]. Hugging Face. https://huggingface.co/datasets/Sanjidh090/Lipi-Ghor-bn-882-SSTT

**Paper Citation:**

```bibtex
@misc{hasan2026make,
  title         = {Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment},
  author        = {Hasan, Sanjid and Labib, Risalat and Fuad, A. H. M. and Hasan, Bayazid},
  year          = {2026},
  eprint        = {2602.23070},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  url           = {https://arxiv.org/abs/2602.23070}
}
```

---

## Glossary

| Term | Definition |
|------|-----------|
| **SSTT** | Speaker, Speech, Transcription, Timestamp: the annotation schema used in this dataset |
| **Diarization** | The process of segmenting audio by speaker ("who spoke when") |
| **pyannote-audio** | State-of-the-art open-source speaker diarization library |
| **yt-dlp** | Open-source tool for downloading YouTube audio and subtitles |
| **bn** | ISO 639-1 language code for Bengali/Bangla |

---

## Dataset Card Authors

**Team_Villagers**, DL Sprint 4.0

- Sanjid Hasan
- A H M Fuad
- Risalat Labib
- Bayazid Hasan

## Dataset Card Contact

Open an issue on the Hugging Face repository or contact via the repository discussion tab.

---

## Acknowledgements

- [yt-dlp](https://github.com/yt-dlp/yt-dlp) for audio and caption extraction
- [pyannote-audio](https://github.com/pyannote/pyannote-audio) for speaker diarization
- All Bengali content creators whose work made this dataset possible
- DL Sprint 4.0 organizers

---

*লিপিঘর — বাংলা ভাষার জন্য, বাংলা ভাষার গবেষকদের জন্য।*

*Lipi-Ghor: for the Bengali language, for Bengali language researchers.*
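
The SSTT annotation format and the file naming convention described in the dataset card can be exercised with a short script. The following is a minimal sketch, assuming a `*_unified.json` file holds exactly the documented array of `speaker` / `start` / `end` / `text` objects. The helper names (`video_id_from_filename`, `speaking_time_by_speaker`) and the toy segments are illustrative and are not part of the dataset or its tooling.

```python
import json
from collections import defaultdict
from pathlib import Path


def video_id_from_filename(filename: str) -> str:
    """Recover the video ID by stripping the documented annotation suffix."""
    name = Path(filename).name
    for suffix in ("_unified.json", "_output.json"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    raise ValueError(f"unexpected annotation filename: {filename}")


def speaking_time_by_speaker(segments: list[dict]) -> dict[str, float]:
    """Sum (end - start) per speaker label, in seconds."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        if seg["end"] < seg["start"]:
            raise ValueError(f"segment ends before it starts: {seg}")
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


# Toy data in the documented shape; a real file would be read with json.load().
raw = """
[
  {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.5, "text": "placeholder"},
  {"speaker": "SPEAKER_01", "start": 4.5, "end": 6.0, "text": "placeholder"},
  {"speaker": "SPEAKER_00", "start": 6.0, "end": 10.0, "text": "placeholder"}
]
"""
segments = json.loads(raw)

print(video_id_from_filename("abc123_unified.json"))
print(speaking_time_by_speaker(segments))
```

Because speaker labels are local to each video, totals like these are only meaningful within a single file and should not be merged across videos.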

Provided by:

frovolts

Summary

Dataset introduction

Construction

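In low-resource language processing, Bengali automatic speech recognition research has long been held back by a shortage of data. Lipi-Ghor was built to fill this gap. The pipeline starts by selecting 1,019 public videos with caption tracks from 596 Bengali YouTube channels. yt-dlp extracts the audio streams and the corresponding Bengali caption text, and pyannote-audio then performs speaker diarization, automatically splitting the audio into per-speaker segments with timestamps. Finally, a programmatic alignment step matches the caption text to the corresponding speaker segments, yielding structured SSTT (Speaker, Speech, Transcription, Timestamp) annotations and a large-scale, multi-domain corpus of about 882 hours of audio.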
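
The source does not describe how the alignment between captions and speaker turns is computed. As an illustration only, the sketch below uses a common heuristic: each caption is assigned to the speaker turn it overlaps most in time. The `Turn` and `Caption` classes, the `align` function, and the toy timings are all assumptions made for this example and should not be read as the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One diarization turn (hypothetical container for this sketch)."""
    speaker: str
    start: float
    end: float


@dataclass
class Caption:
    """One timed caption line (hypothetical container for this sketch)."""
    start: float
    end: float
    text: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def align(turns: list[Turn], captions: list[Caption]) -> list[dict]:
    """Assign each caption to the turn with the largest temporal overlap."""
    aligned = []
    for cap in captions:
        best = max(
            turns,
            key=lambda t: overlap(t.start, t.end, cap.start, cap.end),
            default=None,
        )
        if best is None or overlap(best.start, best.end, cap.start, cap.end) == 0.0:
            continue  # caption does not intersect any speaker turn
        aligned.append(
            {"speaker": best.speaker, "start": cap.start, "end": cap.end, "text": cap.text}
        )
    return aligned


turns = [Turn("SPEAKER_00", 0.0, 5.0), Turn("SPEAKER_01", 5.0, 9.0)]
captions = [
    Caption(0.5, 4.0, "first caption"),
    Caption(4.0, 8.0, "second caption"),   # 1 s with SPEAKER_00, 3 s with SPEAKER_01
    Caption(20.0, 22.0, "orphan caption"),  # outside every turn, so it is dropped
]

for row in align(turns, captions):
    print(row)
```

A maximum-overlap rule is simple but lossy when a caption spans a speaker change; splitting such captions at the turn boundary is one possible refinement.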

Usage

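Researchers can load the dataset with the Hugging Face datasets library, using streaming to handle the large audio files efficiently. The dataset is mainly used to train and fine-tune Bengali ASR models such as Whisper and wav2vec2. It is also suited to developing and evaluating speaker diarization, that is, the "who spoke when" problem in multi-speaker settings. In addition, speaker-labeled segments can serve as voice references for speech synthesis systems, and the dialect data supports research on identifying varieties of Bengali. Users are advised to filter the data by caption source (manual or auto-generated) and, before deploying in critical applications, to evaluate on a human-verified subset to check model robustness.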
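
The recommendation to filter by caption source can be sketched as a lazy filter over an iterable of records, which would also work on a streamed dataset since it never materializes the full list. The source does not document a field that records caption type, so the `video_id` key on each record and the `caption_type_by_video` lookup table below are hypothetical; they stand in for whatever metadata a user assembles.

```python
from collections.abc import Iterable, Iterator


def filter_by_caption_type(
    segments: Iterable[dict],
    caption_type_by_video: dict[str, str],
    allowed: set[str],
) -> Iterator[dict]:
    """Yield segments whose video has an allowed caption type and non-empty text."""
    for seg in segments:
        caption_type = caption_type_by_video.get(seg["video_id"])
        if caption_type in allowed and seg["text"].strip():
            yield seg


# Hypothetical lookup: video ID -> "manual" or "auto".
caption_type_by_video = {"vid_a": "manual", "vid_b": "auto"}

# Toy records; the `video_id` key is an assumption, not a documented field.
segments = [
    {"video_id": "vid_a", "speaker": "SPEAKER_00", "start": 0.0, "end": 3.0, "text": "kept"},
    {"video_id": "vid_b", "speaker": "SPEAKER_00", "start": 0.0, "end": 2.0, "text": "auto caption"},
    {"video_id": "vid_a", "speaker": "SPEAKER_01", "start": 3.0, "end": 4.0, "text": "   "},
    {"video_id": "vid_c", "speaker": "SPEAKER_00", "start": 0.0, "end": 1.0, "text": "unknown source"},
]

manual_only = list(filter_by_caption_type(segments, caption_type_by_video, {"manual"}))
print(len(manual_only), manual_only[0]["text"])
```

Segments from videos missing in the lookup are dropped rather than guessed, which keeps a held-out human-verified subset clean at the cost of some data.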

Background and challenges

Background overview

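Bengali is spoken by more than 230 million people worldwide, yet public speech data for it has long been scarce. Lipi-Ghor was created in 2026 by Team_Villagers for the DL Sprint 4.0 competition, with the aim of building a large-scale, multi-domain Bengali speech corpus with speaker diarization and transcription annotations. The dataset draws about 882 hours of audio from 596 YouTube channels, covers genres such as talk shows, audiobooks, and movies, and includes regional varieties such as Standard Dhaka Bengali and Chittagonian. Its core research problem is the shortage of high-quality training data for Bengali ASR and speaker diarization. By providing SSTT (Speaker, Speech, Transcription, Timestamp) annotations, it lays a foundation for training and evaluating speech models for low-resource languages.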

Current challenges

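The dataset addresses two sets of challenges in Bengali ASR and speaker diarization research. On the domain side, Bengali speech data has long been small in scale, limited to a single domain, and lacking speaker annotations, which constrains model performance in complex scenarios. On the construction side, there are several technical difficulties. The transcripts rely on auto-generated YouTube captions, whose accuracy fluctuates with dialect, background noise, and code-switching, so algorithmic alignment and cleaning are needed to maintain quality. Speaker diarization is handled by pyannote-audio, but background music and overlapping speakers in the audio complicate segmentation and labeling. Finally, although dialect coverage is broad, it cannot fully reflect the actual distribution of all regional varieties, so some risk of geographic bias remains.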
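
One way to gauge how much overlapping speech a file contains is to look for segments from different speakers whose time ranges intersect. The sketch below assumes segments in the documented `speaker` / `start` / `end` shape; the function name `overlapping_pairs` and the toy segments are illustrative, and the dataset itself does not ship such a check.

```python
def overlapping_pairs(segments: list[dict]) -> list[tuple[int, int]]:
    """Return index pairs (i, j) of segments from different speakers that overlap in time."""
    # Visit segments in order of start time so the inner scan can stop early.
    order = sorted(range(len(segments)), key=lambda i: segments[i]["start"])
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if segments[j]["start"] >= segments[i]["end"]:
                break  # later segments start even later, so none can overlap i
            if segments[j]["speaker"] != segments[i]["speaker"]:
                pairs.append((i, j))
    return pairs


segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 5.0},
    {"speaker": "SPEAKER_01", "start": 4.0, "end": 7.0},  # overlaps the first segment
    {"speaker": "SPEAKER_00", "start": 7.0, "end": 9.0},  # starts exactly when the second ends
]

print(overlapping_pairs(segments))
```

Segments that merely touch at a boundary are not counted as overlapping here; whether to treat them as overlaps is a design choice that depends on the downstream task.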

Common scenarios

Classic use cases

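The classic use case for Lipi-Ghor is training and evaluating automatic speech recognition models for Bengali. With about 882 hours of multi-domain speech, it gives researchers a wide range of real conversational samples across talk shows, audiobooks, movies, and other content types. Its speaker diarization and transcript alignment support end-to-end ASR system development, and it is particularly valuable for handling the dialect varieties of Bengali.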

Academic problems addressed

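The dataset mainly addresses the lack of resources for Bengali, a low-resource language, in speech technology research. Earlier Bengali speech datasets are typically small and lack speaker annotations, whereas Lipi-Ghor's large-scale, multi-dialect collection provides a basis for research on speaker diarization, dialect identification, and multilingual benchmarking. Its SSTT annotation format also supports research on speech-text alignment and offers new data for studying the temporal and regional characteristics of spoken Bengali.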

Practical applications

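In practice, Lipi-Ghor can directly support product development for Bengali speech technology. ASR models trained on it can be applied to news transcription, automatic captioning of educational content, and similar areas. Its speaker annotations also offer acoustic reference material for speech synthesis systems and support the development of personalized voice assistants. In addition, the dialect varieties it covers help build more inclusive voice interfaces for Bengali speakers in different regions.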

Recent research on the dataset