NCTB-PrimaryText: A Curriculum-Aligned Textbook Chunk Dataset for Bangla and English Primary Education.
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/w3hj4m45d9
Description:
NCTB-PrimaryText is a curriculum-aligned textbook chunk dataset for Bangla and English primary education, prepared from officially published NCTB primary textbook PDFs and released in machine-readable JSONL format. The data were generated through an end-to-end digitization pipeline that renders textbook pages to high-resolution images, applies bilingual OCR, performs language-aware cleaning to remove scanning artifacts while preserving educational structure, and segments the text into short, pedagogically coherent chunks suitable for retrieval and tutoring. Each JSON record includes grade and subject metadata, chapter number and title, a deterministic chunk identifier in the form `---`, and the cleaned chunk text. The dataset is intended for curriculum-grounded NLP research and educational applications, including retrieval-augmented generation, tutoring and question answering, multilingual retrieval, and benchmarking in low-resource settings; limitations include unavoidable OCR noise and comparatively higher extraction difficulty for mathematics content due to symbols and layout-heavy expressions.
Created:
2026-02-19
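
Example (illustrative only):
The description says each JSONL record carries grade and subject metadata, chapter information, a chunk identifier, and the cleaned chunk text. The sketch below shows one way such records might be read and grouped by grade and subject before building a retrieval index. The key names (`grade`, `subject`, `text`) and the file name mentioned in the comments are assumptions made for illustration, not the dataset's documented schema; check them against the released files.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed key names; the listing does not spell out the actual schema.
GRADE_KEY = "grade"
SUBJECT_KEY = "subject"
TEXT_KEY = "text"


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def group_by_grade_subject(records: Iterable[dict]) -> Dict[Tuple, List[dict]]:
    """Group records under (grade, subject) so each curriculum slice can be indexed separately."""
    groups: Dict[Tuple, List[dict]] = defaultdict(list)
    for rec in records:
        groups[(rec.get(GRADE_KEY), rec.get(SUBJECT_KEY))].append(rec)
    return dict(groups)


if __name__ == "__main__":
    # In-memory stand-in for a JSONL file. With a real file (hypothetical name):
    #   with open("chunks.jsonl", encoding="utf-8") as f:
    #       groups = group_by_grade_subject(iter_records(f))
    sample = [
        '{"grade": 3, "subject": "English", "text": "A short chunk."}',
        '{"grade": 3, "subject": "Bangla", "text": "আমার বই"}',
    ]
    groups = group_by_grade_subject(iter_records(sample))
    for key, recs in groups.items():
        print(key, len(recs))
```

Opening the file with `encoding="utf-8"` matters here because the Bangla chunks are non-ASCII; relying on the platform default encoding can fail on some systems.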



