NCTB-PrimaryText: A Curriculum-Aligned Textbook Chunk Dataset for Bangla and English Primary Education.

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/w3hj4m45d9
Description:
NCTB-PrimaryText is a curriculum-aligned textbook chunk dataset for Bangla and English primary education, prepared from officially published NCTB primary textbook PDFs and released in machine-readable JSONL format. The data were generated through an end-to-end digitization pipeline that renders textbook pages to high-resolution images, applies bilingual OCR, performs language-aware cleaning to remove scanning artifacts while preserving educational structure, and segments the text into short, pedagogically coherent chunks suitable for retrieval and tutoring. Each JSON record includes grade and subject metadata, chapter number and title, a deterministic chunk identifier in the form `---`, and the cleaned chunk text. The dataset is intended for curriculum-grounded NLP research and educational applications, including retrieval-augmented generation, tutoring and question answering, multilingual retrieval, and benchmarking in low-resource settings; limitations include unavoidable OCR noise and comparatively higher extraction difficulty for mathematics content due to symbols and layout-heavy expressions.
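As a rough illustration of how records of this kind might be consumed, the sketch below parses JSONL lines and groups chunks by grade and subject. The key names (`grade`, `subject`, `chapter_number`, `chapter_title`, `chunk_id`, `text`) and the sample records are assumptions made for this example; the description lists the kinds of metadata but not the exact keys, so check the released files before relying on them.

```python
import json
from collections import defaultdict
from typing import Iterable

# Hypothetical key names; the dataset's actual JSON keys may differ.
REQUIRED_FIELDS = (
    "grade", "subject", "chapter_number", "chapter_title", "chunk_id", "text",
)


def load_chunks(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL lines, skipping blank lines and records missing a field."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if all(field in record for field in REQUIRED_FIELDS):
            records.append(record)
    return records


def group_by_grade_subject(records: list[dict]) -> dict:
    """Group chunk records under (grade, subject) keys."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["grade"], record["subject"])].append(record)
    return dict(groups)


# Made-up sample records for illustration only; not taken from the dataset.
SAMPLE_JSONL = [
    json.dumps({
        "grade": 3, "subject": "English", "chapter_number": 1,
        "chapter_title": "Greetings", "chunk_id": "sample-0001",
        "text": "Good morning. How are you?",
    }),
    "",  # blank line, skipped
    json.dumps({
        "grade": 3, "subject": "English", "chapter_number": 1,
        "chapter_title": "Greetings", "chunk_id": "sample-0002",
        "text": "I am fine, thank you.",
    }),
    # Missing the "text" key, so it is skipped.
    json.dumps({"grade": 4, "subject": "Math", "chunk_id": "sample-0003"}),
]

records = load_chunks(SAMPLE_JSONL)
groups = group_by_grade_subject(records)
```

With a real file, the same function could be called as `load_chunks(open(path, encoding="utf-8"))`, since an open text file yields one line per iteration. Reading as UTF-8 matters here because the Bangla text is non-ASCII.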
Created:
2026-02-19