Non-STEM_TextBook_English
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/gt529sf5jy
Resource description:
The full corpus is curated across multiple STEM/Non-STEM disciplines and structured for use in LLM training, evaluation, and instruction tuning (SFT/RLHF). This sample represents the structure and quality of the larger dataset.
Dataset composition (full corpus):
Text corpus: 1.6B+ words of curated STEM and Non-STEM educational content across 22,000+ textbooks in 7 languages (English, Hindi, Arabic, Bahasa, Tamil, Telugu, Kannada)
Question–Answer pairs: 6.5M+ high-quality STEM and Non-STEM Q&A pairs in English, Arabic, Hindi, and other Indic languages
Video data: 100K+ hours of STEM videos and 30K+ hours of user-generated content (UGC)
Audio data: 821K+ hours of podcasts and call center data (dual channel)
Medical datasets: 30M+ files including clinical and diagnostic data such as CT scans, MRI, X-rays, pathology, EHRs, USG reports, and echo reports
This repository includes:
A small preview subset of the Non-STEM English Textbooks data
Flat, viewer-friendly schema for inspection
Parquet files suitable for benchmarking and evaluation
Purpose of this dataset:
Dataset preview and validation
Model evaluation and experimentation
Schema and format inspection before full-scale access
Note: This repository contains sample data only, not the full dataset. Access to the complete dataset is available separately under appropriate licensing or partnership terms.
For full details, please contact vipul.mishra@infobay.ai.
Created:
2026-01-14



