Non-STEM_TextBook_Arabic

NIAID Data Ecosystem, indexed 2026-05-10

Download link:

https://data.mendeley.com/datasets/7s74hc57d2


Resource description:

The full corpus is curated across multiple STEM and Non-STEM disciplines and structured for use in LLM training, evaluation, and instruction tuning (SFT/RLHF). This sample represents the structure and quality of the larger dataset.

Dataset composition (full corpus):
1. Text corpus: 1.6B+ words of curated STEM and Non-STEM educational content across 22,000+ textbooks in 7 languages (English, Hindi, Arabic, Bahasa, Tamil, Telugu, Kannada)
2. Question-Answer pairs: 6.5M+ high-quality STEM and Non-STEM Q&A pairs in English, Arabic, Hindi, and Indic languages
3. Video data: 100K+ hours of STEM videos and 30K+ hours of UGC
4. Audio data: 821K+ hours of podcasts and call center data (dual channel)
5. Medical datasets: 30M+ files, including clinical and diagnostic data such as CT scans, MRI, X-ray, pathology, EHRs, USG reports, and echo reports

This repository includes:
- A small preview subset of the Non-STEM Arabic TextBook data
- Flat, viewer-friendly schema for inspection
- Parquet files suitable for benchmarking and evaluation

Purpose of this dataset:
- Dataset preview and validation
- Model evaluation and experimentation
- Schema and format inspection before full-scale access

Note: This repository contains sample data only; it is not the full dataset. Access to the complete dataset is available separately under appropriate licensing or partnership terms. For full details, please contact vipul.mishra@infobay.ai.
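Because the preview is distributed as Parquet files, a quick format check can be useful before loading a download into heavier tooling. The following is a minimal sketch, not part of the dataset's own tooling: it relies only on the documented Parquet layout (an unencrypted file starts and ends with the 4-byte marker `PAR1`, and the 4 bytes before the trailing marker hold the footer metadata length as a little-endian integer). The function name `parquet_footer_length` is hypothetical, as is any file name passed to it.

```python
import struct
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # marker at both ends of an unencrypted Parquet file


def parquet_footer_length(path):
    """Return the footer metadata length in bytes, or None if the file
    does not look like an unencrypted Parquet file."""
    size = Path(path).stat().st_size
    # Smallest plausible file: leading magic + 4-byte length + trailing magic.
    if size < 12:
        return None
    with open(path, "rb") as fh:
        head = fh.read(4)
        fh.seek(-8, 2)  # last 8 bytes: footer length followed by magic
        tail = fh.read(8)
    if head != PARQUET_MAGIC or tail[4:] != PARQUET_MAGIC:
        return None
    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer has to fit between the leading magic and the 8-byte trailer.
    if footer_len > size - 12:
        return None
    return footer_len
```

A `None` result suggests the download is truncated or is not a Parquet file at all (for example, an HTML error page saved under a `.parquet` name). A non-`None` result only confirms the outer framing; reading the actual column schema still requires a Parquet library.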


Created:

2026-01-14
