LingoIITGN/Gurukul
Source: Hugging Face. Updated 2026-03-19; indexed 2025-07-05.
Download link:
https://hf-mirror.com/datasets/LingoIITGN/Gurukul
Resource description:
---
license: cc-by-nd-4.0
language:
- en
size_categories:
- 10K<n<100K
---
# Gurukul
**Gurukul** is an educational question-answering dataset aligned with the **Indian school curriculum**, building on the original Gurukul series. It contains high-quality QA pairs derived from class-level school textbooks (primarily English prose, plus related subjects) and is designed to support reading comprehension, vocabulary building, inference, and curriculum-based language understanding in educational AI applications.
## Overview
Gurukul provides structured question-answer pairs extracted from NCERT textbooks used in Indian schools, each paired with a rich contextual passage. It targets school-level content (mainly secondary education) to enable training and evaluation of models for:
- Educational question answering
- Curriculum-aligned reading comprehension
- Vocabulary, idiom, antonym/synonym, and inference tasks
- Development of AI tutors / assistants for Indian students
Key features:
- Aligned with Indian education system (NCERT-style content)
- Focus on English language learning in school context
- High-quality, human-curated or refined examples
## Languages
- **English** (primary language of questions, answers, and contexts)
## Covered Subjects and Classes
Gurukul draws from NCERT-aligned textbooks, supporting multiple core subjects across secondary and higher secondary levels:
| Subject | Classes Covered | Focus Areas / Example Topics |
|--------------|--------------------------|-----------------------------------------------------------|
| **English** | Class 9 – 12 | Prose, poetry, comprehension, vocabulary, grammar, literature (e.g., biographies, stories, idioms) |
| **Mathematics** | Class 9 – 12 | Algebra, geometry, trigonometry, calculus basics, number systems, statistics, coordinate geometry |
| **Science** | Class 9 – 12 | Physics (motion, force, electricity), Chemistry (atoms, reactions, acids/bases), Biology (life processes, heredity, ecology) |
- Questions are curriculum-aligned, often chapter-specific.
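A minimal sketch of selecting one slice of the coverage table. It assumes each record is a dictionary with string fields `subject` and `class`, and that class labels look like `Class 9`; the exact label format is not confirmed by this card. The helper names `class_number` and `select` and the sample records are hypothetical and invented for illustration only.

```python
import re
from typing import Dict, Iterable, List, Optional


def class_number(label: str) -> Optional[int]:
    """Extract the grade number from a label such as 'Class 9' (assumed format)."""
    match = re.search(r"\d+", label)
    return int(match.group()) if match else None


def select(
    records: Iterable[Dict[str, str]],
    subject: str,
    min_class: int = 9,
    max_class: int = 12,
) -> List[Dict[str, str]]:
    """Keep records for one subject whose class falls in [min_class, max_class]."""
    selected = []
    for record in records:
        grade = class_number(record.get("class", ""))
        if grade is None:
            continue  # skip records whose class label has no number
        same_subject = record.get("subject", "").strip().lower() == subject.lower()
        if same_subject and min_class <= grade <= max_class:
            selected.append(record)
    return selected


# Hypothetical records, not real dataset content.
records = [
    {"subject": "Science", "class": "Class 9", "question": "What is force?"},
    {"subject": "Science", "class": "Class 12", "question": "Define heredity."},
    {"subject": "English", "class": "Class 10", "question": "Explain the idiom."},
]

lower_secondary_science = select(records, "science", min_class=9, max_class=10)
print(len(lower_secondary_science))
```

Matching the subject case-insensitively and parsing the number out of the class label keeps the filter tolerant of small formatting differences between rows.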
## Supported Tasks
- **Question Answering** (abstractive / extractive from given context)
- **Reading Comprehension**
- **Vocabulary & Language Understanding** (definitions, antonyms, idioms)
- **Educational NLP** (school-level question and explanation generation)
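As an illustration of the question-answering task, the sketch below turns one record into a context-grounded prompt and scores a prediction against the reference answer. The prompt wording, the helper names `build_prompt`, `tokenize`, and `token_f1`, and the sample record are hypothetical. The metric is a simplified SQuAD-style token-overlap F1 (lowercasing and punctuation removal only); this card does not prescribe an evaluation metric.

```python
import re
from collections import Counter
from typing import Dict, List


def build_prompt(record: Dict[str, str]) -> str:
    """Format a record as a passage-plus-question prompt (wording is an assumption)."""
    return (
        "Answer the question using the passage.\n\n"
        f"Passage: {record['context']}\n"
        f"Question: {record['question']}\n"
        "Answer:"
    )


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and the reference answer."""
    pred = tokenize(prediction)
    ref = tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical record, not real dataset content.
record = {
    "context": "The merchant was generous to everyone he met.",
    "question": "How did the merchant treat people?",
    "answer": "The merchant was generous",
}

print(build_prompt(record))
print(token_f1("The merchant was kind", record["answer"]))
```

Token-overlap F1 gives partial credit to abstractive answers that paraphrase the reference, which exact match would score as zero.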
## Dataset Structure
- **Size**: ~10K–20K examples
- **Core Columns**:
| Column | Type | Description |
|-----------|--------|-----------------------------------------------------------------------------|
| `question`| string | The comprehension or knowledge question |
| `answer` | string | Reference answer (detailed or concise) |
| `context` | string | Relevant textbook passage or expanded explanation |
| `chapter` | string | Chapter identifier (e.g., prose chapter codes) |
| `class` | string | School level (e.g., Class 9, Class 10) |
| `subject` | string | Subject area (primarily English; possibly others in extensions) |
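The column table above can be checked programmatically. The following is a minimal sketch, assuming records arrive as plain dictionaries; `validate_record` and `load_gurukul` are hypothetical helper names, and the sample record is invented. Loading uses the `datasets` library's `load_dataset` with the repository id from this card. Split names are not stated here, so none is hard-coded.

```python
from typing import Any, Dict, List

# Column names taken from the table above; all are documented as strings.
CORE_COLUMNS = ("question", "answer", "context", "chapter", "class", "subject")


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    for column in CORE_COLUMNS:
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], str):
            problems.append(f"column {column} is not a string")
    return problems


def load_gurukul():
    """Load the dataset from the Hugging Face Hub (requires network access).

    Assumption: the repository loads with its default configuration.
    """
    from datasets import load_dataset  # lazy import; needs `pip install datasets`

    return load_dataset("LingoIITGN/Gurukul")


# Hypothetical record, invented for illustration (not real dataset content).
sample = {
    "question": "What is the antonym of 'generous'?",
    "answer": "Stingy",
    "context": "The merchant was generous to everyone he met.",
    "chapter": "prose-01",
    "class": "Class 9",
    "subject": "English",
}

print(validate_record(sample))
```

Running `validate_record` over every row after `load_gurukul()` is a quick way to confirm that a downloaded copy matches the documented schema before training.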
### Dataset Description
- **Curated by:** [Lingo Research Group at IIT Gandhinagar](https://lingo.iitgn.ac.in/)
- **License:** CC BY-ND 4.0 (as declared in the `license: cc-by-nd-4.0` metadata)
## Contact Us ✉️
[Lingo Research Group at IIT Gandhinagar, India](https://labs.iitgn.ac.in/lingo/) <br>
Email: [lingo@iitgn.ac.in](mailto:lingo@iitgn.ac.in)
Provider: LingoIITGN



