IndicParam
ModelScope community · Updated 2025-12-05 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/bharatgenai/IndicParam
## Dataset Card for IndicParam
### Dataset Summary
IndicParam is a graduate-level benchmark designed to evaluate Large Language Models (LLMs) on their understanding of **low- and extremely low-resource Indic languages**.
The dataset contains **13,207 multiple-choice questions (MCQs)** in total, covering **11 Indic languages** and a separate **Sanskrit–English code-mixed** set. All questions are sourced from official UGC-NET language question papers and answer keys.
### Supported Tasks
- **`multiple-choice-qa`**: Evaluate LLMs on graduate-level multiple-choice question answering across low-resource Indic languages.
- **`language-understanding-evaluation`**: Assess language-specific competence (morphology, syntax, semantics, discourse) using explicitly labeled questions.
- **`general-knowledge-evaluation`**: Measure factual and domain knowledge in literature, culture, history, and related disciplines.
- **`question-type-evaluation`**: Analyze performance across MCQ formats (Normal MCQ, Assertion–Reason, List Matching, etc.).
### Languages
IndicParam covers the following languages and one code-mixed variant:
- **Low-resource (4)**: Nepali, Gujarati, Marathi, Odia
- **Extremely low-resource (7)**: Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, Konkani
- **Code-mixed**: Sanskrit–English (Sans-Eng)
Scripts:
- **Devanagari**: Nepali, Marathi, Maithili, Konkani, Bodo, Dogri, Rajasthani, Sanskrit
- **Gujarati**: Gujarati
- **Odia (Orya)**: Odia
- **Ol Chiki (Olck)**: Santali
All questions are presented in the **native script** of the target language (or in code-mixed form for Sans-Eng).
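As a quick sanity check on the script claim above, the following is a minimal sketch that reports which of the four listed scripts dominates a string. The Unicode block ranges come from the Unicode Standard; the function name `dominant_script` is hypothetical, and only the main block of each script is covered (for example, Vedic extensions used in some Sanskrit text are not).

```python
from typing import Optional

# Main Unicode blocks for the four scripts listed in this card.
# Extended blocks (e.g. Devanagari Extended, Vedic Extensions) are not covered.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Odia": (0x0B00, 0x0B7F),
    "Ol Chiki": (0x1C50, 0x1C7F),
}


def dominant_script(text: str) -> Optional[str]:
    """Return the listed script with the most characters in `text`, or None."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

For the code-mixed Sans-Eng set, Latin characters are ignored by this sketch, so a mostly English question with a few Devanagari terms would still be reported as Devanagari.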
---
## Dataset Structure
### Data Instances
Each instance is a single MCQ from a UGC-NET language paper. An example (Maithili):
```json
{
"unique_question_id": "782166eef1efd963b5db0e8aa42b9a6e",
"subject": "Maithili",
"exam_name": "Question Papers of NET Dec. 2012 Maithili Paper III hindi",
"paper_number": "Question Papers of NET Dec. 2012 Maithili Paper III hindi",
"question_number": 1,
"question_text": "मिथिलाभाषा रामायण' में सीताराम-विवाहक वर्णन भेल अछि -",
"option_a": "बालकाण्डमें",
"option_b": "अयोध्याकाण्डमे",
"option_c": "सुन्दरकाण्डमे",
"option_d": "उत्तरकाण्डमे",
"correct_answer": "a",
"question_type": "Normal MCQ"
}
```
Questions span:
- **Language Understanding (LU)**: linguistics and grammar (phonology, morphology, syntax, semantics, discourse).
- **General Knowledge (GK)**: literature, authors, works, cultural concepts, history, and related factual content.
### Data Fields
- **`unique_question_id`** *(string)*: Unique identifier for each question.
- **`subject`** *(string)*: Name of the language / subject (e.g., `Nepali`, `Maithili`, `Sanskrit`).
- **`exam_name`** *(string)*: Full exam name (UGC-NET session and subject).
- **`paper_number`** *(string)*: Paper identifier as given by UGC-NET.
- **`question_number`** *(int)*: Question index within the original paper.
- **`question_text`** *(string)*: Question text in the target language (or Sanskrit–English code-mixed).
- **`option_a`**, **`option_b`**, **`option_c`**, **`option_d`** *(string)*: Four answer options.
- **`correct_answer`** *(string)*: Correct option label (`a`, `b`, `c`, or `d`).
- **`question_type`** *(string)*: Question format, one of:
  - `Normal MCQ`
  - `Assertion and Reason`
  - `List Matching`
  - `Fill in the blanks`
  - `Identify incorrect statement`
  - `Ordering`
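The field list above can be mirrored in a small typed record. This is a sketch only: the class name `IndicParamItem` and the helper `parse_item` are hypothetical, and the validation assumes every row uses exactly the lowercase answer labels and the six question-type strings listed in this card.

```python
from dataclasses import dataclass

# Allowed values copied from the field descriptions in this card.
ANSWER_LABELS = {"a", "b", "c", "d"}
QUESTION_TYPES = {
    "Normal MCQ",
    "Assertion and Reason",
    "List Matching",
    "Fill in the blanks",
    "Identify incorrect statement",
    "Ordering",
}


@dataclass(frozen=True)
class IndicParamItem:  # hypothetical name, not part of the dataset
    unique_question_id: str
    subject: str
    exam_name: str
    paper_number: str
    question_number: int
    question_text: str
    option_a: str
    option_b: str
    option_c: str
    option_d: str
    correct_answer: str
    question_type: str

    def __post_init__(self) -> None:
        # Reject rows whose labels fall outside the documented value sets.
        if self.correct_answer not in ANSWER_LABELS:
            raise ValueError(f"unexpected correct_answer: {self.correct_answer!r}")
        if self.question_type not in QUESTION_TYPES:
            raise ValueError(f"unexpected question_type: {self.question_type!r}")


def parse_item(row: dict) -> IndicParamItem:
    """Build a validated item from a raw row; extra or missing keys raise TypeError."""
    return IndicParamItem(**row)
```

Validating at load time makes schema drift (a renamed column, an unexpected label) fail loudly before any model is evaluated.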
### Data Splits
IndicParam is provided as a **single evaluation split**:
| Split | Number of Questions |
| ----- | ------------------- |
| test | 13,207 |
All rows are intended for **evaluation only** (no dedicated training/validation splits).
---
## Language Distribution
The benchmark follows the distribution reported in the IndicParam paper:
| Language | #Questions | Script | Code |
| ------------- | ---------- | -------- | ---- |
| Nepali | 1,038 | Devanagari | npi |
| Marathi | 1,245 | Devanagari | mar |
| Gujarati | 1,044 | Gujarati | guj |
| Odia | 577 | Orya | ory |
| Maithili | 1,286 | Devanagari | mai |
| Konkani | 1,328 | Devanagari | gom |
| Santali | 873 | Olck | sat |
| Bodo | 1,313 | Devanagari | brx |
| Dogri | 1,027 | Devanagari | doi |
| Rajasthani | 1,190 | Devanagari | – |
| Sanskrit | 1,315 | Devanagari | san |
| Sans-Eng | 971 | (code-mixed) | – |
| **Total** | **13,207** | | |
Each language’s questions are drawn from its respective UGC-NET language papers.
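The per-language counts can be checked against the stated total with a few lines of arithmetic. The sketch below copies the numbers from the table and groups them by the resource tiers listed under Languages; the variable names are illustrative.

```python
# Question counts copied from the language distribution table.
QUESTION_COUNTS = {
    "Nepali": 1038, "Marathi": 1245, "Gujarati": 1044, "Odia": 577,
    "Maithili": 1286, "Konkani": 1328, "Santali": 873, "Bodo": 1313,
    "Dogri": 1027, "Rajasthani": 1190, "Sanskrit": 1315, "Sans-Eng": 971,
}

# Resource tiers as listed in the Languages section.
TIERS = {
    "low-resource": ["Nepali", "Gujarati", "Marathi", "Odia"],
    "extremely low-resource": [
        "Dogri", "Maithili", "Rajasthani", "Sanskrit", "Bodo", "Santali", "Konkani",
    ],
    "code-mixed": ["Sans-Eng"],
}

total = sum(QUESTION_COUNTS.values())
tier_totals = {
    tier: sum(QUESTION_COUNTS[lang] for lang in langs)
    for tier, langs in TIERS.items()
}
# Fraction of the benchmark contributed by each language.
shares = {lang: n / total for lang, n in QUESTION_COUNTS.items()}

print(total)        # 13207
print(tier_totals)  # {'low-resource': 3904, 'extremely low-resource': 8332, 'code-mixed': 971}
```

The total of 13,207 therefore includes the 971 Sans-Eng questions. Because Odia contributes roughly half as many questions as most other languages, overall accuracy is weighted toward the larger subsets, which is one reason to report per-language results as well.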
---
## Dataset Creation
### Source and Collection
- **Source**: Official UGC-NET language question papers and answer keys, downloaded from the UGC-NET/NTA website.
- **Scope**: Multiple exam sessions and years, covering language/literature and linguistics papers for each of the 11 languages plus the Sanskrit–English code-mixed set.
- **Extraction**:
  - Machine-readable PDFs are parsed directly.
  - Non-selectable PDFs are processed using OCR.
  - All text is normalized while preserving the original script and content.
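The card does not specify the normalization procedure. As an illustration only, one plausible script-preserving cleanup is Unicode NFC normalization plus whitespace collapsing; the function name `normalize_text` and the choice of steps below are assumptions, not the authors' pipeline.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Assumed cleanup: NFC, strip extraction debris, collapse whitespace."""
    # NFC yields a canonically equivalent string, so script and content are unchanged.
    text = unicodedata.normalize("NFC", text)
    # Remove byte-order marks and soft hyphens, which are common PDF/OCR debris.
    # ZWJ (U+200D) and ZWNJ (U+200C) are kept: they affect Indic conjunct rendering.
    text = text.replace("\ufeff", "").replace("\u00ad", "")
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Compatibility forms such as NFKC are deliberately avoided in this sketch because they can rewrite characters more aggressively than a content-preserving cleanup should.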
### Annotation
In addition to the raw MCQs, each question is annotated by question type (described in detail in the paper):
- **Question type**: one of `Normal MCQ`, `Assertion and Reason`, `List Matching`, `Fill in the blanks`, `Identify incorrect statement`, or `Ordering`, stored in the `question_type` field.
These annotations support fine-grained analysis of model behavior across **knowledge vs. language ability** and **question format**.
---
## Considerations for Using the Data
### Social Impact
IndicParam is designed to:
- Enable rigorous evaluation of LLMs on **under-represented Indic languages** with substantial speaker populations but very limited web presence.
- Encourage **culturally grounded** AI systems that perform robustly on Indic scripts and linguistic phenomena.
- Highlight the performance gaps between high-resource and low-/extremely low-resource Indic languages, informing future pretraining and data collection efforts.
Users should be aware that the content is drawn from **academic examinations**, and may over-represent formal, exam-style language relative to everyday usage.
### Evaluation Guidelines
To align with the paper and allow consistent comparison:
1. **Task**: Treat each instance as a multiple-choice QA item with four options.
2. **Input format**: Present `question_text` plus the four options (`A–D`) to the model.
3. **Required output**: A single option label (`A`, `B`, `C`, or `D`), with no explanation.
4. **Decoding**: Use **greedy decoding / temperature = 0 / `do_sample = False`** to ensure deterministic outputs.
5. **Metric**: Compute **accuracy** based on exact match between predicted option and `correct_answer` (case-insensitive after mapping to A–D).
6. **Analysis**:
   - Report **overall accuracy**.
   - Break down results **per language**.
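The guidelines above can be sketched as a small scoring harness. The prompt wording, the strict single-letter parsing, and the function names (`build_prompt`, `parse_prediction`, `accuracy_by`) are assumptions for illustration; the paper's exact prompt template is not reproduced here, and the model call itself (greedy decoding) is left out.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

LABELS = ("A", "B", "C", "D")


def build_prompt(item: dict) -> str:
    """Format question_text plus options A to D; the wording is an assumption."""
    lines = [item["question_text"]]
    for label in LABELS:
        lines.append(f"{label}. {item['option_' + label.lower()]}")
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


def parse_prediction(output: str) -> Optional[str]:
    """Return the label if the output is exactly one option letter, else None."""
    candidate = output.strip().upper()
    return candidate if candidate in LABELS else None


def accuracy_by(
    items: List[dict], outputs: List[str], key: str = "subject"
) -> Tuple[float, Dict[str, float]]:
    """Return overall accuracy and accuracy grouped by `key` (language by default)."""
    if len(items) != len(outputs):
        raise ValueError("items and outputs must have the same length")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item, output in zip(items, outputs):
        gold = item["correct_answer"].strip().upper()  # map a-d to A-D
        total[item[key]] += 1
        if parse_prediction(output) == gold:
            correct[item[key]] += 1
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, {group: correct[group] / total[group] for group in total}
```

Outputs that contain anything beyond a single label are scored as incorrect in this sketch, which matches the "no explanation" requirement but is stricter than some evaluation scripts. Passing `key="question_type"` gives the breakdown by question format instead of by language.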
---
## Additional Information
### Citation Information
If you use IndicParam in your research, please cite:
```bibtex
}
```
For related Hindi-only evaluation and question-type taxonomy, please also see and cite [ParamBench](https://huggingface.co/datasets/bharatgenai/ParamBench).
### License
IndicParam is released for **non-commercial research and evaluation**.
### Acknowledgments
IndicParam was curated and annotated by the authors and native-speaker annotators as described in the paper.
We acknowledge UGC-NET/NTA for making examination materials publicly accessible, and the broader Indic NLP community for foundational tools and resources.
Provider: maas
Created: 2025-11-27



