
IndicParam

ModelScope community, updated 2025-12-05, indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/bharatgenai/IndicParam
Resource description:
## Dataset Card for IndicParam

### Dataset Summary

IndicParam is a graduate-level benchmark designed to evaluate Large Language Models (LLMs) on their understanding of **low- and extremely low-resource Indic languages**. The dataset contains **13,207 multiple-choice questions (MCQs)** covering **11 Indic languages** and a separate **Sanskrit–English code-mixed** set, all sourced from official UGC-NET language question papers and answer keys.

### Supported Tasks

- **`multiple-choice-qa`**: Evaluate LLMs on graduate-level multiple-choice question answering across low-resource Indic languages.
- **`language-understanding-evaluation`**: Assess language-specific competence (morphology, syntax, semantics, discourse) using explicitly labeled questions.
- **`general-knowledge-evaluation`**: Measure factual and domain knowledge in literature, culture, history, and related disciplines.
- **`question-type-evaluation`**: Analyze performance across MCQ formats (Normal MCQ, Assertion–Reason, List Matching, etc.).

### Languages

IndicParam covers the following languages and one code-mixed variant:

- **Low-resource (4)**: Nepali, Gujarati, Marathi, Odia
- **Extremely low-resource (7)**: Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, Konkani
- **Code-mixed**: Sanskrit–English (Sans-Eng)

Scripts:

- **Devanagari**: Nepali, Marathi, Maithili, Konkani, Bodo, Dogri, Rajasthani, Sanskrit
- **Gujarati**: Gujarati
- **Odia (Orya)**: Odia
- **Ol Chiki (Olck)**: Santali

All questions are presented in the **native script** of the target language (or in code-mixed form for Sans-Eng).

---

## Dataset Structure

### Data Instances

Each instance is a single MCQ from a UGC-NET language paper. An example (Maithili):

```json
{
  "unique_question_id": "782166eef1efd963b5db0e8aa42b9a6e",
  "subject": "Maithili",
  "exam_name": "Question Papers of NET Dec. 2012 Maithili Paper III hindi",
  "paper_number": "Question Papers of NET Dec. 2012 Maithili Paper III hindi",
  "question_number": 1,
  "question_text": "मिथिलाभाषा रामायण' में सीताराम-विवाहक वर्णन भेल अछि -",
  "option_a": "बालकाण्डमें",
  "option_b": "अयोध्याकाण्डमे",
  "option_c": "सुन्दरकाण्डमे",
  "option_d": "उत्तरकाण्डमे",
  "correct_answer": "a",
  "question_type": "Normal MCQ"
}
```

Questions span:

- **Language Understanding (LU)**: linguistics and grammar (phonology, morphology, syntax, semantics, discourse).
- **General Knowledge (GK)**: literature, authors, works, cultural concepts, history, and related factual content.

### Data Fields

- **`unique_question_id`** *(string)*: Unique identifier for each question.
- **`subject`** *(string)*: Name of the language / subject (e.g., `Nepali`, `Maithili`, `Sanskrit`).
- **`exam_name`** *(string)*: Full exam name (UGC-NET session and subject).
- **`paper_number`** *(string)*: Paper identifier as given by UGC-NET.
- **`question_number`** *(int)*: Question index within the original paper.
- **`question_text`** *(string)*: Question text in the target language (or Sanskrit–English code-mixed).
- **`option_a`**, **`option_b`**, **`option_c`**, **`option_d`** *(string)*: Four answer options.
- **`correct_answer`** *(string)*: Correct option label (`a`, `b`, `c`, or `d`).
- **`question_type`** *(string)*: Question format, one of:
  - `Normal MCQ`
  - `Assertion and Reason`
  - `List Matching`
  - `Fill in the blanks`
  - `Identify incorrect statement`
  - `Ordering`

### Data Splits

IndicParam is provided as a **single evaluation split**:

| Split | Number of Questions |
| ----- | ------------------- |
| test  | 13,207              |

All rows are intended for **evaluation only** (no dedicated training/validation splits).
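As an illustration of the schema described under Data Fields, the following minimal sketch models one row as a typed record and checks the two constrained fields. The class name `IndicParamItem` and the helper `options()` are hypothetical names chosen for this example and are not part of the dataset or of any library; the record is the Maithili example from this card.

```python
from dataclasses import dataclass
from typing import Dict

# Allowed values, copied from the Data Fields section of this card.
ANSWER_LABELS = {"a", "b", "c", "d"}
QUESTION_TYPES = {
    "Normal MCQ",
    "Assertion and Reason",
    "List Matching",
    "Fill in the blanks",
    "Identify incorrect statement",
    "Ordering",
}


@dataclass(frozen=True)
class IndicParamItem:
    """Hypothetical typed view of one IndicParam row (illustration only)."""

    unique_question_id: str
    subject: str
    exam_name: str
    paper_number: str
    question_number: int
    question_text: str
    option_a: str
    option_b: str
    option_c: str
    option_d: str
    correct_answer: str
    question_type: str

    def __post_init__(self) -> None:
        # Reject rows whose constrained fields fall outside the documented values.
        if self.correct_answer not in ANSWER_LABELS:
            raise ValueError(f"unexpected correct_answer: {self.correct_answer!r}")
        if self.question_type not in QUESTION_TYPES:
            raise ValueError(f"unexpected question_type: {self.question_type!r}")

    def options(self) -> Dict[str, str]:
        """Return the four options keyed by upper-case label."""
        return {
            "A": self.option_a,
            "B": self.option_b,
            "C": self.option_c,
            "D": self.option_d,
        }


# The Maithili example instance shown in this card.
record = {
    "unique_question_id": "782166eef1efd963b5db0e8aa42b9a6e",
    "subject": "Maithili",
    "exam_name": "Question Papers of NET Dec. 2012 Maithili Paper III hindi",
    "paper_number": "Question Papers of NET Dec. 2012 Maithili Paper III hindi",
    "question_number": 1,
    "question_text": "मिथिलाभाषा रामायण' में सीताराम-विवाहक वर्णन भेल अछि -",
    "option_a": "बालकाण्डमें",
    "option_b": "अयोध्याकाण्डमे",
    "option_c": "सुन्दरकाण्डमे",
    "option_d": "उत्तरकाण्डमे",
    "correct_answer": "a",
    "question_type": "Normal MCQ",
}

example = IndicParamItem(**record)
gold_option = example.options()[example.correct_answer.upper()]
```

Validating on construction is a design choice for this sketch: a malformed row fails loudly at load time rather than silently skewing accuracy later.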
---

## Language Distribution

The benchmark follows the distribution reported in the IndicParam paper:

| Language      | #Questions | Script       | Code |
| ------------- | ---------- | ------------ | ---- |
| Nepali        | 1,038      | Devanagari   | npi  |
| Marathi       | 1,245      | Devanagari   | mar  |
| Gujarati      | 1,044      | Gujarati     | guj  |
| Odia          | 577        | Orya         | ory  |
| Maithili      | 1,286      | Devanagari   | mai  |
| Konkani       | 1,328      | Devanagari   | gom  |
| Santali       | 873        | Olck         | sat  |
| Bodo          | 1,313      | Devanagari   | brx  |
| Dogri         | 1,027      | Devanagari   | doi  |
| Rajasthani    | 1,190      | Devanagari   | –    |
| Sanskrit      | 1,315      | Devanagari   | san  |
| Sans-Eng      | 971        | (code-mixed) | –    |
| **Total**     | **13,207** |              |      |

Each language’s questions are drawn from its respective UGC-NET language papers.

---

## Dataset Creation

### Source and Collection

- **Source**: Official UGC-NET language question papers and answer keys, downloaded from the UGC-NET/NTA website.
- **Scope**: Multiple exam sessions and years, covering language/literature and linguistics papers for each of the 11 languages plus the Sanskrit–English code-mixed set.
- **Extraction**:
  - Machine-readable PDFs are parsed directly.
  - Non-selectable PDFs are processed using OCR.
  - All text is normalized while preserving the original script and content.

### Annotation

In addition to the raw MCQs, each question is annotated by question type (described in detail in the paper):

- **Question type**:
  - Multiple-choice, Assertion–Reason, List Matching, Fill in the blanks, Identify incorrect statement, Ordering.

These annotations support fine-grained analysis of model behavior across **knowledge vs. language ability** and **question format**.

---

## Considerations for Using the Data

### Social Impact

IndicParam is designed to:

- Enable rigorous evaluation of LLMs on **under-represented Indic languages** with substantial speaker populations but very limited web presence.
- Encourage **culturally grounded** AI systems that perform robustly on Indic scripts and linguistic phenomena.
- Highlight the performance gaps between high-resource and low-/extremely low-resource Indic languages, informing future pretraining and data collection efforts.

Users should be aware that the content is drawn from **academic examinations**, and may over-represent formal, exam-style language relative to everyday usage.

### Evaluation Guidelines

To align with the paper and allow consistent comparison:

1. **Task**: Treat each instance as a multiple-choice QA item with four options.
2. **Input format**: Present `question_text` plus the four options (`A–D`) to the model.
3. **Required output**: A single option label (`A`, `B`, `C`, or `D`), with no explanation.
4. **Decoding**: Use **greedy decoding / temperature = 0 / `do_sample = False`** to ensure deterministic outputs.
5. **Metric**: Compute **accuracy** based on exact match between predicted option and `correct_answer` (case-insensitive after mapping to A–D).
6. **Analysis**:
   - Report **overall accuracy**.
   - Break down results **per language**.

---

## Additional Information

### Citation Information

If you use IndicParam in your research, please cite:

```bibtex
}
```

For related Hindi-only evaluation and question-type taxonomy, please also see and cite [ParamBench](https://huggingface.co/datasets/bharatgenai/ParamBench).

### License

IndicParam is released for **non-commercial research and evaluation**.

### Acknowledgments

IndicParam was curated and annotated by the authors and native-speaker annotators as described in the paper. We acknowledge UGC-NET/NTA for making examination materials publicly accessible, and the broader Indic NLP community for foundational tools and resources.
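The input format, output label, metric, and per-language breakdown from the Evaluation Guidelines in this card can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's evaluation harness: the prompt layout in `build_prompt` is an assumption (the card does not give the exact prompt), the function names are hypothetical, and the toy rows use placeholder text rather than real dataset content. Model inference itself (greedy decoding) is left out.

```python
from collections import defaultdict
from typing import Dict, List, Optional

LABELS = ("A", "B", "C", "D")


def build_prompt(item: Dict[str, str]) -> str:
    """Render question_text plus options A-D (layout is an assumption)."""
    lines = [item["question_text"]]
    for label in LABELS:
        lines.append(f"{label}. {item['option_' + label.lower()]}")
    lines.append("Answer:")
    return "\n".join(lines)


def normalize_label(text: str) -> Optional[str]:
    """Map a raw model output to A-D, or None if it is not a bare label."""
    label = text.strip().upper()
    return label if label in LABELS else None


def accuracy_report(items: List[Dict[str, str]], predictions: List[str]) -> dict:
    """Exact-match accuracy overall and per language (the `subject` field)."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item, prediction in zip(items, predictions):
        language = item["subject"]
        gold = item["correct_answer"].strip().upper()
        total[language] += 1
        # An unparseable prediction normalizes to None and counts as wrong.
        if normalize_label(prediction) == gold:
            correct[language] += 1
    n = sum(total.values())
    return {
        "overall": sum(correct.values()) / n if n else 0.0,
        "per_language": {lang: correct[lang] / total[lang] for lang in total},
    }


def _toy(subject: str, answer: str) -> Dict[str, str]:
    # Placeholder row for illustration only; not real dataset content.
    return {
        "subject": subject,
        "question_text": "Q",
        "option_a": "w",
        "option_b": "x",
        "option_c": "y",
        "option_d": "z",
        "correct_answer": answer,
    }


toy_items = [_toy("Maithili", "a"), _toy("Maithili", "c"), _toy("Nepali", "b")]
toy_predictions = ["A", "b", " B "]  # stand-ins for model outputs
report = accuracy_report(toy_items, toy_predictions)
```

Treating any output that is not a bare label as incorrect follows the "single option label, with no explanation" requirement; a more lenient parser would change the reported numbers and should be stated when comparing results.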

Provider:
maas
Created:
2025-11-27