IndicParam

Name: IndicParam
Creator: maas
Published: 2025-12-05 16:57:30
License: No description available

ModelScope community; updated 2025-12-05; indexed 2025-12-06

Download link:

https://modelscope.cn/datasets/bharatgenai/IndicParam


Description:

## Dataset Card for IndicParam

### Dataset Summary

IndicParam is a graduate-level benchmark designed to evaluate Large Language Models (LLMs) on their understanding of **low- and extremely low-resource Indic languages**. The dataset contains **13,207 multiple-choice questions (MCQs)** across **11 Indic languages**, plus a separate **Sanskrit–English code-mixed** set, all sourced from official UGC-NET language question papers and answer keys.

### Supported Tasks

- **`multiple-choice-qa`**: Evaluate LLMs on graduate-level multiple-choice question answering across low-resource Indic languages.
- **`language-understanding-evaluation`**: Assess language-specific competence (morphology, syntax, semantics, discourse) using explicitly labeled questions.
- **`general-knowledge-evaluation`**: Measure factual and domain knowledge in literature, culture, history, and related disciplines.
- **`question-type-evaluation`**: Analyze performance across MCQ formats (Normal MCQ, Assertion–Reason, List Matching, etc.).

### Languages

IndicParam covers the following languages and one code-mixed variant:

- **Low-resource (4)**: Nepali, Gujarati, Marathi, Odia
- **Extremely low-resource (7)**: Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, Konkani
- **Code-mixed**: Sanskrit–English (Sans-Eng)

Scripts:

- **Devanagari**: Nepali, Marathi, Maithili, Konkani, Bodo, Dogri, Rajasthani, Sanskrit
- **Gujarati**: Gujarati
- **Odia (Orya)**: Odia
- **Ol Chiki (Olck)**: Santali

All questions are presented in the **native script** of the target language (or in code-mixed form for Sans-Eng).

---

## Dataset Structure

### Data Instances

Each instance is a single MCQ from a UGC-NET language paper. An example (Maithili):

```json
{
  "unique_question_id": "782166eef1efd963b5db0e8aa42b9a6e",
  "subject": "Maithili",
  "exam_name": "Question Papers of NET Dec. 2012 Maithili Paper III hindi",
  "paper_number": "Question Papers of NET Dec. 2012 Maithili Paper III hindi",
  "question_number": 1,
  "question_text": "मिथिलाभाषा रामायण' में सीताराम-विवाहक वर्णन भेल अछि -",
  "option_a": "बालकाण्डमें",
  "option_b": "अयोध्याकाण्डमे",
  "option_c": "सुन्दरकाण्डमे",
  "option_d": "उत्तरकाण्डमे",
  "correct_answer": "a",
  "question_type": "Normal MCQ"
}
```

Questions span:

- **Language Understanding (LU)**: linguistics and grammar (phonology, morphology, syntax, semantics, discourse).
- **General Knowledge (GK)**: literature, authors, works, cultural concepts, history, and related factual content.

### Data Fields

- **`unique_question_id`** *(string)*: Unique identifier for each question.
- **`subject`** *(string)*: Name of the language / subject (e.g., `Nepali`, `Maithili`, `Sanskrit`).
- **`exam_name`** *(string)*: Full exam name (UGC-NET session and subject).
- **`paper_number`** *(string)*: Paper identifier as given by UGC-NET.
- **`question_number`** *(int)*: Question index within the original paper.
- **`question_text`** *(string)*: Question text in the target language (or Sanskrit–English code-mixed).
- **`option_a`**, **`option_b`**, **`option_c`**, **`option_d`** *(string)*: Four answer options.
- **`correct_answer`** *(string)*: Correct option label (`a`, `b`, `c`, or `d`).
- **`question_type`** *(string)*: Question format, one of:
  - `Normal MCQ`
  - `Assertion and Reason`
  - `List Matching`
  - `Fill in the blanks`
  - `Identify incorrect statement`
  - `Ordering`

### Data Splits

IndicParam is provided as a **single evaluation split**:

| Split | Number of Questions |
| ----- | ------------------- |
| test  | 13,207              |

All rows are intended for **evaluation only** (no dedicated training/validation splits).
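
To make the field layout above concrete, here is a minimal sketch of how one row might be turned into a four-option prompt. The helper names (`build_prompt`, `gold_label`) and the prompt wording are assumptions for illustration only; the card does not prescribe an exact prompt template.

```python
from typing import Mapping

# Field names are taken from the Data Fields list; the labels A-D mirror
# the lowercase a-d values stored in `correct_answer`.
OPTION_KEYS = ("option_a", "option_b", "option_c", "option_d")
LABELS = ("A", "B", "C", "D")


def build_prompt(row: Mapping[str, object]) -> str:
    """Render one row as a four-option MCQ prompt (hypothetical layout)."""
    lines = [str(row["question_text"])]
    for label, key in zip(LABELS, OPTION_KEYS):
        lines.append(f"{label}. {row[key]}")
    # Assumed instruction line; the wording is not specified by the card.
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


def gold_label(row: Mapping[str, object]) -> str:
    """Map the stored lowercase answer key to its uppercase label."""
    return str(row["correct_answer"]).strip().upper()


# The Maithili instance shown in the card.
example = {
    "unique_question_id": "782166eef1efd963b5db0e8aa42b9a6e",
    "subject": "Maithili",
    "question_number": 1,
    "question_text": "मिथिलाभाषा रामायण' में सीताराम-विवाहक वर्णन भेल अछि -",
    "option_a": "बालकाण्डमें",
    "option_b": "अयोध्याकाण्डमे",
    "option_c": "सुन्दरकाण्डमे",
    "option_d": "उत्तरकाण्डमे",
    "correct_answer": "a",
    "question_type": "Normal MCQ",
}

prompt = build_prompt(example)
print(prompt)
```

Keeping the option order fixed (`option_a` through `option_d`) means the stored answer key can be compared to the model's letter without any remapping.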
---

## Language Distribution

The benchmark follows the distribution reported in the IndicParam paper:

| Language   | #Questions | Script       | Code |
| ---------- | ---------- | ------------ | ---- |
| Nepali     | 1,038      | Devanagari   | npi  |
| Marathi    | 1,245      | Devanagari   | mar  |
| Gujarati   | 1,044      | Gujarati     | guj  |
| Odia       | 577        | Orya         | ory  |
| Maithili   | 1,286      | Devanagari   | mai  |
| Konkani    | 1,328      | Devanagari   | gom  |
| Santali    | 873        | Olck         | sat  |
| Bodo       | 1,313      | Devanagari   | brx  |
| Dogri      | 1,027      | Devanagari   | doi  |
| Rajasthani | 1,190      | Devanagari   | –    |
| Sanskrit   | 1,315      | Devanagari   | san  |
| Sans-Eng   | 971        | (code-mixed) | –    |
| **Total**  | **13,207** |              |      |

Each language’s questions are drawn from its respective UGC-NET language papers.

---

## Dataset Creation

### Source and Collection

- **Source**: Official UGC-NET language question papers and answer keys, downloaded from the UGC-NET/NTA website.
- **Scope**: Multiple exam sessions and years, covering language/literature and linguistics papers for each of the 11 languages plus the Sanskrit–English code-mixed set.
- **Extraction**:
  - Machine-readable PDFs are parsed directly.
  - Non-selectable PDFs are processed using OCR.
  - All text is normalized while preserving the original script and content.

### Annotation

In addition to the raw MCQs, each question is annotated by question type (described in detail in the paper):

- **Question type**:
  - Multiple-choice, Assertion–Reason, List Matching, Fill in the blanks, Identify incorrect statement, Ordering.

These annotations support fine-grained analysis of model behavior across **knowledge vs. language ability** and **question format**.

---

## Considerations for Using the Data

### Social Impact

IndicParam is designed to:

- Enable rigorous evaluation of LLMs on **under-represented Indic languages** with substantial speaker populations but very limited web presence.
- Encourage **culturally grounded** AI systems that perform robustly on Indic scripts and linguistic phenomena.
- Highlight the performance gaps between high-resource and low-/extremely low-resource Indic languages, informing future pretraining and data collection efforts.

Users should be aware that the content is drawn from **academic examinations**, and may over-represent formal, exam-style language relative to everyday usage.

### Evaluation Guidelines

To align with the paper and allow consistent comparison:

1. **Task**: Treat each instance as a multiple-choice QA item with four options.
2. **Input format**: Present `question_text` plus the four options (`A–D`) to the model.
3. **Required output**: A single option label (`A`, `B`, `C`, or `D`), with no explanation.
4. **Decoding**: Use **greedy decoding / temperature = 0 / `do_sample = False`** to ensure deterministic outputs.
5. **Metric**: Compute **accuracy** based on exact match between predicted option and `correct_answer` (case-insensitive after mapping to A–D).
6. **Analysis**:
   - Report **overall accuracy**.
   - Break down results **per language**.

---

## Additional Information

### Citation Information

If you use IndicParam in your research, please cite:

```bibtex
}
```

For related Hindi-only evaluation and question-type taxonomy, please also see and cite [ParamBench](https://huggingface.co/datasets/bharatgenai/ParamBench).

### License

IndicParam is released for **non-commercial research and evaluation**.

### Acknowledgments

IndicParam was curated and annotated by the authors and native-speaker annotators as described in the paper. We acknowledge UGC-NET/NTA for making examination materials publicly accessible, and the broader Indic NLP community for foundational tools and resources.
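
As a rough illustration of the metric described under Evaluation Guidelines (case-insensitive exact match, reported overall and per language), the scoring step could be sketched as follows. The function names, the toy rows, and their question IDs are hypothetical; this is not the paper's evaluation script. Counting an unparseable model output as incorrect is also an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Optional

VALID_LABELS = {"A", "B", "C", "D"}


def normalize_label(text: str) -> Optional[str]:
    """Strip whitespace and uppercase; return None if not a single A-D label."""
    label = text.strip().upper()
    return label if label in VALID_LABELS else None


def score(rows: Iterable[Mapping[str, str]],
          predictions: Mapping[str, str]) -> Dict[str, float]:
    """Exact-match accuracy per `subject`, plus an "overall" entry.

    `predictions` maps `unique_question_id` to the raw model output.
    Missing or unparseable outputs count as incorrect (an assumption).
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for row in rows:
        lang = row["subject"]
        gold = normalize_label(row["correct_answer"])
        pred = normalize_label(predictions.get(row["unique_question_id"], ""))
        total[lang] += 1
        if pred is not None and pred == gold:
            correct[lang] += 1
    n = sum(total.values())
    report = {lang: correct[lang] / total[lang] for lang in total}
    report["overall"] = sum(correct.values()) / n if n else 0.0
    return report


# Toy data with made-up IDs, only to exercise the function.
rows = [
    {"unique_question_id": "q1", "subject": "Maithili", "correct_answer": "a"},
    {"unique_question_id": "q2", "subject": "Maithili", "correct_answer": "b"},
    {"unique_question_id": "q3", "subject": "Nepali", "correct_answer": "d"},
]
predictions = {"q1": "A", "q2": "c", "q3": " d "}

report = score(rows, predictions)
print(report)
```

The overall figure here is computed over all questions (micro-average), so languages with more questions weigh more than they would in a mean of per-language accuracies.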


Provider:

maas

Created:

2025-11-27
