ParamBench

Name: ParamBench
Creator: maas
Published: 2025-12-05 16:55:24
License: No description provided

ModelScope community; updated 2025-12-05; indexed 2025-11-08

Download link:

https://modelscope.cn/datasets/bharatgenai/ParamBench

Resource description:

# Dataset Card for ParamBench

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact](#social-impact)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Citation Information](#citation-information)
  - [Contributing](#contributing)

## Dataset Description

- **Homepage:** [ParamBench GitHub Repository](https://github.com/bharatgenai/ParamBench)
- **Repository:** [https://github.com/bharatgenai/ParamBench](https://github.com/bharatgenai/ParamBench)
- **Paper:** [ParamBench: A Graduate-Level Benchmark for Evaluating LLM Understanding on Indic Subjects](https://arxiv.org/abs/2508.16185)

### Dataset Summary

ParamBench is a graduate-level benchmark designed to evaluate how well Large Language Models (LLMs) understand Indic subjects. The dataset contains **17,275 multiple-choice questions** in **Hindi** across **21 diverse subjects**, drawn from Indian competitive examinations. It addresses a gap in evaluating LLMs on culturally and linguistically diverse content, focusing on India-specific knowledge domains that are underrepresented in existing benchmarks.
### Supported Tasks

This dataset supports the following tasks:

- `multiple-choice-qa`: Evaluating language models on multiple-choice question answering in Hindi
- `cultural-knowledge-evaluation`: Assessing LLM understanding of India-specific cultural and academic content
- `subject-wise-evaluation`: Fine-grained analysis of model performance across the 21 subjects
- `question-type-evaluation`: Detailed analysis of model performance across question types (Normal MCQ, Assertion and Reason, Blank-filling, etc.)

### Languages

The dataset is in **Hindi** (hi).

## Dataset Structure

### Data Instances

An example from the dataset:

```json
{
  "unique_question_id": "5d210d8db510451d6bf01b493a0f4430",
  "subject": "Anthropology",
  "exam_name": "Question Papers of NET Dec. 2012 Anthropology Paper III hindi",
  "paper_number": "Question Papers of NET Dec. 2012 Anthropology Paper III hindi",
  "question_number": 1,
  "question_text": "भारतीय मध्य पाषाणकाल निम्नलिखित में से किस स्थान पर सर्वोत्तम प्रदर्शित है ?",
  "option_a": "गिद्दालूर",
  "option_b": "नेवासा",
  "option_c": "टेरी समूह",
  "option_d": "बागोर",
  "correct_answer": "D",
  "question_type": "Normal MCQ"
}
```

### Data Fields

- `unique_question_id` (string): Unique identifier for each question
- `subject` (string): One of 21 subject categories
- `exam_name` (string): Name of the source examination
- `paper_number` (string): Paper/section identifier
- `question_number` (int): Question number in the original exam
- `question_text` (string): The question text in Hindi
- `option_a` (string): First option
- `option_b` (string): Second option
- `option_c` (string): Third option
- `option_d` (string): Fourth option
- `correct_answer` (string): Correct option (A, B, C, or D)
- `question_type` (string): Type of question (Normal MCQ, Assertion and Reason, etc.)

### Data Splits

The dataset contains a single `test` split with 17,275 questions.
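
As a rough illustration of how the fields listed above fit together, the sketch below renders one record as a plain-text prompt and looks up the text of the gold option. The prompt template and the helper names `build_prompt` and `gold_option_text` are assumptions made for this example; the card does not prescribe a prompt format.

```python
# Field names follow the Data Fields list; everything else is illustrative.
OPTION_KEYS = ("option_a", "option_b", "option_c", "option_d")
OPTION_LABELS = ("A", "B", "C", "D")


def build_prompt(record: dict) -> str:
    """Render one record as a multiple-choice prompt (hypothetical template)."""
    lines = [record["question_text"]]
    for label, key in zip(OPTION_LABELS, OPTION_KEYS):
        lines.append(f"{label}. {record[key]}")
    lines.append("Answer:")
    return "\n".join(lines)


def gold_option_text(record: dict) -> str:
    """Return the option text that `correct_answer` points to."""
    index = OPTION_LABELS.index(record["correct_answer"])
    return record[OPTION_KEYS[index]]


# The sample record shown under Data Instances, trimmed to the fields used here.
example = {
    "unique_question_id": "5d210d8db510451d6bf01b493a0f4430",
    "subject": "Anthropology",
    "question_text": "भारतीय मध्य पाषाणकाल निम्नलिखित में से किस स्थान पर सर्वोत्तम प्रदर्शित है ?",
    "option_a": "गिद्दालूर",
    "option_b": "नेवासा",
    "option_c": "टेरी समूह",
    "option_d": "बागोर",
    "correct_answer": "D",
    "question_type": "Normal MCQ",
}

if __name__ == "__main__":
    print(build_prompt(example))
    print("Gold option:", gold_option_text(example))
```

Keeping the option keys and labels in parallel tuples makes the mapping from `correct_answer` to an `option_*` field explicit, which is the only structural relationship between fields that the card implies.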

| Split | Number of Questions |
|-------|---------------------|
| test  | 17,275              |

## Subject Distribution

The 21 subjects covered in ParamBench (sorted by number of questions):

| Subject | Number of Questions | Percentage |
|---------|---------------------|------------|
| Education | 1,199 | 6.94% |
| Sociology | 1,191 | 6.89% |
| Anthropology | 1,139 | 6.60% |
| Psychology | 1,102 | 6.38% |
| Archaeology | 1,076 | 6.23% |
| History | 996 | 5.77% |
| Comparative Study of Religions | 954 | 5.52% |
| Law | 951 | 5.51% |
| Indian Culture | 927 | 5.37% |
| Economics | 919 | 5.32% |
| Current Affairs | 833 | 4.82% |
| Philosophy | 817 | 4.73% |
| Political Science | 774 | 4.48% |
| Drama and Theatre | 649 | 3.76% |
| Sanskrit | 639 | 3.70% |
| Karnataka Music | 617 | 3.57% |
| Tribal and Regional Language | 611 | 3.54% |
| Person on Instruments | 596 | 3.45% |
| Defence and Strategic Studies | 521 | 3.02% |
| Music | 433 | 2.51% |
| Yoga | 331 | 1.92% |
| **Total** | **17,275** | **100%** |

## Dataset Creation

## Considerations for Using the Data

### Social Impact

This dataset aims to:

- Promote development of culturally-aware AI systems
- Reduce bias in LLMs towards Western-centric knowledge
- Support research in multilingual and multicultural AI
- Enhance LLM capabilities for Indian languages and contexts

### Evaluation Guidelines

When evaluating models on ParamBench:

1. Use greedy decoding (temperature=0) for consistent results
2. Score responses by exact match against the correct option (A, B, C, or D)
3. Consider subject-wise performance for detailed analysis
4. Report both overall accuracy and per-subject breakdowns

## Additional Information

### Dataset Curators

Key contributors include:

- [Ayush Maheshwari](https://huggingface.co/acomquest)
- Kaushal Sharma
- [Vivek Patel](https://bento.me/vivek-patel)
- Aditya Maheshwari

We thank all data annotators involved in the dataset curation process.
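
The evaluation guidelines earlier in this card (exact match on A to D, overall accuracy plus a per-subject breakdown) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation script: the function names, the normalization rule (trim whitespace and surrounding `( ) . :`, then uppercase), and the toy records and predictions are all made up for this example.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

VALID_CHOICES = {"A", "B", "C", "D"}


def normalize_choice(response: str) -> Optional[str]:
    """Map a raw model response to A/B/C/D, or None if it is not an exact match.

    Assumption: only whitespace and surrounding "().:" characters are stripped,
    so "(d)" and "D." count, while "The answer is D" does not.
    """
    cleaned = response.strip().strip("().:").upper()
    return cleaned if cleaned in VALID_CHOICES else None


def score(records: List[dict], predictions: Dict[str, str]) -> Tuple[float, Dict[str, float]]:
    """Return (overall accuracy, per-subject accuracy) using exact match."""
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for record in records:
        raw = predictions.get(record["unique_question_id"], "")
        stats = per_subject[record["subject"]]
        stats[0] += int(normalize_choice(raw) == record["correct_answer"])
        stats[1] += 1
    correct = sum(c for c, _ in per_subject.values())
    total = sum(t for _, t in per_subject.values())
    overall = correct / total if total else 0.0
    return overall, {s: c / t for s, (c, t) in per_subject.items()}


# Toy data for illustration only; IDs and predictions are invented.
records = [
    {"unique_question_id": "q1", "subject": "Anthropology", "correct_answer": "D"},
    {"unique_question_id": "q2", "subject": "Anthropology", "correct_answer": "B"},
    {"unique_question_id": "q3", "subject": "Yoga", "correct_answer": "A"},
]
predictions = {"q1": "D", "q2": "(c)", "q3": " a. "}

if __name__ == "__main__":
    overall, by_subject = score(records, predictions)
    print(f"overall accuracy: {overall:.3f}")
    for subject, accuracy in sorted(by_subject.items()):
        print(f"{subject}: {accuracy:.3f}")
```

Missing predictions are treated as wrong here. A stricter or looser normalization rule would change the reported accuracy, so whichever rule is used should be stated alongside the results.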
### Citation Information

If you use ParamBench in your research, please cite:

```bibtex
@article{parambench2024,
  title={ParamBench: A Graduate-Level Benchmark for Evaluating LLM Understanding on Indic Subjects},
  author={[Author Names]},
  journal={arXiv preprint arXiv:2508.16185},
  year={2025},
  url={https://arxiv.org/abs/2508.16185}
}
```

### License

This dataset is released for **non-commercial research and evaluation**.

### Acknowledgments

We thank all the contributors who helped create this benchmark.

---

**Note**: This dataset is part of our ongoing effort to make AI systems more inclusive and culturally aware. We encourage researchers to use this benchmark to evaluate and improve their models' understanding of Indic content.


Provider: maas

Created: 2025-10-25
