Open-Style/Open-LLM-Benchmark
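Open-LLM-Benchmark Dataset Overview
Dataset Description
The Open-LLM-Leaderboard dataset tracks how various large language models (LLMs) perform on open-style questions, so that results reflect their true capability. It includes pre-generated model answers and evaluations produced by an LLM-based evaluator.
Dataset Structure
The dataset contains the following types of files:
- Model response files: contain the question, the gold answer, the model-generated answers, the evaluation results, and related information.
- Question files: contain the question, the answer options, filtering information, and related fields.
Examples
Model response file example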
    {
      "question": "What is the main function of photosynthetic cells within a plant?",
      "gold_answer": "to convert energy from sunlight into food energy",
      "os_answer": "The main function of photosynthetic cells ...",
      "os_eval": "Correct",
      "mcq_answer": "C",
      "mcq_eval": true,
      "dataset": "ARC"
    }
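The fields of a model response record lend themselves to a simple per-dataset comparison of open-style and multiple-choice accuracy. The following is a minimal sketch, not the benchmark's own scoring code. It assumes that `os_eval` equals `"Correct"` for a correct open-style answer (as in the example record) and that any other value counts as incorrect, and that `mcq_eval` is a boolean. The function name `accuracy_by_dataset` and the second sample record (including its `"Incorrect"` label) are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_dataset(records):
    """Aggregate open-style and multiple-choice accuracy per source dataset.

    Assumption: os_eval == "Correct" marks a correct open-style answer and
    every other value is treated as incorrect; mcq_eval is a boolean.
    """
    counts = defaultdict(lambda: {"n": 0, "os": 0, "mcq": 0})
    for rec in records:
        c = counts[rec["dataset"]]
        c["n"] += 1
        c["os"] += int(rec["os_eval"] == "Correct")
        c["mcq"] += int(bool(rec["mcq_eval"]))
    return {
        name: {
            "n": c["n"],
            "os_acc": c["os"] / c["n"],
            "mcq_acc": c["mcq"] / c["n"],
        }
        for name, c in counts.items()
    }


# The first record mirrors the example above; the second is hypothetical.
sample_records = [
    {
        "question": "What is the main function of photosynthetic cells within a plant?",
        "gold_answer": "to convert energy from sunlight into food energy",
        "os_answer": "The main function of photosynthetic cells ...",
        "os_eval": "Correct",
        "mcq_answer": "C",
        "mcq_eval": True,
        "dataset": "ARC",
    },
    {
        "question": "(hypothetical question)",
        "gold_answer": "(hypothetical gold answer)",
        "os_answer": "(hypothetical model answer)",
        "os_eval": "Incorrect",  # assumed label for a wrong answer
        "mcq_answer": "B",
        "mcq_eval": True,
        "dataset": "ARC",
    },
]

summary = accuracy_by_dataset(sample_records)
print(summary)
```

Keeping the two accuracies side by side makes it easy to spot questions a model answers correctly only when the options are shown.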
Question file example
    {
      "question": "An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?",
      "answerKey": "C",
      "options": [
        { "label": "A", "text": "Planetary density will decrease." },
        { "label": "B", "text": "Planetary years will become longer." },
        { "label": "C", "text": "Planetary days will become shorter." },
        { "label": "D", "text": "Planetary gravity will become stronger." }
      ],
      "first_filter": "YES",
      "passage": "-",
      "second_filter": 10,
      "dataset": "ARC"
    }
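A question record carries enough information to render the same item either as a multiple-choice prompt or as an open-style prompt, and to recover the gold answer text from `answerKey`. The sketch below illustrates this under stated assumptions: the helper names (`gold_text`, `mcq_prompt`, `open_prompt`) and the prompt layout are hypothetical and are not taken from the benchmark's code. The filter fields and `passage` are left out because their semantics are not described here.

```python
def gold_text(question):
    """Return the text of the option whose label matches answerKey."""
    for option in question["options"]:
        if option["label"] == question["answerKey"]:
            return option["text"]
    raise ValueError("answerKey does not match any option label")


def mcq_prompt(question):
    """Render the question followed by one labeled option per line."""
    lines = [question["question"]]
    lines += [f'{o["label"]}. {o["text"]}' for o in question["options"]]
    return "\n".join(lines)


def open_prompt(question):
    """Render the open-style variant: the question alone, without options."""
    return question["question"]


# Record copied from the question file example above (filter fields omitted).
question = {
    "question": (
        "An astronomer observes that a planet rotates faster after a "
        "meteorite impact. Which is the most likely effect of this "
        "increase in rotation?"
    ),
    "answerKey": "C",
    "options": [
        {"label": "A", "text": "Planetary density will decrease."},
        {"label": "B", "text": "Planetary years will become longer."},
        {"label": "C", "text": "Planetary days will become shorter."},
        {"label": "D", "text": "Planetary gravity will become stronger."},
    ],
    "dataset": "ARC",
}

print(mcq_prompt(question))
print(open_prompt(question))
print(gold_text(question))
```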
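Dataset Creation
Data Sources
The dataset contains questions drawn from several existing datasets, including MMLU, ARC, WinoGrande, PIQA, CommonsenseQA, Race, MedMCQA, and OpenbookQA, all of which are suitable for open-style answering.
Data Collection and Processing
Data collection involved compiling questions from the datasets listed above and generating answers with a variety of LLMs.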
Citation
    @article{myrzakhan2024openllm,
      title={Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena},
      author={Aidar Myrzakhan and Sondos Mahmoud Bsharat and Zhiqiang Shen},
      journal={arXiv preprint },
      year={2024},
    }



