simpleqa-verified

Name: simpleqa-verified
Creator: maas
Published: 2026-01-06 16:46:51
License: No description available

ModelScope community; updated 2026-01-06; indexed 2025-10-04

Download link:

https://modelscope.cn/datasets/google/simpleqa-verified


Resource description:

# SimpleQA Verified

#### A 1,000-prompt factuality benchmark from Google DeepMind and Google Research, designed to reliably evaluate LLM parametric knowledge.

▶ [SimpleQA Verified Leaderboard on Kaggle](https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified)\
▶ [Technical Report](https://arxiv.org/abs/2509.07968)\
▶ [Evaluation Starter Code](https://www.kaggle.com/code/nanliao7/simpleqa-verified-benchmark-starter-code)

## Benchmark

SimpleQA Verified is a 1,000-prompt benchmark for reliably evaluating Large Language Models (LLMs) on short-form factuality and parametric knowledge. The authors from Google DeepMind and Google Research build on [SimpleQA](https://openai.com/index/introducing-simpleqa/), originally designed by [Wei et al. (2024)](https://arxiv.org/abs/2411.04368) at OpenAI, and address limitations including noisy and incorrect labels, topical biases, and question redundancy. As in SimpleQA, model responses are graded by an autorater, here a version of GPT-4.1. The autorater prompt has been modified to force direct answers, prevent guessing in long responses, and improve the grading of numeric answer types.

SimpleQA Verified was created to provide the research community with a more precise instrument to track genuine progress in factuality, discourage overfitting to benchmark artifacts, and ultimately foster the development of more trustworthy AI systems.

## Dataset Description

This dataset is a collection of 1,000 examples crafted by humans for evaluating short-form parametric factuality in LLMs. Each example is composed of:

* An index (`original_index`) indicating which question in the original [SimpleQA](https://openai.com/index/introducing-simpleqa/) benchmark the example corresponds to
* A problem (`problem`), which is the prompt testing parametric knowledge, e.g. "*To whom did Mehbooba Mufti Sayed contest the 2019 Lok Sabha elections and lose?*"
* A gold answer (`answer`), which is used in conjunction with the evaluation prompt to judge the correctness of an LLM's response
* A topic (`topic`) and answer type (`answer_type`) classification, taken from the original [SimpleQA](https://openai.com/index/introducing-simpleqa/) paper and re-classified where appropriate
* Two additional metadata fields, `multi_step` and `requires_reasoning`, indicating whether the question requires information from multiple sources and whether it requires more complex reasoning
* Golden URLs (`urls`), a list of at least two URLs supporting the gold answer (`answer`), collected from SimpleQA human raters and adjusted by the authors of SimpleQA Verified

See the [Technical Report](https://arxiv.org/abs/2509.07968) for methodology details.

## Limitations

SimpleQA Verified is meant to be used without any tools (i.e. search or retrieval tools). With tools, the benchmark is trivial to solve, which defeats its purpose.

Questions, comments, or issues? Share your thoughts with us in the [discussion forum](https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified/discussion?sort=hotness).

## Evaluation Prompt

The evaluation prompt employed by SimpleQA Verified, which uses GPT-4.1 as the autorater model, can be found in the [starter notebook](https://www.kaggle.com/code/nanliao7/simpleqa-verified-benchmark-starter-code) on Kaggle.

## Citation

If you use this dataset in your research, please cite our technical report:

```
@misc{haas2025simpleqaverifiedreliablefactuality,
  title={SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge},
  author={Lukas Haas and Gal Yona and Giovanni D'Antonio and Sasha Goldshtein and Dipanjan Das},
  year={2025},
  eprint={2509.07968},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.07968},
}
```

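As a rough illustration of the per-example fields listed in the dataset card, the sketch below models one record and validates a raw row. The field names come from the card; everything else is an assumption made for illustration: the class and helper names are hypothetical, the Python types are guesses (the card does not state how `multi_step`, `requires_reasoning`, or `urls` are encoded on disk), and the sample row uses placeholder values rather than real dataset content, apart from the question quoted in the card.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SimpleQAVerifiedExample:
    """One benchmark example. Field names follow the dataset card; types are assumed."""
    original_index: int        # index of the corresponding question in SimpleQA
    problem: str               # prompt testing parametric knowledge
    answer: str                # gold answer used by the autorater
    topic: str
    answer_type: str
    multi_step: bool           # needs information from multiple sources
    requires_reasoning: bool   # needs more complex reasoning
    urls: tuple[str, ...]      # at least two URLs supporting the gold answer


def _as_bool(value: Any) -> bool:
    """Accept real booleans or 'true'/'false'-style strings (assumed encodings)."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in {"true", "1", "yes"}:
        return True
    if text in {"false", "0", "no"}:
        return False
    raise ValueError(f"cannot interpret {value!r} as a boolean")


def _as_urls(value: Any) -> tuple[str, ...]:
    """Accept a list of URLs or a comma-separated string (an assumed encoding)."""
    parts = value.split(",") if isinstance(value, str) else list(value)
    urls = tuple(p.strip() for p in parts if p.strip())
    if len(urls) < 2:
        raise ValueError("the card describes at least two supporting URLs")
    return urls


def parse_example(row: dict[str, Any]) -> SimpleQAVerifiedExample:
    """Validate a raw row and convert it into a typed example."""
    try:
        return SimpleQAVerifiedExample(
            original_index=int(row["original_index"]),
            problem=str(row["problem"]),
            answer=str(row["answer"]),
            topic=str(row["topic"]),
            answer_type=str(row["answer_type"]),
            multi_step=_as_bool(row["multi_step"]),
            requires_reasoning=_as_bool(row["requires_reasoning"]),
            urls=_as_urls(row["urls"]),
        )
    except KeyError as exc:
        raise ValueError(f"missing field: {exc.args[0]}") from exc


# Placeholder row: only `problem` is quoted from the card; all other values are made up.
EXAMPLE_ROW = {
    "original_index": 0,
    "problem": "To whom did Mehbooba Mufti Sayed contest the 2019 Lok Sabha elections and lose?",
    "answer": "<gold answer>",
    "topic": "<topic>",
    "answer_type": "<answer type>",
    "multi_step": "false",
    "requires_reasoning": False,
    "urls": "https://example.org/a, https://example.org/b",
}

if __name__ == "__main__":
    example = parse_example(EXAMPLE_ROW)
    print(example.problem, len(example.urls))
```

Checking the two-URL minimum at parse time is a design choice of this sketch: it surfaces a row that contradicts the card early, instead of letting it reach an evaluation run.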

Provider:

maas

Created:

2025-09-23

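Collected summary

Dataset introduction

Background and challenges

Background overview

SimpleQA Verified is a 1,000-prompt benchmark developed by Google DeepMind and Google Research for reliably evaluating large language models on short-form factuality and parametric knowledge. It builds on OpenAI's SimpleQA, addresses that benchmark's limitations, and uses GPT-4.1 for automatic grading, with the aim of giving the research community a more precise instrument for tracking progress in factuality.

The content above was collected and summarized by 遇见数据集.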