SimpleQA-Bench

Name: SimpleQA-Bench
Creator: maas
Published: 2026-01-07 02:01:05
License: no description available

ModelScope community. Updated 2026-01-07; indexed 2026-01-10.

Download link:

https://modelscope.cn/datasets/PAI/SimpleQA-Bench

Description:

# SimpleQA-Bench

Tags: `factuality`, `EN`, `ZH`, `short-form-answer`, `human-label`

Copyright: © 2024 alibaba-pai

Source:

- OpenAI's SimpleQA: [Blog & Paper](https://openai.com/index/introducing-simpleqa/) / [Data & simple-evals Project](https://github.com/openai/simple-evals/)
- OpenStellarTeam's Chinese-SimpleQA: [Blog & Paper](https://openstellarteam.github.io/ChineseSimpleQA/), [Data@HF](https://huggingface.co/datasets/OpenStellarTeam/Chinese-SimpleQA)

> Factuality is a complicated topic because it is hard to measure—evaluating the factuality of any given arbitrary claim is challenging, and language models can generate long completions that contain dozens of factual claims. In SimpleQA, we will focus on short, fact-seeking queries, which reduces the scope of the benchmark but makes measuring factuality much more tractable.

## Data

SimpleQA-Bench combines the SimpleQA and Chinese-SimpleQA data and converts them into multiple-choice question (MCQ) format.

The two original datasets contain a lot of long-tail and niche knowledge, so the accuracy of direct QA responses from LLMs is generally low (for example, o1-preview and gpt-4o-2024-11-20 reach accuracies of 0.424 (SOTA) and 0.388 on SimpleQA, respectively). In some scenarios (e.g., evaluation), the factuality of an LLM also refers to its ability to judge the correctness of candidate answers, rather than to produce the correct answer directly. Therefore, we asked GPT-4o to generate 3 plausible but incorrect candidate answers for each QA pair, thus converting the original QA data into MCQ format. In total, we successfully transformed 4,326 (SimpleQA) + 2,998 (Chinese-SimpleQA) = 7,324 samples.
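The conversion described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_mcq`, the seeded shuffle, and the distinctness check are assumptions, and the three wrong answers are assumed to arrive as a JSON object with the keys `answer1`, `answer2`, and `answer3` (the response format that the generation prompt requests). The source does not say how option positions were assigned.

```python
import json
import random

OPTION_IDS = ["A", "B", "C", "D"]


def build_mcq(question, answer, wrong_answers_json, seed=0):
    """Combine one QA pair with three generated distractors into an MCQ record.

    Hypothetical helper: option order is decided by a seeded shuffle,
    which is an assumption rather than the documented procedure.
    """
    extra = json.loads(wrong_answers_json)
    distractors = [extra["answer1"], extra["answer2"], extra["answer3"]]

    options = [answer] + distractors
    if len(set(options)) != 4:
        # A distractor equal to the answer (or to another distractor)
        # would make the question ambiguous.
        raise ValueError("options must be four distinct strings")

    rng = random.Random(seed)
    rng.shuffle(options)

    return {
        "question": question,
        "answer": answer,
        "options": options,
        "answer_option": OPTION_IDS[options.index(answer)],
    }


record = build_mcq(
    "Who received the IEEE Frank Rosenblatt Award in 2010?",
    "Michio Sugeno",
    '{"answer1": "Lotfi Zadeh", "answer2": "John McCarthy", '
    '"answer3": "Stephen Grossberg"}',
)
print(record["options"], record["answer_option"])
```

Whatever the shuffle produces, `answer_option` always points at the position of the ground-truth answer in `options`, which is the invariant the dataset fields rely on.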
The data fields are described below:

| Field | Description | SimpleQA Example | Chinese-SimpleQA Example |
| --- | --- | --- | --- |
| `dataset` (str) | Dataset name | openai/SimpleQA | OpenStellarTeam/Chinese-SimpleQA |
| `metadata` (str) | Metadata as a JSON string, including topic, source URLs, etc.; parse it with `json.loads` | After `json.loads`, a dict like `{"topic": "Science and technology", "answer_type": "Person", "urls": ["https://en.wikipedia.org/wiki/IEEE_Frank_Rosenblatt_Award", "https://ieeexplore.ieee.org/author/37271220500", "https://en.wikipedia.org/wiki/IEEE_Frank_Rosenblatt_Award", "https://www.nxtbook.com/nxtbooks/ieee/awards_2010/index.php?startid=21#/p/20"]}` | After `json.loads`, a dict like `{"id": "6fd2645ad3994c89a01acae98cf04f90", "primary_category": "自然与自然科学", "secondary_category": "资讯科学", "urls": ["https://zh.wikipedia.org/wiki/%E8%92%99%E7%89%B9%E5%8D%A1%E6%B4%9B%E6%A0%91%E6%90%9C%E7%B4%A2"]}` |
| `question` (str) | Question | Who received the IEEE Frank Rosenblatt Award in 2010? | 蒙特卡洛树搜索最初由哪位研究人员在1987年的博士论文中探索，并首次提出了其关键特性？ |
| `answer` (str) | Human-verified short-form answer | Michio Sugeno | 布鲁斯·艾布拉姆森（Bruce Abramson） |
| `messages` (List[Dict]) | Messages in the OpenAI standard format for answering the MCQ (four-shot); see `ANSWER_MCQ_PROMPT` in the code below for details | `[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "# Objective ... Answers: "}]` | *The same* |
| `options` (List[str]) | All options, in the order of IDs A/B/C/D | `["Lotfi Zadeh", "Michio Sugeno", "John McCarthy", "Stephen Grossberg"]` | `["布鲁斯·艾布拉姆森（Bruce Abramson）", "勒努瓦·波维尔（Lennart Batsch-Fischer）", "克里斯·沃特森（Chris Watkins）", "马丁·汉森（Martin Hansen）"]` |
| `answer_option` (str) | Correct option ID: A/B/C/D | B | A |

## Prompts for extra answers and messages

```python
# -*- coding: utf-8 -*-
# Author: renjun.hrj
# Date: 2024-12-03

# Prompt asking the model for three plausible but wrong answers to a QA pair.
# Literal braces are doubled because the template is filled with str.format.
GEN_WA_RROMPT = """\
# Objective
Convert a question-answer pair into a valid multi-choice question.

# Detailed Instructions
You are given a question and its correwponding ground-truth answer.
You are kindly asked to come up with three extra answers that are pausible but incorrect \
(i.e., must be semantically different to the ground-truth answer). The QA as well as the \
three incorrect answers could then be turned into a multiple-choice question.
By pausible, we mean that the incorrect answers should be similar in content and format \
to, and have some connection with the ground-truth answer.
For instance: if the ground-truth answer is a four-digit year, those generated extra answers \
could possible be four-digit years close to the ground-truth one; if the ground-truth answer \
is a person name, those generated extra answers could possibly be other persons in the context; \
if the ground-truth answer is a country name, those generated extra answers could be other \
countries geographically or culturally close to the ground-truth one, etc.

# Response Format
Please return a JSON object with three fileds: answer1, answer2, and answer3, e.g., \
{{"answer1": "placeholder", "answer2": "placeholder", "answer3": "placeholder"}}

# Examples
## Example 1
Question: 商阳穴位于人体哪个部位？
Ground-truth Answer: 手
Generated Extra Answers: {{"answer1": "脚", "answer2": "背", "answer3": "腰"}}

## Example 2
Question: 在二十八宿中，白虎象征着哪个方位的七宿？
Ground-truth Answer: 西方
Generated Extra Answers: {{"answer1": "北方", "answer2": "东方", "answer3": "南方"}}

## Example 3
Question: 国际DOI基金会成立于哪一年？
Ground-truth Answer: 1998
Generated Extra Answers: {{"answer1": "1996", "answer2": "2000", "answer3": "2002"}}

## Example 4
Question: Who was the 2nd chairman of the Senate of Pakistan?
Ground-truth Answer: Ghulam Ishaq Khan
Generated Extra Answers: {{"answer1": "Habibullah Khan", "answer2": "Wasim Sajjad", "answer3": "Mohamad Mian Soomro"}}

## Example 5
Question: With how many points did Romania finish the 2022 Rugby Europe Championship?
Ground-truth Answer: 14
Generated Extra Answers: {{"answer1": "12", "answer2": "15", "answer3": "16"}}

## Example 6
Question: In what subject did photographer Kemka Ajoku attain a bachelor's degree in 2020?
Ground-truth Answer: Mechanical Engineering
Generated Extra Answers: {{"answer1": "Electronic Engineering", "answer2": "Computer Science and Engineering", "answer3": "Art Design"}}

# Input
Question: {question}
Ground-truth Answer: {answer}
Generated Extra Answers: \
"""

# Four-shot prompt asking the model to answer an MCQ with a single option ID.
ANSWER_MCQ_PROMPT = """\
# Objective
Answer this multiple choice question by directly choosing the correct option.

# Examples
## Example 1
Question: 商阳穴位于人体哪个部位？
Options:
- A. 手
- B. 脚
- C. 背
- D. 腰
Answer: A

## Example 2
Question: 在二十八宿中，白虎象征着哪个方位的七宿？
Options:
- A. 东方
- B. 南方
- C. 西方
- D. 北方
Answer: C

## Example 3
Question: Who was the 2nd chairman of the Senate of Pakistan?
Options:
- A. Habibullah Khan
- B. Mohamad Mian Soomro
- C. Wasim Sajjad
- D. Ghulam Ishaq Khan
Answer: D

## Example 4
Question: With how many points did Romania finish the 2022 Rugby Europe Championship?
Options:
- A. 12
- B. 14
- C. 15
- D. 16
Answer: B

# Input
Question: {question}
Options:
- A. {opa}
- B. {opb}
- C. {opc}
- D. {opd}
Answers: \
"""


def generage_simple_qa_msgs(template, **kwargs):
    """Build chat messages for one of the two prompts.

    template="gen_wa" expects question and answer;
    template="amcq" expects question, opa, opb, opc and opd.
    """
    if template == "gen_wa":
        return [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": GEN_WA_RROMPT.format(**kwargs)}
        ]
    elif template == "amcq":
        return [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": ANSWER_MCQ_PROMPT.format(**kwargs)}
        ]
    else:
        raise ValueError(f"Unknown template: {template}")
```

## Performance Comparison for QA & MCQ

Scores are accuracies in percent; the MCQ columns also give correct/evaluated counts.

| LLM | SimpleQA (4326) | SimpleQA-MCQ | Chinese-SimpleQA (2998) | Chinese-SimpleQA-MCQ |
| --- | --- | --- | --- | --- |
| gpt-4o-mini-2024-07-18 | 9.5 | 41.2 (1781/4326) | 37.6 | 52.9 (1586/2997) |
| qwen-max | / | 52.5 (2256/4300) | 54.1 | 72.7 (2177/2996) |
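The MCQ scores in the table are a percentage together with a correct/evaluated count. A minimal sketch of that kind of scoring is given below. The helpers `extract_option` and `mcq_accuracy` are hypothetical, the regular-expression parsing of model replies is an assumption, and so is the choice to count an unparseable reply as wrong; the source does not describe how replies were parsed or how the denominators were determined.

```python
import re
from collections import defaultdict


def extract_option(reply):
    """Return the first standalone A/B/C/D in a model reply, or None.

    Heuristic only: it assumes the reply is a bare option ID or a short
    phrase such as "Answer: B".
    """
    match = re.search(r"\b([ABCD])\b", reply.strip())
    return match.group(1) if match else None


def mcq_accuracy(records, replies):
    """Per-dataset accuracy as (percent, correct, total).

    Each record needs the `dataset` and `answer_option` fields.
    Replies that cannot be parsed are counted as wrong (an assumption).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record, reply in zip(records, replies):
        name = record["dataset"]
        total[name] += 1
        if extract_option(reply) == record["answer_option"]:
            correct[name] += 1
    return {
        name: (round(100.0 * correct[name] / total[name], 1),
               correct[name], total[name])
        for name in total
    }


# Toy records carrying only the two fields the scorer reads.
records = [
    {"dataset": "openai/SimpleQA", "answer_option": "B"},
    {"dataset": "openai/SimpleQA", "answer_option": "A"},
    {"dataset": "OpenStellarTeam/Chinese-SimpleQA", "answer_option": "A"},
]
replies = ["B", "C", "Answer: A"]
scores = mcq_accuracy(records, replies)
for name, (pct, n_correct, n_total) in scores.items():
    print(f"{name}: {pct} ({n_correct}/{n_total})")
```

With four options per question, a model that guesses uniformly at random would score about 25%, which is a useful baseline when comparing the MCQ columns against the direct-QA columns.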

Provider: maas

Created: 2025-12-11
