SimpleQA

Name: SimpleQA
Creator: maas
Published: 2026-01-09 18:01:45
License: No description available

ModelScope community; updated 2026-01-09; indexed 2025-11-08

Download link:

https://modelscope.cn/datasets/AI-ModelScope/SimpleQA

Resource description:

# SimpleQA

SimpleQA is a factuality benchmark developed by OpenAI to evaluate the factual accuracy of language models when answering concise, fact-seeking questions. The dataset comprises 4,326 questions spanning diverse topics including science, technology, entertainment, and more.

## Dataset Description

SimpleQA measures the ability for language models to answer short, fact-seeking questions. Each question is designed to have a single, indisputable answer, ensuring straightforward grading and assessment.

### Key Features

- **High Correctness:** Reference answers are supported by sources from two independent AI trainers, ensuring reliability.
- **Diversity:** The dataset covers a wide range of subjects, providing a comprehensive evaluation tool.
- **Challenging for Frontier Models:** Designed to be more demanding than older benchmarks, SimpleQA presents a significant challenge for advanced models like GPT-4o, which scores less than 40% on this benchmark.
- **Researcher-Friendly:** With concise questions and answers, SimpleQA allows for efficient evaluation and grading, making it a practical tool for researchers.

## Dataset Structure

### Data Fields

- `problem`: The fact-seeking question string
- `answer`: The reference answer string
- `metadata`: A dictionary containing:
  - `topic`: The subject category of the question (e.g., "Science and technology", "Art")
  - `answer_type`: The type of answer expected (e.g., "Person", "Number", "Location")
  - `urls`: A list of URLs that support the reference answer

### Data Splits

- `test`: 4,321 questions for evaluation
- `few_shot`: 5 example questions for few-shot evaluation

## References

- [OpenAI Blog Post](https://openai.com/index/introducing-simpleqa/)

## License

See the original OpenAI release for license information.
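## Example: Checking a Record Against the Schema

The record layout listed under Data Fields can be sketched as a small validator. This is a minimal illustration, not part of the original release. It assumes each record arrives as a plain Python dictionary with `metadata` already parsed into a dictionary, as the card describes. The helper names `validate_record` and `count_by_topic` and the sample record are hypothetical. How the records are loaded is left out on purpose.

```python
from collections import Counter
from typing import Any

# Split sizes as stated in the dataset card (4,321 + 5 = 4,326 questions).
SPLIT_SIZES = {"test": 4321, "few_shot": 5}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the record fits."""
    problems = []
    # Top-level fields: both are described as strings.
    for key in ("problem", "answer"):
        if not isinstance(record.get(key), str):
            problems.append(f"{key!r} should be a string")
    metadata = record.get("metadata")
    if not isinstance(metadata, dict):
        problems.append("'metadata' should be a dictionary")
        return problems
    # Metadata fields: two strings and a list of URL strings.
    for key in ("topic", "answer_type"):
        if not isinstance(metadata.get(key), str):
            problems.append(f"metadata[{key!r}] should be a string")
    urls = metadata.get("urls")
    if not (isinstance(urls, list) and all(isinstance(u, str) for u in urls)):
        problems.append("metadata['urls'] should be a list of strings")
    return problems


def count_by_topic(records: list[dict[str, Any]]) -> Counter:
    """Count records per `metadata.topic` value."""
    return Counter(r["metadata"]["topic"] for r in records)


# Hypothetical record for illustration only; NOT taken from the dataset.
example = {
    "problem": "Placeholder question text?",
    "answer": "Placeholder answer",
    "metadata": {
        "topic": "Art",
        "answer_type": "Person",
        "urls": ["https://example.com/source"],
    },
}

if __name__ == "__main__":
    print(validate_record(example))
    print(count_by_topic([example]))
    print(sum(SPLIT_SIZES.values()))
```

Running a check like this over a freshly downloaded split is a cheap way to confirm that the mirror matches the documented fields before any evaluation is run.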


Provider:

maas

Created:

2025-03-10
