SimpleQA

ModelScope community · Updated 2026-01-09 · Indexed 2025-11-08
Download link:
https://modelscope.cn/datasets/AI-ModelScope/SimpleQA
Resource description:
# SimpleQA

SimpleQA is a factuality benchmark developed by OpenAI to evaluate the factual accuracy of language models when answering concise, fact-seeking questions. The dataset comprises 4,326 questions spanning diverse topics including science, technology, entertainment, and more.

## Dataset Description

SimpleQA measures the ability of language models to answer short, fact-seeking questions. Each question is designed to have a single, indisputable answer, ensuring straightforward grading and assessment.

### Key Features

- **High Correctness:** Reference answers are supported by sources from two independent AI trainers, ensuring reliability.
- **Diversity:** The dataset covers a wide range of subjects, providing a comprehensive evaluation tool.
- **Challenging for Frontier Models:** Designed to be more demanding than older benchmarks, SimpleQA presents a significant challenge for advanced models like GPT-4o, which scores less than 40% on this benchmark.
- **Researcher-Friendly:** With concise questions and answers, SimpleQA allows for efficient evaluation and grading, making it a practical tool for researchers.

## Dataset Structure

### Data Fields

- `problem`: The fact-seeking question string
- `answer`: The reference answer string
- `metadata`: A dictionary containing:
  - `topic`: The subject category of the question (e.g., "Science and technology", "Art")
  - `answer_type`: The type of answer expected (e.g., "Person", "Number", "Location")
  - `urls`: A list of URLs that support the reference answer

### Data Splits

- `test`: 4,321 questions for evaluation
- `few_shot`: 5 example questions for few-shot evaluation

## References

- [OpenAI Blog Post](https://openai.com/index/introducing-simpleqa/)

## License

See the original OpenAI release for license information.
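
The record layout listed under Data Fields can be sketched in Python as below. This is a minimal illustration, not part of the official release: the helper names `validate_record` and `count_by_topic` are hypothetical, the example record is a placeholder rather than a real dataset row, and the sketch assumes each record has already been loaded as a plain dictionary.

```python
from collections import Counter
from typing import Any

# Keys the dataset card lists inside the `metadata` dictionary.
REQUIRED_METADATA_KEYS = {"topic", "answer_type", "urls"}


def validate_record(record: dict[str, Any]) -> None:
    """Raise ValueError if a record does not match the documented fields."""
    for key in ("problem", "answer"):
        if not isinstance(record.get(key), str):
            raise ValueError(f"{key!r} must be a string")
    metadata = record.get("metadata")
    if not isinstance(metadata, dict):
        raise ValueError("'metadata' must be a dictionary")
    missing = REQUIRED_METADATA_KEYS - set(metadata)
    if missing:
        raise ValueError(f"metadata is missing keys: {sorted(missing)}")
    if not isinstance(metadata["urls"], list):
        raise ValueError("'metadata.urls' must be a list")


def count_by_topic(records: list[dict[str, Any]]) -> Counter:
    """Count records per `metadata.topic` value."""
    return Counter(r["metadata"]["topic"] for r in records)


# Hypothetical placeholder record, NOT taken from the dataset.
example = {
    "problem": "<question text>",
    "answer": "<reference answer>",
    "metadata": {
        "topic": "Art",
        "answer_type": "Person",
        "urls": ["https://example.com/source"],
    },
}

validate_record(example)
print(count_by_topic([example]))
```

Running a check like this on each split before evaluation is one way to catch rows whose `metadata` has been serialized as a string rather than a dictionary, which can happen when a dataset is re-exported between formats.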

Provider: maas
Created: 2025-03-10