SimpleQA
Source: ModelScope community. Updated 2026-01-09; indexed 2025-11-08.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/SimpleQA
Resource description:
# SimpleQA
SimpleQA is a factuality benchmark developed by OpenAI to evaluate the factual accuracy of language models when answering concise, fact-seeking questions. The dataset comprises 4,326 questions spanning diverse topics including science, technology, entertainment, and more.
## Dataset Description
SimpleQA measures the ability of language models to answer short, fact-seeking questions. Each question is designed to have a single, indisputable answer, which keeps grading straightforward.
### Key Features
- **High Correctness:** Reference answers are supported by sources from two independent AI trainers, ensuring reliability.
- **Diversity:** The dataset covers a wide range of subjects, providing a comprehensive evaluation tool.
- **Challenging for Frontier Models:** SimpleQA is designed to be more demanding than older benchmarks and remains difficult for advanced models such as GPT-4o, which scores less than 40% on it.
- **Researcher-Friendly:** With concise questions and answers, SimpleQA allows for efficient evaluation and grading, making it a practical tool for researchers.
## Dataset Structure
### Data Fields
- `problem`: The fact-seeking question string
- `answer`: The reference answer string
- `metadata`: A dictionary containing:
  - `topic`: The subject category of the question (e.g., "Science and technology", "Art")
  - `answer_type`: The type of answer expected (e.g., "Person", "Number", "Location")
  - `urls`: A list of URLs that support the reference answer
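
As a rough illustration of the field layout above, the following sketch checks that a record has the expected shape. The sample record and the `validate_record` helper are hypothetical, invented for this example; they are not taken from the dataset or from any official tooling.

```python
from typing import Any

# Hypothetical record, invented for illustration; not an actual dataset entry.
sample_record = {
    "problem": "How many sides does a hexagon have?",
    "answer": "6",
    "metadata": {
        "topic": "Science and technology",
        "answer_type": "Number",
        "urls": ["https://example.com/hexagon"],
    },
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of layout problems; an empty list means the record matches."""
    errors = []
    # Top-level question and reference answer are plain strings.
    for key in ("problem", "answer"):
        if not isinstance(record.get(key), str):
            errors.append(f"{key} should be a string")
    metadata = record.get("metadata")
    if not isinstance(metadata, dict):
        errors.append("metadata should be a dictionary")
        return errors
    # Nested metadata fields described in the list above.
    for key in ("topic", "answer_type"):
        if not isinstance(metadata.get(key), str):
            errors.append(f"metadata.{key} should be a string")
    urls = metadata.get("urls")
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        errors.append("metadata.urls should be a list of strings")
    return errors
```

Calling `validate_record(sample_record)` returns an empty list for a well-formed record, so the same check could be run over every loaded row before evaluation.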
### Data Splits
- `test`: 4,321 questions for evaluation
- `few_shot`: 5 example questions for few-shot evaluation
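
One possible way to use the two splits together is to prepend the `few_shot` examples to each `test` question. The sketch below assumes a simple "Question / Answer" prompt template, which is an illustrative choice and not a format prescribed by the dataset; `build_few_shot_prompt` is a hypothetical helper.

```python
# Split sizes as listed above; together they account for the full dataset.
SPLIT_SIZES = {"test": 4321, "few_shot": 5}
TOTAL_QUESTIONS = sum(SPLIT_SIZES.values())  # 4,326 questions overall


def build_few_shot_prompt(few_shot: list[dict], question: str) -> str:
    """Format few-shot examples followed by the question to be answered.

    The "Question:/Answer:" template is an assumption made for this sketch.
    """
    blocks = [
        f"Question: {example['problem']}\nAnswer: {example['answer']}"
        for example in few_shot
    ]
    # Leave the final answer slot empty for the model to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)
```

Keeping the few-shot examples in their own split means they never overlap with the questions being scored.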
## References
- [OpenAI Blog Post](https://openai.com/index/introducing-simpleqa/)
## License
See the original OpenAI release for license information.
Provider: maas
Created: 2025-03-10



