PNYX/gpqa_subtask

Source: Hugging Face. Updated 2025-09-17. Indexed 2025-11-01.
Download link:
https://hf-mirror.com/datasets/PNYX/gpqa_subtask
Description:
GPQA is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are of high quality and extremely difficult, with PhD-level experts in the corresponding domains achieving 65% accuracy (74% when discounting clear mistakes identified by the experts in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted web access. The dataset is also challenging for state-of-the-art AI systems, with our strongest GPT-4-based baseline achieving 39% accuracy. The dataset is intended for use in scalable oversight experiments and for benchmarking general capabilities of large language models.
Provider:
PNYX
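
The accuracy figures quoted in the description (65%, 34%, 39%) are fractions of questions answered correctly. The following is a minimal sketch of that kind of scoring, assuming each prediction and each answer is a single choice label. The function name `accuracy` and the sample labels are hypothetical; the actual column names and split layout of this dataset are not stated on this page.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions whose predicted choice matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    # Compare labels case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical answer key and model outputs, not taken from the dataset.
answer_key = ["A", "C", "B", "D"]
model_choices = ["A", "B", "B", "D"]

score = accuracy(model_choices, answer_key)
print(f"accuracy = {score:.0%}")
```

Normalizing the labels before comparison is a design choice made here so that outputs such as " b" and "B" are treated as the same choice; a stricter evaluation could compare the strings exactly.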