ScaleAI/SciPredict
Source: Hugging Face. Updated 2026-01-15; indexed 2026-02-07.
Download link:
https://hf-mirror.com/datasets/ScaleAI/SciPredict
Overview:
SciPredict is a benchmark that evaluates whether AI systems can predict the outcomes of experiments in physics, chemistry, and biology. It contains 405 questions derived from empirical studies published after March 2025, spanning 33 subdomains: 9 in physics, 10 in chemistry, and 14 in biology. Including model responses, the dataset has 5,716 rows. Questions come in three formats: multiple-choice, free-format, and numerical.

Key fields: DOMAIN, FIELD, PQ_FORMAT, TITLE, URL, PUBLISHING_DATE, EXPERIMENTAL_SETUP, MEASUREMENT_TAKEN, OUTCOME_PREDICTION_QUESTION, GTA (ground-truth answer), BACKGROUND_KNOWLEDGE (expert-curated), and RELATED_PAPERS_DATA.

Key findings: models reach 14-26% accuracy, compared with roughly 20% for human experts; models are poorly calibrated; adding expert-curated background knowledge improves performance; and performance depends on question format.
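Because the 405 questions are spread over 5,716 rows that include model responses, per-domain or per-format statistics need the repeated question rows collapsed first. The sketch below shows one way to do that with plain Python dicts, assuming each row is a mapping keyed by the field names listed above. Two things are assumptions, not documented by the dataset card: that the pair (URL, OUTCOME_PREDICTION_QUESTION) identifies a question, and that a boolean correctness column exists (its name, `correct` here, is hypothetical and passed in by the caller). The toy rows are illustrative, not real data.

```python
from __future__ import annotations

from collections import Counter, defaultdict
from typing import Iterable, Mapping

Row = Mapping[str, object]


def unique_questions(rows: Iterable[Row]) -> list[Row]:
    """Keep the first row per question, dropping repeats such as per-model rows.

    ASSUMPTION: (URL, OUTCOME_PREDICTION_QUESTION) uniquely identifies a question.
    """
    seen: set[tuple[object, object]] = set()
    unique: list[Row] = []
    for row in rows:
        key = (row["URL"], row["OUTCOME_PREDICTION_QUESTION"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def count_by(rows: Iterable[Row], field: str) -> Counter:
    """Count rows per value of a field such as DOMAIN or PQ_FORMAT."""
    return Counter(row[field] for row in rows)


def accuracy_by(rows: Iterable[Row], field: str, correct_field: str) -> dict[object, float]:
    """Fraction of rows marked correct, grouped by `field`.

    HYPOTHETICAL: `correct_field` names a boolean column supplied by the caller;
    it is not one of the documented SciPredict fields.
    """
    hits: dict[object, int] = defaultdict(int)
    totals: dict[object, int] = defaultdict(int)
    for row in rows:
        totals[row[field]] += 1
        hits[row[field]] += int(bool(row[correct_field]))
    return {group: hits[group] / totals[group] for group in totals}


# Toy rows (illustrative only): one question answered by two models, plus a second question.
rows = [
    {"DOMAIN": "Physics", "PQ_FORMAT": "Numerical", "URL": "u1",
     "OUTCOME_PREDICTION_QUESTION": "q1", "correct": True},
    {"DOMAIN": "Physics", "PQ_FORMAT": "Numerical", "URL": "u1",
     "OUTCOME_PREDICTION_QUESTION": "q1", "correct": False},
    {"DOMAIN": "Biology", "PQ_FORMAT": "Multiple-choice", "URL": "u2",
     "OUTCOME_PREDICTION_QUESTION": "q2", "correct": False},
]
questions = unique_questions(rows)
domain_counts = count_by(questions, "DOMAIN")
format_accuracy = accuracy_by(rows, "PQ_FORMAT", "correct")
print(domain_counts)
print(format_accuracy)
```

Counting is done on the deduplicated questions, while accuracy is computed over all response rows, since each response is a separate prediction to score.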
Provider:
ScaleAI



