
ScienceAgentBench

Source: ModelScope community. Updated 2025-11-27; indexed 2025-07-05.
Download link:
https://modelscope.cn/datasets/osunlp/ScienceAgentBench
Resource description:
## ScienceAgentBench

The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about their true capabilities. In this work, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery:

- To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them.
- We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs.
- Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility.

## Benchmark Access

To prevent benchmark data contamination, we only provide the annotation sheet on Hugging Face, which includes all necessary *inputs* to run an agent. To evaluate the agent outcomes, i.e., the generated code, please follow the instructions in our [GitHub repository](https://github.com/OSU-NLP-Group/ScienceAgentBench).

## Benchmark Structure

- "instance_id" (str): unique id for each task
- "domain" (str): scientific discipline of each task
- "subtask_categories" (str): sub-tasks involved in each task
- "github_name" (str): the original GitHub repository each task is adapted from
- "task_inst" (str): task goal description and output formatting instruction
- "domain_knowledge" (str): expert-annotated information about the task
- "dataset_folder_tree" (str): string representation of the dataset directory structure for each task
- "dataset_preview" (str): string representation of the first few examples/lines in the dataset files used in each task
- "src_file_or_path" (str): location of the adapted source program in the original GitHub repository
- "gold_program_name" (str): name of the annotated program (reference solution) for each task
- "output_fname" (str): output location to save the generated program for each task
- "eval_script_name" (str): name of the evaluation script that checks the success criteria for each task

## Licensing Information

Most tasks in ScienceAgentBench are licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>. We retain their original licenses for tasks adapted from [rasterio/rasterio](https://github.com/rasterio/rasterio?tab=License-1-ov-file) (Instance ID: 32, 46, 53, 54, 84) and [hackingmaterials/matminer](https://github.com/hackingmaterials/matminer?tab=License-1-ov-file) (Instance ID: 3).

## Disclaimer

Our benchmark is constructed by adapting open-source code and data, and we respect their creators' ownership and intellectual property. In Appendix I of our paper, we have made our best effort to cite the original papers, list the repositories, and provide their licenses. Still, we acknowledge that two repositories ([rasterio/rasterio](https://github.com/rasterio/rasterio) and [hackingmaterials/matminer](https://github.com/hackingmaterials/matminer)) are copyrighted and believe their terms for use are compatible with our research purpose. We welcome requests from the original authors to modify or remove relevant tasks related to those two repositories if needed.

## Citation

If you find our code and data useful, please consider citing our paper:

```
@misc{chen2024scienceagentbenchrigorousassessmentlanguage,
  title={ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery},
  author={Ziru Chen and Shijie Chen and Yuting Ning and Qianheng Zhang and Boshi Wang and Botao Yu and Yifei Li and Zeyi Liao and Chen Wei and Zitong Lu and Vishal Dey and Mingyi Xue and Frazier N. Baker and Benjamin Burns and Daniel Adu-Ampratwum and Xuhui Huang and Xia Ning and Song Gao and Yu Su and Huan Sun},
  year={2024},
  eprint={2410.05080},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.05080},
}
```
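
## Example: Checking a Record Against the Benchmark Structure

The fields listed under Benchmark Structure can be sketched as a small schema check. This is only an illustrative sketch, not part of the official tooling: the helper name `check_record` and the placeholder row are hypothetical, and the check simply follows the `str` types stated in the field list above. Rows in the actual annotation sheet may need conversion before they pass it.

```python
from typing import Any

# Field names copied from the "Benchmark Structure" list; all are documented as str.
ANNOTATION_FIELDS = (
    "instance_id",
    "domain",
    "subtask_categories",
    "github_name",
    "task_inst",
    "domain_knowledge",
    "dataset_folder_tree",
    "dataset_preview",
    "src_file_or_path",
    "gold_program_name",
    "output_fname",
    "eval_script_name",
)


def check_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record matches the documented fields."""
    problems = []
    for name in ANNOTATION_FIELDS:
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], str):
            actual = type(record[name]).__name__
            problems.append(f"field {name} is {actual}, expected str")
    return problems


# Hypothetical row: every value is a placeholder, not real benchmark data.
example = {name: f"<{name}>" for name in ANNOTATION_FIELDS}
print(check_record(example))
```

Running a check like this before handing a task to an agent makes it easier to spot a sheet that was exported with missing columns or unexpected types.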
Provider:
maas
Created:
2025-07-04