AutoExperiment
Source: arXiv. Indexed: 2025-09-30
Download link:
https://github.com/j1mk1m/AutoExperiment
Description:
AutoExperiment is a dataset for evaluating how well AI agents can implement and run machine learning experiments described in peer-reviewed research papers. Key functions are masked, and the agent is assessed on whether it can reproduce the experimental results. The dataset contains 85 unique functions. Varying the number of masked functions yields up to 275,990 possible samples, and up to 100 samples can be selected for evaluation in each setting.
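The idea of drawing at most 100 samples per masking level can be illustrated with a minimal sketch. It assumes a single flat pool of function names and uniform random selection of masked sets. The names `sample_masked_sets`, `max_samples`, and `func_0` ... `func_5` are hypothetical and do not come from the AutoExperiment repository, and the sketch makes no claim about how the 275,990 figure is actually derived.

```python
import itertools
import math
import random


def sample_masked_sets(functions, num_masked, max_samples=100, seed=0):
    """Pick up to `max_samples` distinct sets of `num_masked` functions to mask.

    Hypothetical helper: treats `functions` as one flat pool, which is an
    assumption and may not match how the real dataset groups functions.
    """
    functions = list(functions)
    total = math.comb(len(functions), num_masked)
    if total <= max_samples:
        # Few enough combinations: enumerate all of them.
        return list(itertools.combinations(functions, num_masked))
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < max_samples:
        # Sort each draw so the same set is never counted twice.
        chosen.add(tuple(sorted(rng.sample(functions, num_masked))))
    return sorted(chosen)


# Toy pool of six placeholder function names (not real dataset entries).
pool = [f"func_{i}" for i in range(6)]
all_pairs = sample_masked_sets(pool, num_masked=2)                 # every pair
some_pairs = sample_masked_sets(pool, num_masked=2, max_samples=5)  # capped at 5
```

Capping each masking level at a fixed sample count keeps evaluation cost bounded even when the number of possible combinations grows quickly with the number of masked functions.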
Provided by:
Open-sourced by the authors



