EXP-Bench
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench
Description:
This dataset comprises 461 AI research tasks designed to evaluate how well AI agents carry out end-to-end experiments drawn from influential AI research papers. The tasks span several subfields of AI, including computer vision, natural language processing, and reinforcement learning. Each task entry provides a research question, a high-level method, and access to the associated code repository. The 461 tasks are derived from 51 papers and break down into 12,737 subtasks in total; the benchmark's core goal is to assess agent performance on complete research experiments.
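To make the stated scale concrete, the following minimal Python sketch models one task entry and derives average ratios from the counts in the description. The class name `ExpBenchTask` and its field names are illustrative assumptions based only on the three components named above (research question, high-level method, code repository); they are not the benchmark's actual schema.

```python
from dataclasses import dataclass


# Hypothetical record layout: the field names are assumptions for
# illustration, not the schema used by the EXP-Bench repository.
@dataclass
class ExpBenchTask:
    research_question: str  # the question the experiment should answer
    method_outline: str     # high-level description of the method
    repo_url: str           # location of the associated code repository


# Counts taken directly from the dataset description.
NUM_PAPERS = 51
NUM_TASKS = 461
NUM_SUBTASKS = 12_737

# Simple averages; the actual per-paper and per-task distributions
# are not given in the description and may be uneven.
tasks_per_paper = NUM_TASKS / NUM_PAPERS
subtasks_per_task = NUM_SUBTASKS / NUM_TASKS

print(f"average tasks per paper:   {tasks_per_paper:.1f}")
print(f"average subtasks per task: {subtasks_per_task:.1f}")
```

On these figures, a paper contributes roughly 9 tasks on average, and a task breaks down into about 27.6 subtasks on average.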
Provided by:
Curated from top-tier AI research papers



