yoonsanglee/researchrubrics-react

Name: yoonsanglee/researchrubrics-react
Creator: yoonsanglee
Published: 2026-04-29 18:00:13
License: No description available

Source: Hugging Face. Updated: 2026-04-29. Indexed: 2026-05-03.

Download link:

https://hf-mirror.com/datasets/yoonsanglee/researchrubrics-react

Description:

AggAgent is an agentic aggregation framework that scales long-horizon agents at test time by sampling multiple parallel rollouts from a base agent and then aggregating their evidence and solutions. This dataset card releases the ReAct base rollouts that AggAgent consumes, i.e. the single-agent trajectories produced before any aggregation step.

Each rollout was generated by running a ReAct-style deep-research agent (reasoning → tool call → observation → ... → final solution) against the benchmark prompts. The agent scaffold is adapted from Tongyi DeepResearch. Each trajectory includes the full message stream, the extracted prediction, tool/rollout cost accounting, and an auto-judge verdict, so it can be used directly for Best-of-N selection, aggregator training, or behavioural analysis of the base policy.

This release covers three open-weights backbones: GLM-4.7-Flash, MiniMax-M2.5, and Qwen3.5-122B-A10B. Each backbone is shipped as a single Parquet file, and eight parallel rollouts are stored per benchmark instance (roll_out_count = 8; see metadata).

Note: Rollouts for BrowseComp and BrowseComp-Plus are intentionally not distributed on Hugging Face. To limit web-crawl contamination of these evals, they are released only as tar archives via the GitHub repo. Rollouts for the remaining benchmarks (DeepSearchQA, HealthBench, HLE, ResearchRubrics) are released here.
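As a rough illustration of the Best-of-N use case mentioned in the description, the sketch below groups rollouts by benchmark instance and compares the average single-rollout judge score with an oracle Best-of-N upper bound. The field names `instance_id` and `judge_score`, and the idea that the auto-judge verdict is a number in [0, 1], are assumptions made for this example and are not taken from the dataset's schema; the records are toy data, not real rollouts.

```python
from collections import defaultdict
from statistics import mean

# Toy records standing in for rollouts. The keys "instance_id" and
# "judge_score" are hypothetical; check the dataset's actual columns.
rollouts = [
    {"instance_id": "q1", "judge_score": 0.25},
    {"instance_id": "q1", "judge_score": 0.75},
    {"instance_id": "q1", "judge_score": 0.5},
    {"instance_id": "q2", "judge_score": 0.5},
    {"instance_id": "q2", "judge_score": 0.5},
    {"instance_id": "q2", "judge_score": 1.0},
]


def group_scores(records):
    """Collect the judge scores of all rollouts that share an instance."""
    groups = defaultdict(list)
    for record in records:
        groups[record["instance_id"]].append(record["judge_score"])
    return dict(groups)


def mean_single_rollout(groups):
    """Expected score of one randomly chosen rollout, averaged over instances."""
    return mean(mean(scores) for scores in groups.values())


def oracle_best_of_n(groups):
    """Upper bound: score obtained if a perfect selector picked the best rollout."""
    return mean(max(scores) for scores in groups.values())


groups = group_scores(rollouts)
print("mean single rollout:", mean_single_rollout(groups))
print("oracle best-of-N:   ", oracle_best_of_n(groups))
```

The gap between the two numbers gives a sense of how much headroom a selector or aggregator has over the base policy; a real selector would have to choose without access to the judge score.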

Provider:

yoonsanglee
