inference-net/HALO-Gemini-3-Flash-AppWorld
Source: Hugging Face. Updated 2026-04-29; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/inference-net/HALO-Gemini-3-Flash-AppWorld
Resource summary:
This dataset contains agent execution traces of Gemini 3 Flash running on the AppWorld benchmark, evaluated on the `test-normal` dataset split. The traces capture full span-level execution detail of the model interacting with AppWorld's simulated app ecosystem. The dataset comprises 168 traces totaling 3,438 spans, stored in the inference.net HALO trace format. AppWorld is a controllable, large-scale benchmark for evaluating autonomous agents on realistic, multi-step digital tasks; it simulates a rich ecosystem of everyday smartphone apps, including email, calendar, banking, messaging, and file storage. HALO (Hierarchical Agent Latency & Observation) is inference.net's standardized schema for representing agent execution traces as structured span trees, capturing the inputs, outputs, tool calls, and latency of each step. The dataset is provided as a single JSONL/JSON trace file in which each entry represents one span of an agent execution trace; spans are linked hierarchically by trace ID to form a full execution tree for each of the 168 task episodes. Intended uses include RLM analysis, benchmark research, and agent training.
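As a rough illustration of the "one span per entry, grouped by trace ID" layout described above, the following sketch groups JSONL span records into traces and indexes each trace's parent-to-child links. The field names `trace_id`, `span_id`, and `parent_span_id` are assumptions made for this example and are not confirmed by the listing; check the actual HALO schema in the dataset before relying on them. The sample records are synthetic.

```python
import json
from collections import defaultdict


def group_spans_by_trace(lines, trace_key="trace_id"):
    """Group JSONL span records by trace ID.

    `trace_key` is an assumed field name, not taken from the HALO schema.
    """
    traces = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        span = json.loads(line)
        traces[span[trace_key]].append(span)
    return dict(traces)


def build_children_index(spans, id_key="span_id", parent_key="parent_span_id"):
    """Map each parent span ID to its child span IDs (root spans sit under None).

    `id_key` and `parent_key` are assumed field names as well.
    """
    children = defaultdict(list)
    for span in spans:
        children[span.get(parent_key)].append(span[id_key])
    return dict(children)


# Synthetic sample standing in for lines read from the trace file.
sample_lines = [
    '{"trace_id": "t1", "span_id": "a", "parent_span_id": null}',
    '{"trace_id": "t1", "span_id": "b", "parent_span_id": "a"}',
    "",
    '{"trace_id": "t1", "span_id": "c", "parent_span_id": "a"}',
    '{"trace_id": "t2", "span_id": "d", "parent_span_id": null}',
]

traces = group_spans_by_trace(sample_lines)
children = build_children_index(traces["t1"])
```

To use this with the real data, pass an open file handle for the downloaded trace file in place of `sample_lines`. If the assumed keys match the schema, `len(traces)` and the total span count can be checked against the 168 traces and 3,438 spans stated in the summary.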
Provider:
inference-net



