TRAIL

Name: TRAIL
Creator: maas
Published: 2025-12-04 16:34:59
License: Not specified

ModelScope community; updated 2025-12-04, indexed 2025-05-24

Download link:

https://modelscope.cn/datasets/PatronusAI/TRAIL

Resource description:

# Trace Reasoning and Agentic Issue Localization (TRAIL)

<img src="https://i.imgur.com/BDk2QcM.jpeg" width="30%" height="30%" alt="TRAIL"/>

TRAIL is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. Created from real-world software engineering and information retrieval tasks, it challenges even state-of-the-art LLMs, with the best model achieving only 11% accuracy, highlighting the difficulty of trace debugging for complex agent workflows.

## Dataset Details

### Dataset Description

TRAIL (Trace Reasoning and Agentic Issue Localization) is a new benchmark dataset designed to evaluate how well large language models can debug and identify errors in complex AI agent workflows. The dataset contains 148 meticulously annotated agent execution traces with 841 unique errors across a taxonomy of error categories spanning reasoning errors (like hallucinations), system execution errors (like API issues), and planning/coordination errors. TRAIL is constructed from real-world applications using the GAIA and SWE-Bench datasets, featuring both single and multi-agent systems tackling tasks in software engineering and information retrieval. The paper demonstrates that even state-of-the-art LLMs perform poorly on TRAIL, with the best model (Gemini-2.5-Pro) achieving only 11% joint accuracy. The benchmark is particularly challenging because it requires processing extremely long contexts that often exceed model context windows and demands significant output generation, making it valuable for improving LLMs' ability to evaluate complex agentic systems.
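
The card reports a "joint accuracy" figure without defining it. As a rough illustration only, the sketch below assumes one plausible reading: an annotated error counts as found only when a prediction matches both its span and its category. The function name, the span IDs, and the scoring rule itself are assumptions made for this example, not the benchmark's official evaluation code.

```python
from typing import Iterable, Tuple

# An error is identified by (span_id, category). This pairing is an assumption
# made for illustration; the official scorer may define matches differently.
ErrorKey = Tuple[str, str]


def joint_accuracy(gold: Iterable[ErrorKey], predicted: Iterable[ErrorKey]) -> float:
    """Fraction of annotated errors whose span AND category were both predicted."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(gold_set & set(predicted)) / len(gold_set)


# Hypothetical span IDs; the category names are the two mentioned in the card.
gold = [
    ("span-1", "Formatting Errors"),
    ("span-2", "Instruction Non-compliance"),
    ("span-3", "Formatting Errors"),
    ("span-4", "Instruction Non-compliance"),
]
predicted = [
    ("span-1", "Formatting Errors"),   # right span, right category
    ("span-2", "Formatting Errors"),   # right span, wrong category
    ("span-9", "Formatting Errors"),   # span with no annotated error
]

score = joint_accuracy(gold, predicted)
print(score)
```

Under this reading, locating the right span with the wrong category earns no credit, which is one reason a joint score can sit far below either component score.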
- **Curated by:** Patronus AI
- **Language(s) (NLP):** English
- **License:** MIT License

### Dataset Sources

- **Repository:** https://github.com/patronus-ai/trail-benchmark
- **Paper:** https://arxiv.org/abs/2505.08638

### Out-of-Scope Use

You must not use this dataset for training systems (AI models or otherwise) that are intended to automate human evaluation. This dataset is only meant for evaluation and benchmarking of such systems.

## Model Performance on TRAIL

<img src="https://i.imgur.com/QeHGLAj.png" width="50%" height="50%" alt="TRAIL Results"/>

## Dataset Structure

The dataset consists of 148 traces (118 from GAIA and 30 from SWE-Bench) totaling 1,987 OpenTelemetry spans, of which 575 exhibit at least one error. The dataset is structured with trace-level annotations showing span IDs, error category types, supporting evidence, descriptions, and impact levels (Low/Medium/High) for each identified error. The dataset is split between the GAIA benchmark (open-world search tasks) and SWE-Bench (software engineering bug fixing), ensuring ecological validity across different agent applications.

## Dataset Creation

### Curation Rationale

The dataset was created to address the growing need for robust and dynamic evaluation methods for agentic workflow traces. As agentic systems become increasingly complex and widely adopted across domains, existing evaluation methods that rely on manual, domain-specific analysis of traces do not scale well. TRAIL provides a structured way to evaluate traces with a comprehensive taxonomy, enabling more systematic debugging and error analysis of complex agent behavior.

### Source Data

### Data Collection and Processing

The dataset was created using text-only data instances from GAIA (for open-world search tasks) and SWE-Bench Lite (for software engineering bug fixing tasks). For GAIA traces, we used the Hugging Face OpenDeepResearch agent with o3-mini-2025-01-31 as the backbone model.
For SWE-Bench, we used a CodeAct agent with claude-3-7-sonnet-20250219 as the backbone model, with added instructional constraints to organically introduce errors. All traces were collected using OpenTelemetry, specifically the OpenInference standard, ensuring compatibility with real-world tracing and observability software.

### Who are the source data producers?

The source data was produced by AI agent systems based on OpenAI's o3-mini and Anthropic's Claude models, executing tasks from the GAIA and SWE-Bench datasets. The traces capture the execution flows of these agents attempting to solve information retrieval and software engineering tasks.

### Annotations

### Annotation process

Four expert annotators with backgrounds in software engineering and log debugging annotated the agent traces. Because the traces are lengthy (often exceeding maximum LLM context lengths), ML researchers performed four independent rounds of verification to ensure high quality. Annotators iterated over each LLM and tool span individually and in context, marking span ID, error category, evidence, description, and impact level. They also rated overall traces on instruction adherence, plan optimality, security, and reliability. Inter-annotator agreement was high, with only 5.63% of spans modified in SWE-Bench and 5.31% in GAIA during review.

### Who are the annotators?

The annotations were created by four expert annotators with backgrounds in software engineering and log debugging, selected based on their age (18+) and expertise in computer science. The annotations were further verified by four industry ML researchers to ensure high quality.

### Personal and Sensitive Information

The dataset does not contain personally identifiable information (PII) or sensitive content. The traces were manually verified before being forwarded to annotators to ensure no explicit or biased content was included.
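
The per-error fields listed under "Annotation process" above (span ID, error category, evidence, description, impact level) can be pictured as a small record type. The sketch below is a hypothetical in-memory model for illustration; the class names, field names, and sample values are assumptions, and the keys in the released files may differ.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set


class Impact(Enum):
    """Impact levels named in the card (Low/Medium/High)."""
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


@dataclass
class ErrorAnnotation:
    """One annotated error; field names are hypothetical."""
    span_id: str
    category: str
    evidence: str
    description: str
    impact: Impact


@dataclass
class TraceAnnotation:
    """All annotated errors for a single trace."""
    trace_id: str
    errors: List[ErrorAnnotation] = field(default_factory=list)

    def spans_with_errors(self) -> Set[str]:
        # A span may carry several errors, so spans are counted separately.
        return {e.span_id for e in self.errors}

    def errors_by_impact(self) -> Dict[str, int]:
        return dict(Counter(e.impact.value for e in self.errors))


# Made-up sample values, used only to exercise the types.
example = TraceAnnotation(
    trace_id="trace-0001",
    errors=[
        ErrorAnnotation("span-a", "Formatting Errors",
                        "sample evidence", "sample description", Impact.LOW),
        ErrorAnnotation("span-a", "Instruction Non-compliance",
                        "sample evidence", "sample description", Impact.HIGH),
        ErrorAnnotation("span-b", "Formatting Errors",
                        "sample evidence", "sample description", Impact.LOW),
    ],
)

print(len(example.errors), len(example.spans_with_errors()))
print(example.errors_by_impact())
```

Keeping errors and spans as separate counts mirrors the card's own figures, where 841 errors are spread over 575 erroneous spans.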
### Bias, Risks, and Limitations

The TRAIL dataset has the following limitations:

- It is primarily focused on text-only inputs and outputs.
- There is an imbalance in error categories, with Output Generation errors (particularly Formatting Errors and Instruction Non-compliance) accounting for nearly 42% of all errors.

## Citation

**BibTeX:**

```
@misc{deshpande2025trail,
  title={TRAIL: Trace Reasoning and Agentic Issue Localization},
  author={Darshan Deshpande and Varun Gangal and Hersh Mehta and Jitin Krishnan and Anand Kannappan and Rebecca Qian},
  year={2025},
  eprint={2505.08638},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.08638}
}
```

**APA:**

```
Deshpande, D., Gangal, V., Mehta, H., Krishnan, J., Kannappan, A., & Qian, R. (2025). TRAIL: Trace Reasoning and Agentic Issue Localization. arXiv. https://arxiv.org/abs/2505.08638
```

## Dataset Card Authors

Darshan Deshpande

## Dataset Card Contact

darshan@patronus.ai

Provider: maas
Created: 2025-05-20
