TRAIL
ModelScope community · Updated 2025-12-04 · Indexed 2025-05-24
Download link:
https://modelscope.cn/datasets/PatronusAI/TRAIL
# Trace Reasoning and Agentic Issue Localization (TRAIL)
<img src="https://i.imgur.com/BDk2QcM.jpeg" width="30%" height="30%" alt="TRAIL"/>
TRAIL is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. Built from real-world software engineering and information retrieval tasks, it challenges even state-of-the-art LLMs: the best model achieves only 11% joint accuracy, which shows how difficult trace debugging is for complex agent workflows.
## Dataset Details
### Dataset Description
TRAIL (Trace Reasoning and Agentic Issue Localization) is a new benchmark dataset designed to evaluate how well large language models can debug and identify errors in complex AI agent workflows.
The dataset contains 148 meticulously annotated agent execution traces with 841 unique errors across a taxonomy of error categories spanning reasoning errors (like hallucinations), system execution errors (like API issues), and planning/coordination errors.
TRAIL is constructed from real-world applications using the GAIA and SWE-Bench datasets, featuring both single and multi-agent systems tackling tasks in software engineering and information retrieval.
The paper demonstrates that even state-of-the-art LLMs perform poorly on TRAIL, with the best model (Gemini-2.5-Pro) achieving only 11% joint accuracy.
The benchmark is particularly challenging because it requires processing extremely long contexts that often exceed model context windows and demands significant output generation, making it valuable for improving LLMs' ability to evaluate complex agentic systems.
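The card reports joint accuracy without defining it. The sketch below is a minimal illustration that assumes a predicted error counts as correct only when both its span ID and its error category match a gold annotation; the paper's exact definition may differ. The function names and the `(span_id, category)` tuple layout are hypothetical and are not taken from the TRAIL repository.

```python
from typing import Iterable, Set, Tuple

# An error is keyed by (span_id, category). This layout is an assumption
# made for illustration, not the dataset's actual schema.
ErrorKey = Tuple[str, str]


def location_accuracy(gold: Iterable[ErrorKey], pred: Iterable[ErrorKey]) -> float:
    """Fraction of gold error spans that the model also flagged."""
    gold_spans = {span for span, _ in gold}
    pred_spans = {span for span, _ in pred}
    if not gold_spans:
        return 0.0
    return len(gold_spans & pred_spans) / len(gold_spans)


def joint_accuracy(gold: Iterable[ErrorKey], pred: Iterable[ErrorKey]) -> float:
    """Fraction of gold (span, category) pairs that the model matched exactly."""
    gold_set: Set[ErrorKey] = set(gold)
    pred_set: Set[ErrorKey] = set(pred)
    if not gold_set:
        return 0.0
    return len(gold_set & pred_set) / len(gold_set)


gold = [("span-1", "Formatting Errors"), ("span-2", "Instruction Non-compliance")]
pred = [("span-1", "Formatting Errors"), ("span-2", "Formatting Errors")]

print(location_accuracy(gold, pred))  # 1.0: both error spans were found
print(joint_accuracy(gold, pred))     # 0.5: only one span also has the right category
```

Under this assumed definition the joint score can never exceed the location score, which is one reason a joint figure can sit well below what span localization alone would suggest.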
- **Curated by:** Patronus AI
- **Language(s) (NLP):** English
- **License:** MIT License
### Dataset Sources
- **Repository:** https://github.com/patronus-ai/trail-benchmark
- **Paper:** https://arxiv.org/abs/2505.08638
### Out-of-Scope Use
You must not use this dataset for training systems (AI models or otherwise) that are intended to automate human evaluation. This dataset is only meant for evaluation and benchmarking of such systems.
## Model Performance on TRAIL
<img src="https://i.imgur.com/QeHGLAj.png" width="50%" height="50%" alt="TRAIL Results"/>
## Dataset Structure
The dataset consists of 148 traces (118 from GAIA and 30 from SWE-Bench) totaling 1,987 OpenTelemetry spans, of which 575 exhibit at least one error. The dataset is structured with trace-level annotations showing span IDs, error category types, supporting evidence, descriptions, and impact levels (Low/Medium/High) for each identified error. The dataset is split between the GAIA benchmark (open-world search tasks) and SWE-Bench (software engineering bug fixing), ensuring ecological validity across different agent applications.
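The counts above imply a few density figures that help when sizing an evaluation run. The following sketch only does arithmetic on the numbers stated in this card; the variable names are illustrative, and the per-span figure assumes each of the 841 errors is attached to exactly one span, which the card does not state.

```python
# Counts stated in the dataset card.
TRACES = 148
GAIA_TRACES = 118
SWE_BENCH_TRACES = 30
SPANS = 1987
ERROR_SPANS = 575
ERRORS = 841

# Sanity check: the two sources add up to the stated total.
assert GAIA_TRACES + SWE_BENCH_TRACES == TRACES

stats = {
    # Share of spans that carry at least one error (about 29%).
    "error_span_share": ERROR_SPANS / SPANS,
    # Average number of annotated errors per trace (about 5.7).
    "errors_per_trace": ERRORS / TRACES,
    # Average number of spans per trace (about 13.4).
    "spans_per_trace": SPANS / TRACES,
    # Assumes every error belongs to exactly one span (about 1.5).
    "errors_per_error_span": ERRORS / ERROR_SPANS,
}

for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```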
## Dataset Creation
### Curation Rationale
The dataset was created to address the growing need for robust and dynamic evaluation methods for agentic workflow traces.
As agentic systems become increasingly complex and widely adopted across domains, existing evaluation methods that rely on manual, domain-specific analysis of traces do not scale well.
TRAIL provides a structured way to evaluate traces with a comprehensive taxonomy, enabling more systematic debugging and error analysis of complex agent behavior.
### Source Data
### Data Collection and Processing
The dataset was created using text-only data instances from GAIA (for open-world search tasks) and SWE-Bench Lite (for software engineering bug fixing tasks).
For GAIA traces, we used the Hugging Face OpenDeepResearch agent with o3-mini-2025-01-31 as the backbone model.
For SWE-Bench, we used a CodeAct agent with claude-3-7-sonnet-20250219 as the backbone model, with added instructional constraints to organically introduce errors.
All traces were collected using OpenTelemetry, specifically the OpenInference standard, ensuring compatibility with real-world tracing and observability software.
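Because the traces follow the OpenInference conventions, each span carries a kind such as `LLM` or `TOOL` under the `openinference.span.kind` attribute. The sketch below walks a nested trace and collects the IDs of LLM and tool spans. The JSON layout it assumes (`spans`, `child_spans`, `span_attributes`, `span_id`) is hypothetical and may not match the released files, so adjust the keys to the actual data.

```python
from typing import Any, Dict, Iterator, List

# Attribute key defined by the OpenInference semantic conventions.
SPAN_KIND_KEY = "openinference.span.kind"


def iter_spans(span: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a span and all of its descendants in depth-first order."""
    yield span
    for child in span.get("child_spans", []):  # assumed key name
        yield from iter_spans(child)


def llm_and_tool_span_ids(trace: Dict[str, Any]) -> List[str]:
    """Return the IDs of all LLM and TOOL spans in a trace."""
    selected: List[str] = []
    for root in trace.get("spans", []):  # assumed key name
        for span in iter_spans(root):
            kind = span.get("span_attributes", {}).get(SPAN_KIND_KEY)
            if kind in {"LLM", "TOOL"}:
                selected.append(span["span_id"])
    return selected


# A tiny hand-made trace used only to exercise the traversal.
trace = {
    "spans": [
        {
            "span_id": "a",
            "span_attributes": {SPAN_KIND_KEY: "AGENT"},
            "child_spans": [
                {"span_id": "b", "span_attributes": {SPAN_KIND_KEY: "LLM"}},
                {"span_id": "c", "span_attributes": {SPAN_KIND_KEY: "TOOL"}},
            ],
        }
    ]
}

print(llm_and_tool_span_ids(trace))  # ['b', 'c']
```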
### Who are the source data producers?
The source data was produced by AI agent systems based on OpenAI's o3-mini and Anthropic's Claude models, executing tasks from the GAIA and SWE-Bench datasets. The traces capture the execution flows of these agents attempting to solve information retrieval and software engineering tasks.
### Annotations
### Annotation process
Four expert annotators with backgrounds in software engineering and log debugging annotated the agent traces.
Because the traces are long (often exceeding maximum LLM context lengths), ML researchers performed four independent rounds of verification to ensure high quality.
Annotators iterated over each LLM and tool span individually and in context, marking span ID, error category, evidence, description, and impact level.
They also rated overall traces based on instruction adherence, plan optimality, security, and reliability.
Interannotator agreement was high, with only 5.63% of spans modified in SWE-Bench and 5.31% in GAIA during review.
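The per-error fields listed above (span ID, error category, evidence, description, impact level) map naturally onto a small typed record. The sketch below is one possible representation, not the dataset's official schema: the class, field, and key names are hypothetical, and only the three impact levels come from this card.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict


class Impact(Enum):
    """Impact levels named in the dataset card."""
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


@dataclass(frozen=True)
class ErrorAnnotation:
    """One annotated error; field names are illustrative assumptions."""
    span_id: str
    category: str
    evidence: str
    description: str
    impact: Impact

    def __post_init__(self) -> None:
        if not self.span_id:
            raise ValueError("span_id must be non-empty")


def parse_annotation(raw: Dict[str, Any]) -> ErrorAnnotation:
    """Build an ErrorAnnotation from a raw dict with assumed key names."""
    return ErrorAnnotation(
        span_id=raw["span_id"],
        category=raw["category"],
        evidence=raw["evidence"],
        description=raw["description"],
        # Normalize casing so "HIGH" and "high" both map to Impact.HIGH;
        # unknown levels raise ValueError.
        impact=Impact(raw["impact"].capitalize()),
    )


example = parse_annotation(
    {
        "span_id": "span-1",
        "category": "Formatting Errors",
        "evidence": "Final answer omitted the required format.",
        "description": "Output did not follow the requested format.",
        "impact": "HIGH",
    }
)
print(example.impact)  # Impact.HIGH
```

Validating the impact level at parse time surfaces malformed annotations early instead of letting them skew per-category or per-impact counts later.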
### Who are the annotators?
The annotations were created by four expert annotators with backgrounds in software engineering and log debugging, selected based on their age (18+) and expertise in computer science.
The annotations were further verified by four industry ML researchers to ensure high quality.
### Personal and Sensitive Information
The dataset does not contain personally identifiable information (PII) or sensitive content.
The traces were manually verified before being forwarded to annotators to ensure no explicit or biased content was included.
### Bias, Risks, and Limitations
The TRAIL dataset has the following limitations:
- It is primarily focused on text-only inputs and outputs.
- There is an imbalance in error categories, with Output Generation errors (particularly Formatting Errors and Instruction Non-compliance) accounting for nearly 42% of all errors.
## Citation
**BibTeX:**
```
@misc{deshpande2025trail,
title={TRAIL: Trace Reasoning and Agentic Issue Localization},
author={Darshan Deshpande and Varun Gangal and Hersh Mehta and Jitin Krishnan and Anand Kannappan and Rebecca Qian},
year={2025},
eprint={2505.08638},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2505.08638}
}
```
**APA:**
```
Deshpande, D., Gangal, V., Mehta, H., Krishnan, J., Kannappan, A., & Qian, R. (2025). TRAIL: Trace Reasoning and Agentic Issue Localization. arXiv. https://arxiv.org/abs/2505.08638
```
## Dataset Card Authors
Darshan Deshpande
## Dataset Card Contact
darshan@patronus.ai
Provider:
maas
Created:
2025-05-20
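Summary
Background overview
TRAIL is a benchmark dataset of 148 annotated agent execution traces containing 841 errors, used to evaluate how well large language models debug complex AI agent workflows. It is drawn from real-world software engineering and information retrieval tasks, which gives it high ecological validity, and the best current model reaches only 11% accuracy on it.
The summary above was collected and generated by 遇见数据集.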



