Toolathlon-Trajectories

Name: Toolathlon-Trajectories
Creator: maas
Published: 2026-05-14 14:15:31
License: not specified

ModelScope community; updated 2026-05-14; indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/hkust-nlp/Toolathlon-Trajectories

Resource description:

<div align="center">

<p align="center">
  <img src="./toolathlon.svg" alt="Logo" width="500" height="200"/>
</p>

# The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

[![Website](https://img.shields.io/badge/Website-4285F4?style=for-the-badge&logo=google-chrome&logoColor=white)](https://toolathlon.xyz)
[![Discord](https://img.shields.io/badge/Join_Our_Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.gg/Da3AaW4rVs)
[![arXiv](https://img.shields.io/badge/Paper-b31b1b?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2510.25726)
[![Hugging Face](https://img.shields.io/badge/Trajectories-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/datasets/hkust-nlp/Toolathlon-Trajectories)
[![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/hkust-nlp/Toolathlon)

</div>

## Dataset Overview

This dataset contains the complete execution trajectories of 17 state-of-the-art language models evaluated on the Toolathlon benchmark. Toolathlon is a comprehensive benchmark for evaluating language agents on diverse, realistic, and long-horizon tasks.

**Dataset Statistics:**

- **51 trajectory files** (17 models × 3 runs each)
- **~108 tasks per file** (some entries may be `None`, depending on whether the task completed successfully)
- **Total trajectories:** more than 5,000 task execution records
- **File format:** JSONL (one task trajectory per line)

This dataset enables researchers to:

- Analyze how different LLMs use tools to complete real-world tasks
- Study agent reasoning patterns and tool-use strategies
- Compare performance across model families
- Investigate failure modes and error recovery strategies

## Dataset Structure

### File Naming Convention

Each file follows the naming pattern:

```
{model_name}_{run_number}.jsonl
```

- **`model_name`**: Model identifier (e.g., `gpt-5-high`, `claude-4.5-sonnet-0929`)
- **`run_number`**: Run index (1, 2, or 3); each model was evaluated 3 times independently

**Example filenames:**

- `gpt-5-high_1.jsonl`: GPT-5 High, first run
- `claude-4.5-sonnet-0929_2.jsonl`: Claude 4.5 Sonnet, second run
- `gemini-2.5-pro_3.jsonl`: Gemini 2.5 Pro, third run

### Models Included

The dataset includes trajectories from the following 17 models:

| Model Family | Model Names |
|--------------|-------------|
| **OpenAI GPT** | `gpt-5`, `gpt-5-high`, `gpt-5-mini` |
| **OpenAI o-series** | `o3`, `o4-mini` |
| **Anthropic Claude** | `claude-4-sonnet-0514`, `claude-4.5-sonnet-0929`, `claude-4.5-haiku-1001` |
| **Grok** | `grok-4`, `grok-4-fast`, `grok-code-fast-1` |
| **Google Gemini** | `gemini-2.5-pro`, `gemini-2.5-flash` |
| **DeepSeek** | `deepseek-v3.2-exp` |
| **Alibaba Qwen** | `qwen-3-coder` |
| **Moonshot Kimi** | `kimi-k2-0905` |
| **Zhipu GLM** | `glm-4.6` |

### Data Format

Each JSONL file contains one JSON object per line, representing a single task execution trajectory:

```json
{
    "modelname_run": "claude-4-sonnet-0514_1",
    "task_name": "find-alita-paper",
    "task_status": {
        "preprocess": "done",
        "running": "done",
        "evaluation": true
    },
    "config": {...},
    "messages": [...],
    "tool_calls": [...],
    "key_stats": {...},
    "agent_cost": {...},
    "request_id": xxx,
    "initial_run_time": xxx,
    "completion_time": xxx
}
```

#### Field Descriptions

To make the data easier for the Hugging Face dataset viewer to display, all values are stored as JSON-serialized strings. Remember to deserialize them after downloading the files:

- **`modelname_run`**: Model identifier and run index, matching the file name (e.g., `"claude-4-sonnet-0514_1"`)
- **`task_name`**: Unique identifier for the task (e.g., `"train-ticket-plan"`, `"gdp-cr5-analysis"`)
- **`task_status`**: Execution status information
  - `preprocess`: Whether preprocessing completed successfully (`"done"`, `"fail"`)
  - `running`: Whether task execution completed (`"done"`, `"fail"`, `"timeout"`, `"max_turn_exceeded"`)
  - `evaluation`: Boolean indicating whether the task passed evaluation
- **`config`**: Task configuration, including:
  - `needed_mcp_servers`: List of MCP servers required (e.g., `["filesystem", "github", "snowflake"]`)
  - `needed_local_tools`: List of local tools available (e.g., `["web_search", "claim_done"]`)
  - `task_str`: The natural language task description given to the agent
  - `max_steps_under_single_turn_mode`: Maximum number of agent steps allowed
  - `system_prompts`: System prompts for the agent and the user simulator (no user simulator is actually used)
  - Other configuration details
- **`messages`**: Full conversation history between the agent and the user simulator
  - Each message contains the role, content, tool calls, and timestamps
- **`tool_calls`**: List of all tools available in this task
  - Tool name, arguments, descriptions, etc.
- **`key_stats`**: Summary statistics
  - Number of turns, tool calls, tokens used, execution time, etc.
- **`agent_cost`**: LLM API costs for the agent model (approximate, because prompt caching is not taken into account)
- **`status`**: Final execution status
- **`request_id`**, **`initial_run_time`**, **`completion_time`**: Execution metadata

## Privacy & Anonymization

All sensitive credentials and API tokens have been anonymized to protect privacy.
The anonymization process:

1. **Identifies** all API keys, tokens, passwords, and credentials from the configuration
2. **Preserves** the first 1/6 and last 1/6 of each sensitive string (minimum 1 character each)
3. **Replaces** the middle portion with asterisks (`*`)

**Example:**

- Original: `ghp_JfjCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAqpKK`
- Anonymized: `ghp_Jf****************************1lqpKK`

## Citation

If you use this dataset in your research, please cite:

```bibtex
@article{li2025toolathlon,
  title={The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution},
  author={Junlong Li and Wenshuo Zhao and Jian Zhao and Weihao Zeng and Haoze Wu and Xiaochen Wang and Rui Ge and Yuxuan Cao and Yuzhen Huang and Wei Liu and Junteng Liu and Zhaochen Su and Yiyang Guo and Fan Zhou and Lueyang Zhang and Juan Michelini and Xingyao Wang and Xiang Yue and Shuyan Zhou and Graham Neubig and Junxian He},
  year={2025},
  eprint={2510.25726},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.25726},
}
```

## License

This dataset is released under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
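
As a reading aid for the file naming convention and data format described above, here is a minimal Python sketch of loading one trajectory file. The helper names (`parse_filename`, `maybe_decode`, `iter_trajectories`, `pass_rate`) are hypothetical and not part of the Toolathlon codebase. The sketch assumes that string-valued fields may hold JSON-serialized content and that lines which do not decode to an object can be skipped.

```python
import json
from pathlib import Path


def parse_filename(path):
    """Split '{model_name}_{run_number}.jsonl' into (model_name, run_number)."""
    stem = Path(path).stem
    # Model names contain hyphens and dots, so split on the last underscore only.
    model_name, run_number = stem.rsplit("_", 1)
    return model_name, int(run_number)


def maybe_decode(value):
    """Decode a JSON-serialized string; return any other value unchanged.

    Caveat: a plain string that happens to be valid JSON (e.g. "123")
    is decoded as well. That is acceptable for this sketch.
    """
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return value  # ordinary string such as a task name
    return value


def iter_trajectories(path):
    """Yield one deserialized record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Assumption: entries that are not JSON objects (e.g. null) are skipped.
            if not isinstance(record, dict):
                continue
            yield {key: maybe_decode(value) for key, value in record.items()}


def pass_rate(records):
    """Fraction of records whose task_status.evaluation is true."""
    records = list(records)
    if not records:
        return 0.0
    passed = 0
    for record in records:
        status = record.get("task_status")
        if isinstance(status, dict) and status.get("evaluation") is True:
            passed += 1
    return passed / len(records)
```

For example, `parse_filename("claude-4.5-sonnet-0929_2.jsonl")` returns `("claude-4.5-sonnet-0929", 2)`, and `pass_rate(iter_trajectories(path))` gives the share of tasks in one run that passed evaluation.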

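The masking rule described under Privacy & Anonymization (keep the first and last sixth of a secret, at least one character each, and replace the middle with asterisks) can be sketched as follows. This is an illustration only, not the authors' anonymization code. The function name `mask_secret` is hypothetical, and both the rounding down of 1/6 and the handling of strings of one or two characters are assumptions.

```python
def mask_secret(secret: str) -> str:
    """Mask the middle of a secret, keeping about 1/6 at each end."""
    n = len(secret)
    if n <= 2:
        # Assumption: nothing lies between the kept first and last characters.
        return secret
    # Assumption: 1/6 is rounded down, with a minimum of one character.
    keep = max(1, n // 6)
    return secret[:keep] + "*" * (n - 2 * keep) + secret[-keep:]


print(mask_secret("abcdefghijkl"))  # prints "ab********kl"
```

Under these assumptions a 40-character token keeps 6 characters at each end and has 28 masked characters, so the masked string has the same length as the original.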
Provider:

maas

Created:

2025-10-27
