wuwendy/agent_trajectories

Name: wuwendy/agent_trajectories
Creator: wuwendy
Published: 2026-03-26 15:51:19
License: not specified

Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/wuwendy/agent_trajectories

Description:

# Agent Trajectories Dataset: Processing & Format Documentation

## Overview

| Benchmark | Records | Models | Passes | Avg Turns | Reward Type | Avg Reward |
|---|---|---|---|---|---|---|
| tau2bench | 1,000 | 5 | 4 | 32.3 | binary (0/1) | 0.385 |
| swebench | 998 | 5 | 4 | 81.1 | binary (0/1) | 0.166 |
| terminalbench | 1,580 | 5 | 4 | 31.9 | binary (0/1) | 0.175 |
| mathhay | 1,500 | 5 | 4 | 4.2 | binary (0/1) | 0.410 |
| search | 3,980 | 5 | 4 | 19.7 | binary (0/1) | 0.181 |
| mcpbench | 1,040 | 5 | 4 | 26.0 | continuous (0–10) | 2.936 |
| **Total** | **10,098** | | | | | |

**Models**: DeepSeek-R1, DeepSeek-V3.2, Gemini-2.5-Flash, Qwen3-235B, Qwen3-Next

---

## Source Data

Raw data lives in `parallel_scaling_results/`, organized as:

```
{Model}_{benchmark}_distraction_{scope}/
  pass_{1..4}/
    evaluations/   # eval results (reward, test output, etc.)
    traces/        # agent conversation traces (messages)
```

Each task was run 4 times (4 passes) per model under a **distraction condition**: irrelevant content was injected into the agent's context to test robustness.

---

## Processing Pipeline

### Step 1: Load & Pair Files

For each `(model, benchmark, pass)`:

- **Eval file** → reward, test results, benchmark-specific metadata
- **Trace file** → conversation messages (the agent trajectory)

Files are paired by matching filename. The search benchmark required special handling (see below).

### Step 2: Clean Distraction Artifacts

The distraction condition injected two types of artifacts into **user messages**:

| Artifact | Description | Example |
|---|---|---|
| `<reasoning>...</reasoning>` | Fake reasoning blocks injected into user turns | Model's internal reasoning inserted as distraction |
| `<tool_response_begin>...<tool_response_end>` | Fake tool responses injected into user turns | Fabricated tool output to mislead the agent |

**Cleaning strategy** (zero-hallucination guarantee):

1. Regex-match only closed tag pairs: `<reasoning>.*?</reasoning>` and `<tool_response_begin>.*?<tool_response_end>`
2. Remove matched content: pure deletion, no content generation
3. Clean up leftover separator lines (`---`) and excess newlines
4. Log every removal in the `cleaning_info` field (message index, position, length)
5. All other content is preserved byte-identical to source

**What is NOT cleaned** (preserved as-is):

- DeepSeek special tokens (`<｜tool▁calls▁begin｜>`, `<｜tool▁sep｜>`, etc.): these are legitimate model output
- Any `<reasoning>` or similar tags in **assistant** messages: these are part of the model's own response format
- Very long messages: no truncation is applied

**Cleaning stats**: 191 records affected (all in tau2bench); 542 reasoning blocks and 2 tool_response blocks removed.

### Step 3: Extract & Assemble Record

Each record is assembled from trace + eval:

| Field | Source | Description |
|---|---|---|
| `id` | Generated | Unique ID: `{benchmark}__{model}__{domain}__{task_id}__pass{n}` |
| `benchmark` | Config | One of: tau2bench, swebench, terminalbench, mathhay, search, mcpbench |
| `domain` | Eval/Trace | Task domain (see per-benchmark details below) |
| `task_id` | Eval/Trace | Original task identifier |
| `source_model` | Config | Model that generated this trajectory |
| `pass` | Config | Pass number (1–4) |
| `messages` | Trace | Cleaned conversation in standard chat format |
| `num_turns` | Computed | Number of messages |
| `reward` | Eval | Task reward (benchmark-specific, see below) |
| `eval_details` | Eval | Full eval metadata (all original eval fields) |
| `cleaning_info` | Computed | List of removed artifacts, or null if no cleaning was done |

### Step 4: Output

- **JSONL**: one JSON object per line, human-readable
- **Parquet**: messages/eval_details/cleaning_info stored as JSON strings, all scalar fields as native types
- Split by benchmark (6 files per format)

---

## Message Format (Standard Chat)

Messages follow the OpenAI-style chat format:

```json
[
  {"role": "system", "content": "You are a helpful assistant..."},
  {"role": "user", "content": "Please help me with..."},
  {"role": "assistant", "content": "", "tool_calls": [
    {"id": "call_123", "type": "function", "function": {"name": "get_order", "arguments": "{\"order_id\": \"#W123\"}"}}
  ]},
  {"role": "tool", "content": "{\"status\": \"delivered\"}", "tool_call_id": "call_123"},
  {"role": "assistant", "content": "Your order has been delivered."}
]
```

**Roles**: `system`, `user`, `assistant`, `tool`

- `assistant` messages may include `tool_calls` (list of function calls)
- `tool` messages contain the tool execution result

---

## Per-Benchmark Details

### tau2bench (1,000 records)

- **Task**: Agent interacts with simulated users to complete customer service tasks (booking, cancellation, etc.)
- **Domains**: airline (200), retail (400), telecom (400)
- **Reward source**: `eval_data["reward_info"]["reward"]`, binary 0/1
- **Success rate**: 38.5%
- **eval_details includes**: reward_info (with reward_breakdown, action_checks, db_check), termination_reason, duration
- **Cleaning**: 191 records had distraction artifacts removed

### swebench (998 records)

- **Task**: Agent resolves real GitHub issues by writing code patches
- **Domains**: 12 Python repos (django, astropy, sympy, matplotlib, etc.)
- **Reward source**: `eval_data["reward"]`, binary 0/1 (resolved or not)
- **Success rate**: 16.6%
- **eval_details includes**: patch, gold_patch, test_output, report (resolved, patch_exists, tests_status)
- **Note**: 78 records have empty traces (agent errored/timed out before producing messages)

### terminalbench (1,580 records)

- **Task**: Agent completes terminal/system tasks in Docker containers
- **Domains**: 10 categories (software-engineering, system-administration, security, etc.)
- **Reward source**: `eval_data["reward"]`, binary 0/1
- **Success rate**: 17.5%
- **eval_details includes**: test_output, test_passed, status, execution_time
- **Note**: 24 records have empty traces

### mathhay (1,500 records)

- **Task**: Agent answers math questions requiring information retrieval from a large context (needle-in-haystack)
- **Domain**: 3s3d (3 sub-questions, 3 distractor documents)
- **Reward source**: `eval_data["score"]`, binary 0/1
- **Success rate**: 41.0%
- **eval_details includes**: question, golden_answer, predicted_answer, raw_response, numerical_match, llm_judge, context_length, num_relevant_docs, num_irrelevant_docs
- **Note**: Very long messages (~554K chars) due to large context; preserved without truncation

### search (3,980 records)

- **Task**: Agent searches the web to answer complex questions
- **Domains**: browsecomp (2,480), webvoyager (1,300), mind2web (200)
- **Reward source**: `eval_data["score"]`, binary 0/1
- **Success rate**: 18.1%
- **eval_details includes**: question, answer, ground_truth, search count, script count, context lengths, total_tokens
- **Special handling**: Eval files (`result_N.json`) and trace files (`{dataset}_N.json`) have different naming. Mapping was resolved via `summary.json`. DeepSeek-V3.2 had a different summary format (`task_id="result_154"` vs `"154"`), handled by probing trace file prefixes.
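
The two `task_id` shapes mentioned under special handling can be reconciled with a small normalizer. The following is a minimal sketch, not the pipeline's actual code: `normalize_task_id` is a hypothetical helper, and it assumes the only shapes that occur are a bare number (`"154"`) and a `result_`-prefixed number (`"result_154"`). It does not reproduce the `summary.json` lookup or the trace-prefix probing.

```python
import re

# Hypothetical helper for illustration only. Accepts "154" or "result_154".
_TASK_ID_RE = re.compile(r"^(?:result_)?(\d+)$")


def normalize_task_id(task_id: str) -> str:
    """Return the bare numeric part of a search task ID."""
    match = _TASK_ID_RE.match(task_id.strip())
    if match is None:
        # Fail loudly rather than guess at an unknown format.
        raise ValueError(f"unrecognized task_id format: {task_id!r}")
    return match.group(1)


# Both summary formats map to the same key.
print(normalize_task_id("154"))
print(normalize_task_id("result_154"))
```

Raising on unknown formats, instead of passing them through, keeps a silent mis-pairing of eval and trace files from slipping into the output.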
### mcpbench (1,040 records)

- **Task**: Agent uses MCP (Model Context Protocol) tool servers to complete complex multi-tool tasks
- **Domains**: 52 unique server/server-combination names
- **Reward source**: `eval_data["evaluation"]["task_completion_score"]`, continuous 0–10
- **Score distribution**: min=0.0, max=8.43, mean=2.94, 266 unique values
- **eval_details includes**: Full evaluation sub-scores:
  - task_fulfillment, grounding, tool_appropriateness, parameter_accuracy
  - dependency_awareness, parallelism_and_efficiency
  - task_completion_score, tool_selection_score, planning_effectiveness_and_efficiency_score
  - input_schema_compliance, valid_tool_name_rate, execution_success_rate
- **Note**: 18 records have empty traces

---

## Quality Assurance

Three rounds of automated audits were performed before publication:

| Check | Scope | Result |
|---|---|---|
| Messages exact match vs source | Sampled + full (swebench) | PASS |
| Reward match vs source eval | All 10,098 records | PASS |
| Task ID / domain match | All 10,098 records | PASS |
| Cleaning safety (deletion only, no hallucination) | All 191 cleaned records | PASS |
| DeepSeek special tokens preserved | All 10,098 records | PASS |
| No residual distraction tags | All 10,098 records | PASS |
| Long messages not truncated | Top 10 verified (554K chars) | PASS |
| Empty traces have valid eval/reward | All 121 empty-trace records | PASS |
| Parquet-JSONL consistency | All 6 benchmarks | PASS |
| Schema consistency (11 fields) | All 10,098 records | PASS |
| No duplicate IDs | All 6 benchmarks | PASS |
| mcpbench raw continuous scores | All 1,040 records | PASS |

---

## Usage

```python
import json

from datasets import load_dataset

# Load all benchmarks
ds = load_dataset("wuwendy/agent_trajectories")

# Load a specific benchmark from a local JSONL file
with open("tau2bench.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

# Access a record
rec = records[0]
print(rec["messages"])      # conversation trajectory
print(rec["reward"])        # task reward
print(rec["eval_details"])  # full eval metadata
```

---

## Files

```
dataset_clean/
  tau2bench.jsonl       (32 MB)
  tau2bench.parquet     (9.6 MB)
  swebench.jsonl        (99 MB)
  swebench.parquet      (27 MB)
  terminalbench.jsonl   (53 MB)
  terminalbench.parquet (15 MB)
  mathhay.jsonl         (727 MB)
  mathhay.parquet       (374 MB)
  search.jsonl          (103 MB)
  search.parquet        (42 MB)
  mcpbench.jsonl        (68 MB)
  mcpbench.parquet      (23 MB)
  tau2bench_cleaning_report.json
  search_cleaning_report.json
```
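
Because the Parquet files store `messages`, `eval_details`, and `cleaning_info` as JSON strings, those columns need decoding after loading. The sketch below shows one way to do that. It assumes each row is already available as a plain `dict` (for example from a pandas DataFrame via `to_dict("records")`) and that a missing `cleaning_info` arrives as `None` or as the JSON literal `null`. `decode_row` is a hypothetical helper name, and the example row is made up for illustration.

```python
import json

# Fields the documentation says are stored as JSON strings in Parquet.
JSON_FIELDS = ("messages", "eval_details", "cleaning_info")


def decode_row(row: dict) -> dict:
    """Return a copy of a Parquet row with its JSON-string fields parsed."""
    decoded = dict(row)
    for field in JSON_FIELDS:
        value = decoded.get(field)
        # Only strings need parsing; None (no cleaning done) passes through.
        if isinstance(value, str):
            decoded[field] = json.loads(value)
    return decoded


# Hand-written row shaped like the schema above (all values are invented).
example = {
    "id": "tau2bench__ExampleModel__retail__0__pass1",
    "reward": 1.0,
    "messages": '[{"role": "user", "content": "hi"}]',
    "eval_details": '{"duration": 1.5}',
    "cleaning_info": None,
}
row = decode_row(example)
print(row["messages"][0]["role"])
```

Records read from the JSONL files should not need this step, since each line is already a complete JSON object.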

Provider:

wuwendy
