fuvty/tau-bench-synthetic

Name: fuvty/tau-bench-synthetic
Creator: fuvty
Published: 2026-04-08 19:06:05
License: 暂无描述

Hugging Face2026-04-08 更新2026-04-12 收录

下载链接：

https://hf-mirror.com/datasets/fuvty/tau-bench-synthetic

下载链接

链接失效反馈

官方服务：

资源简介：

--- language: - en license: apache-2.0 task_categories: - text-generation tags: - tool-use - function-calling - synthetic - tau-bench - agent pretty_name: tau-bench-synthetic size_categories: - 1K<n<10K configs: - config_name: tasks data_files: - split: train path: tasks/train-* - config_name: traj-GLM5 data_files: - split: train path: traj-GLM5/train-* - config_name: sft-GLM5 data_files: - split: train path: sft-GLM5/train-* --- # tau-bench-synthetic Synthetic tool-use training data for [tau-bench](https://github.com/sierra-research/tau-bench), generated using a **GT-first task construction pipeline** with GLM-5 (via Fireworks API) as the trajectory generator. ## Overview This dataset was built to train small LLMs (e.g., Qwen3-1.7B) on multi-turn tool-use tasks without using the original tau-bench evaluation set. The pipeline follows a GT-first approach: ground-truth actions are constructed programmatically from the database, then an LLM generates natural user scenarios that lead to those exact outcomes. ## Subsets | Subset | Rows | Description | |--------|------|-------------| | | 280 | Generated task definitions with ground truth and evaluation criteria | | | 1,464 | Complete GLM-5 agent trajectories (with reasoning) | | | 4,270 | Training-ready SFT rows (per-round split, reasoning stripped) | ### Task definitions in tau-bench format, covering **retail** (180) and **airline** (100) domains. | Column | Type | Description | |--------|------|-------------| | uid=6007425(tianyuf) gid=6007425(tianyuf) groups=6007425(tianyuf),60136(catalyst) | string | Task identifier (e.g., ) | | | string | or | | | string (JSON) | Task description with purpose and relevant policies | | | string (JSON) | Structured user scenario with persona and instructions | | | string (JSON) | Initial database state for the task | | | string (JSON) | Ground-truth actions, communication requirements, and NL assertions | | | bool | Whether the task passed environment validation (183/280 = 65%) | ### Complete multi-turn agent trajectories generated by GLM-5 with **reasoning enabled**. Each trajectory is one full episode of an agent solving a task. | Column | Type | Description | |--------|------|-------------| | | string | Task identifier | | | int | Trial number (up to 8 trials per task) | | | string | or | | | string (JSON) | Full conversation including in assistant messages | | | string (JSON) | Tool schemas available to the agent | | | float | 1.0 = task completed successfully, 0.0 = failed | | | string | How the episode ended (e.g., , ) | | | float | Wall-clock time for the episode | Pass rates: **795/1,000 retail (80%)**, **405/464 airline (87%)**. ### Training-ready rows derived from passing trajectories. Each row is a single assistant "round" from a multi-turn conversation, created via . | Column | Type | Description | |--------|------|-------------| | | string (JSON) | Conversation up to this round, **reasoning stripped** | | | string (JSON) | Tool schemas for this domain | | | int | Round index within the trajectory | | | int | Total rounds in the source trajectory | | | string | Source task identifier | | | int | Source trial number | | | string | or | ## Pipeline ### Task Templates **Retail (9 templates):** CancelPending, ReturnDelivered, ExchangeDelivered, ModifyPendingItems, ModifyPendingAddress, ModifyPendingPayment, ModifyUserAddress, CancelAndReturn (compound), AddressFixAndModifyItems (compound). **Airline (5 templates):** CancelReservation, ModifyFlights, AddBaggage, RefusalBasicEconomyModify (no-write), RefusalUncancellable (no-write). ### GT-First Approach Each template: 1. **Samples** valid entity combinations from the tau-bench database 2. **Constructs** exact expected API calls (ground truth) programmatically 3. **Generates** structured hints (auth info, reason for call, preferences) without leaking tool names or IDs 4. **Calls GLM-5** to produce a natural user scenario from the hints This ensures tasks are **easy to verify** (exact GT match) but **hard to solve** (natural language with implicit requirements). ## Usage ## Generation Details - **Generator model:** GLM-5 via Fireworks API - **Reasoning:** Enabled during trajectory generation, stripped for SFT training - **Validation:** Tasks validated through the tau-bench environment (reward=1 required) - **Trials:** 8 per validated task during trajectory collection - **Domains:** retail (15 tools), airline (varies by template) - **Total API budget:** ~2,024 trajectory episodes ## Related - [tau-bench](https://github.com/sierra-research/tau-bench) — The original benchmark - [Cache-to-Cache (C2C)](https://arxiv.org/abs/2510.03215) — The project this data was built for

提供机构：

fuvty

5,000+

优质数据集

54 个

任务类型

进入经典数据集