fuvty/tau-bench-synthetic
收藏Hugging Face2026-04-08 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/fuvty/tau-bench-synthetic
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- tool-use
- function-calling
- synthetic
- tau-bench
- agent
pretty_name: tau-bench-synthetic
size_categories:
- 1K<n<10K
configs:
- config_name: tasks
data_files:
- split: train
path: tasks/train-*
- config_name: traj-GLM5
data_files:
- split: train
path: traj-GLM5/train-*
- config_name: sft-GLM5
data_files:
- split: train
path: sft-GLM5/train-*
---
# tau-bench-synthetic
Synthetic tool-use training data for [tau-bench](https://github.com/sierra-research/tau-bench), generated using a **GT-first task construction pipeline** with GLM-5 (via Fireworks API) as the trajectory generator.
## Overview
This dataset was built to train small LLMs (e.g., Qwen3-1.7B) on multi-turn tool-use tasks without using the original tau-bench evaluation set. The pipeline follows a GT-first approach: ground-truth actions are constructed programmatically from the database, then an LLM generates natural user scenarios that lead to those exact outcomes.
## Subsets
| Subset | Rows | Description |
|--------|------|-------------|
| | 280 | Generated task definitions with ground truth and evaluation criteria |
| | 1,464 | Complete GLM-5 agent trajectories (with reasoning) |
| | 4,270 | Training-ready SFT rows (per-round split, reasoning stripped) |
###
Task definitions in tau-bench format, covering **retail** (180) and **airline** (100) domains.
| Column | Type | Description |
|--------|------|-------------|
| uid=6007425(tianyuf) gid=6007425(tianyuf) groups=6007425(tianyuf),60136(catalyst) | string | Task identifier (e.g., ) |
| | string | or |
| | string (JSON) | Task description with purpose and relevant policies |
| | string (JSON) | Structured user scenario with persona and instructions |
| | string (JSON) | Initial database state for the task |
| | string (JSON) | Ground-truth actions, communication requirements, and NL assertions |
| | bool | Whether the task passed environment validation (183/280 = 65%) |
###
Complete multi-turn agent trajectories generated by GLM-5 with **reasoning enabled**. Each trajectory is one full episode of an agent solving a task.
| Column | Type | Description |
|--------|------|-------------|
| | string | Task identifier |
| | int | Trial number (up to 8 trials per task) |
| | string | or |
| | string (JSON) | Full conversation including in assistant messages |
| | string (JSON) | Tool schemas available to the agent |
| | float | 1.0 = task completed successfully, 0.0 = failed |
| | string | How the episode ended (e.g., , ) |
| | float | Wall-clock time for the episode |
Pass rates: **795/1,000 retail (80%)**, **405/464 airline (87%)**.
###
Training-ready rows derived from passing trajectories. Each row is a single assistant "round" from a multi-turn conversation, created via .
| Column | Type | Description |
|--------|------|-------------|
| | string (JSON) | Conversation up to this round, **reasoning stripped** |
| | string (JSON) | Tool schemas for this domain |
| | int | Round index within the trajectory |
| | int | Total rounds in the source trajectory |
| | string | Source task identifier |
| | int | Source trial number |
| | string | or |
## Pipeline
### Task Templates
**Retail (9 templates):** CancelPending, ReturnDelivered, ExchangeDelivered, ModifyPendingItems, ModifyPendingAddress, ModifyPendingPayment, ModifyUserAddress, CancelAndReturn (compound), AddressFixAndModifyItems (compound).
**Airline (5 templates):** CancelReservation, ModifyFlights, AddBaggage, RefusalBasicEconomyModify (no-write), RefusalUncancellable (no-write).
### GT-First Approach
Each template:
1. **Samples** valid entity combinations from the tau-bench database
2. **Constructs** exact expected API calls (ground truth) programmatically
3. **Generates** structured hints (auth info, reason for call, preferences) without leaking tool names or IDs
4. **Calls GLM-5** to produce a natural user scenario from the hints
This ensures tasks are **easy to verify** (exact GT match) but **hard to solve** (natural language with implicit requirements).
## Usage
## Generation Details
- **Generator model:** GLM-5 via Fireworks API
- **Reasoning:** Enabled during trajectory generation, stripped for SFT training
- **Validation:** Tasks validated through the tau-bench environment (reward=1 required)
- **Trials:** 8 per validated task during trajectory collection
- **Domains:** retail (15 tools), airline (varies by template)
- **Total API budget:** ~2,024 trajectory episodes
## Related
- [tau-bench](https://github.com/sierra-research/tau-bench) — The original benchmark
- [Cache-to-Cache (C2C)](https://arxiv.org/abs/2510.03215) — The project this data was built for
提供机构:
fuvty



