inclusionAI/AReaL-tau2-data
Source: Hugging Face | Updated: 2026-03-02 | Indexed: 2026-05-10
Download link:
https://hf-mirror.com/datasets/inclusionAI/AReaL-tau2-data
Resource description:
---
license: apache-2.0
configs:
- config_name: default
  data_files:
  - split: sft
    path: tau2_sft_train.jsonl
  - split: rl
    path: tau2_rl_train.jsonl
task_categories:
- text-generation
tags:
- tool-use
- agent
- multi-turn
- reinforcement-learning
- tau2-bench
- AReaL
language:
- en
size_categories:
- 10K<n<100K
---
# AReaL-tau2-data
Synthetic training data for multi-turn interactive tool-using agents, generated by **SEA**, a self-evolving multi-agent data engine. The dataset was used to train [AReaL-SEA-235B-A22B](https://huggingface.co/inclusionAI/AReaL-SEA-235B-A22B), which achieves state-of-the-art results on [τ²-bench](https://github.com/sierra-research/tau2-bench).
- **Paper**: [From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents](https://arxiv.org/abs/2601.22607)
- **Training Framework**: [AReaL](https://github.com/inclusionAI/AReaL)
- **Benchmark**: [τ²-bench](https://github.com/sierra-research/tau2-bench)
## Dataset Overview
The dataset covers three customer-service domains from τ²-bench: **Airline**, **Retail**, and **Telecom**. It contains two splits designed for a two-stage post-training pipeline (SFT → RL):
| File | Purpose | Samples | Airline | Retail | Telecom |
|---|---|---|---|---|---|
| `tau2_sft_train.jsonl` | Supervised Fine-Tuning | 33,531 | 12,842 | 11,395 | 9,294 |
| `tau2_rl_train.jsonl` | Reinforcement Learning | 1,982 | 1,148 | 563 | 271 |
Additionally, `tau2_rl_database/` contains the environment database snapshots required for RL rollouts.
## SFT Data Format
Each line in `tau2_sft_train.jsonl` is a JSON object representing a single training example (one assistant turn in context):
```json
{
  "messages": [
    {"role": "system", "content": "<system prompt with policy and tools>"},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "tool", "content": "..."},
    ...
  ],
  "answer": {
    "role": "assistant",
    "content": "...",
    "thinking": "...",
    "tool_calls": [...]
  },
  "metadata": {
    "source_dialog_id": "airline_dialog_42",
    "turn_index": 2,
    "reason_for_call": "...",
    "scenario_id": "scenario_42",
    "correct": 1,
    "reward": 1.0
  }
}
```
| Field | Description |
|---|---|
| `messages` | Conversation history up to the current turn (system, user, assistant, tool messages) |
| `answer` | The ground-truth assistant response to train on, including chain-of-thought (`thinking`) and tool calls |
| `metadata` | Provenance info: source dialog, turn index, task description, and correctness label |
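As a rough illustration of how one SFT record might be turned into a training sequence, the sketch below appends `answer` to the `messages` history so that the final assistant turn is the learning target. The helper names `to_chat_example` and `keep_record` are hypothetical, and filtering on `metadata.correct == 1` is an assumption about how the correctness label could be used, not a documented recipe. The sample record is a toy stand-in for a real line of `tau2_sft_train.jsonl`.

```python
import json


def to_chat_example(record: dict) -> list[dict]:
    """Return history plus the target assistant turn for one SFT record."""
    # `messages` holds the conversation up to (not including) the target turn.
    history = list(record["messages"])
    # `answer` carries role, content, thinking, and tool_calls; copy it so the
    # original record is left untouched.
    target = dict(record["answer"])
    return history + [target]


def keep_record(record: dict) -> bool:
    """Hypothetical filter: keep only turns labelled correct in `metadata`."""
    return record.get("metadata", {}).get("correct") == 1


# Toy record with the documented top-level fields (values are made up).
sample_line = json.dumps({
    "messages": [
        {"role": "system", "content": "policy and tools"},
        {"role": "user", "content": "I need to change my flight."},
    ],
    "answer": {
        "role": "assistant",
        "content": "Sure, may I have your user id?",
        "thinking": "Need the user id before any lookup.",
        "tool_calls": [],
    },
    "metadata": {"turn_index": 0, "correct": 1, "reward": 1.0},
})

record = json.loads(sample_line)
chat = to_chat_example(record)
print(len(chat), chat[-1]["role"], keep_record(record))
```

How `thinking` and `tool_calls` are rendered into the final token sequence depends on the chat template of the model being trained, which this sketch deliberately leaves out.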
## RL Data Format
Each line in `tau2_rl_train.jsonl` is a JSON object representing a complete task specification. The format is largely compatible with τ²-bench tasks, with one critical addition: the **`db_path` field**.
```json
{
  "id": "airline_1",
  "description": {"purpose": "Customer service simulation for airline domain"},
  "user_scenario": {
    "instructions": {
      "task_instructions": "YOUR GOAL: ...",
      "domain": "airline",
      "reason_for_call": "...",
      "known_info": "You are Mia Li. Your user id is mia_li_3668. ..."
    }
  },
  "evaluation_criteria": "{\"actions\": [...], \"communicate_info\": [...]}",
  "db_path": "tau2_rl_database/tau2_airline_new_db_3.json"
}
```
Telecom tasks may additionally include `initial_state` (environment initialization actions) and `ticket` (customer support ticket description).
| Field | Description |
|---|---|
| `id` | Unique identifier, prefixed by domain (`airline_*`, `retail_*`, `telecom_*`) |
| `description` | Task metadata (purpose, type, difficulty) |
| `user_scenario` | User simulator instructions: task goals, persona, known information, behavioral guidance |
| `evaluation_criteria` | JSON string containing ground-truth action sequences and assertion-based verification functions, used as the reward signal for RL |
| **`db_path`** | **Path to the environment database snapshot for this task. This is critical — each RL task operates on a specific database state, and the agent's tool calls execute against this database during rollouts. Different tasks may point to different database files to ensure diverse environment states.** |
| `initial_state` | *(Telecom only)* Initialization actions to set up the user/assistant environment before the conversation starts |
| `ticket` | *(Telecom only)* Customer support ticket that provides the assistant with initial context |
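Because `evaluation_criteria` is stored as a JSON string rather than a nested object, it needs a second decoding step after the line itself is parsed. The sketch below shows one way to do that and to derive the domain from the `id` prefix. The function name `parse_rl_task` and the shape of the returned dictionary are illustrative choices, and the sample task is a toy stand-in for a real line of `tau2_rl_train.jsonl`.

```python
import json


def parse_rl_task(line: str) -> dict:
    """Parse one JSONL line into a task dict with decoded evaluation criteria."""
    task = json.loads(line)
    # Ids are prefixed by domain: airline_*, retail_*, telecom_*.
    domain = task["id"].split("_", 1)[0]
    criteria = task["evaluation_criteria"]
    # The criteria are stored as a JSON string, so decode them a second time.
    if isinstance(criteria, str):
        criteria = json.loads(criteria)
    return {
        "id": task["id"],
        "domain": domain,
        "criteria": criteria,
        "db_path": task.get("db_path"),
        # Optional fields that only some telecom tasks carry.
        "initial_state": task.get("initial_state"),
        "ticket": task.get("ticket"),
    }


# Toy task mirroring the documented fields (values are made up).
sample_line = json.dumps({
    "id": "airline_1",
    "description": {"purpose": "Customer service simulation for airline domain"},
    "user_scenario": {"instructions": {"domain": "airline"}},
    "evaluation_criteria": json.dumps({"actions": [], "communicate_info": []}),
    "db_path": "tau2_rl_database/tau2_airline_new_db_3.json",
})

parsed = parse_rl_task(sample_line)
print(parsed["domain"], sorted(parsed["criteria"]))
```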
### Why `db_path` Matters
In τ²-bench, the environment state (user accounts, reservations, flight schedules, product inventory, etc.) determines whether a task is solvable and what the correct tool-call sequence should be. Unlike the original τ²-bench where all tasks share a single default database, **our RL data uses multiple database snapshots** (`tau2_rl_database/`) to create diverse training environments. This design:
1. **Enables scalable task generation** — new tasks can be created by varying both the user scenario and the underlying database state.
2. **Prevents overfitting** — the agent must generalize across different environment configurations rather than memorizing a fixed database.
3. **Supports verifiable rewards** — the verification functions in `evaluation_criteria` check the final database state after rollout, so the correct database must be loaded for accurate reward computation.
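Since rewards are computed from the final database state, each rollout presumably needs its own untouched copy of the snapshot named by `db_path`; otherwise writes from one rollout would leak into the next. The sketch below caches each parsed snapshot once and hands out deep copies. `SnapshotStore`, `fresh_state`, and the toy database file are hypothetical and are not part of AReaL or τ²-bench.

```python
import copy
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_json(path: Path) -> dict:
    """Read a JSON snapshot from disk."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)


class SnapshotStore:
    """Hypothetical helper: parse each snapshot once, return fresh copies."""

    def __init__(self, root: Path, loader: Callable[[Path], dict]):
        self._root = Path(root)
        self._loader = loader
        self._cache: dict[str, dict] = {}

    def fresh_state(self, db_path: str) -> dict:
        # Parse the file only on first use, keyed by the task's db_path.
        if db_path not in self._cache:
            self._cache[db_path] = self._loader(self._root / db_path)
        # Deep copy so that tool calls in one rollout cannot affect another.
        return copy.deepcopy(self._cache[db_path])


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tau2_rl_database").mkdir()
    # Toy snapshot for illustration only; real snapshots have their own schema.
    toy = root / "tau2_rl_database" / "toy_db.json"
    toy.write_text(json.dumps({"reservations": {"R1": {"status": "active"}}}),
                   encoding="utf-8")

    store = SnapshotStore(root, load_json)
    first = store.fresh_state("tau2_rl_database/toy_db.json")
    first["reservations"]["R1"]["status"] = "cancelled"  # simulated tool call
    second = store.fresh_state("tau2_rl_database/toy_db.json")

print(first["reservations"]["R1"]["status"], second["reservations"]["R1"]["status"])
```

Deep copying trades memory for isolation; with large snapshots and many parallel rollouts, a copy-on-write scheme might be preferable.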
## Environment Databases
```
tau2_rl_database/
├── tau2_airline_db.json # Original airline database
├── tau2_airline_new_db_1.json # Extended airline database variants
├── tau2_airline_new_db_2.json
├── tau2_airline_new_db_3.json
├── tau2_retail_new_db_1.json # Retail database variants
├── tau2_retail_new_db_2.json
├── tau2_retail_new_db_3.json
├── tau2_retail_new_db_4.json
└── tau2_telecom_db.toml # Telecom database (TOML format)
```
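The listing mixes JSON snapshots with one TOML file, so a loader has to dispatch on the file extension. The following is a minimal sketch under that assumption. `load_database` is a hypothetical helper, the toy files are not real snapshots, and `tomllib` is only available in the standard library from Python 3.11, so the sketch degrades to JSON-only on older interpreters.

```python
import json
import tempfile
from pathlib import Path

try:
    import tomllib  # standard library from Python 3.11 onward
except ModuleNotFoundError:
    tomllib = None


def load_database(path) -> dict:
    """Load a snapshot as a dict, choosing the parser from the file suffix."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open("r", encoding="utf-8") as f:
            return json.load(f)
    if suffix == ".toml":
        if tomllib is None:
            raise RuntimeError("TOML snapshots need Python 3.11+ (tomllib)")
        # tomllib.load expects a file opened in binary mode.
        with path.open("rb") as f:
            return tomllib.load(f)
    raise ValueError(f"Unsupported database format: {path.name}")


with tempfile.TemporaryDirectory() as tmp:
    # Toy files for illustration only; real snapshots have their own schema.
    json_file = Path(tmp) / "toy_db.json"
    json_file.write_text(json.dumps({"users": {"u1": {"name": "Mia"}}}),
                         encoding="utf-8")
    toml_file = Path(tmp) / "toy_db.toml"
    toml_file.write_text('[plans]\nbasic = 10\n', encoding="utf-8")

    json_db = load_database(json_file)
    toml_db = load_database(toml_file) if tomllib is not None else None

print(json_db["users"]["u1"]["name"])
```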
## Citation
```bibtex
@article{gao2025sea,
  title={From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents},
  author={Gao, Jiaxuan and Chen, Jiaao and He, Chuyi and Wang, Wei-Chen and Xu, Shusheng and Wang, Hanrui and Jin, Di and Wu, Yi},
  journal={arXiv preprint arXiv:2601.22607},
  year={2025}
}
@article{fu2025areal,
  title={AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning},
  author={Fu, Wei and Gao, Jiaxuan and Shen, Xujie and Zhu, Chen and Mei, Zhiyu and He, Chuyi and Xu, Shusheng and Wei, Guo and Mei, Jun and Wang, Jiashu and Yang, Tongkai and Yuan, Binhang and Wu, Yi},
  journal={arXiv preprint arXiv:2505.24298},
  year={2025}
}
```
Provider:
inclusionAI



