vibrantlabsai/tau2-infinity
Source: Hugging Face | Updated: 2026-04-21 | Indexed: 2026-05-10
Download link:
https://hf-mirror.com/datasets/vibrantlabsai/tau2-infinity
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- benchmark
- tool-use
- agent
- function-calling
- airline
size_categories:
- n<1K
---
# tau2-infinity
An adaptive benchmark for evaluating LLM tool-use agents on airline customer service tasks. Generated using EnvScaler by VibrantLabs.
## Overview
Each task requires an agent to transform an initial database state **S_0** into a golden final state **S*** by executing a sequence of tool calls (flight searches, bookings, cancellations, updates, etc.). Tasks were adaptively generated to target specific difficulty levels against a calibration model.
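As an illustration of the state-transformation framing, the sketch below applies a sequence of tool calls to a toy database and checks whether the result equals a golden state. The database layout, the `book_flight` and `cancel_reservation` tools, the flight codes, and the exact-equality check are all hypothetical stand-ins chosen for this example; the real tool specifications ship with each task, and the benchmark's actual grading logic is not described in this card.

```python
import copy

# Hypothetical toy initial state (S_0); real states come from each task's data.
S_0 = {"reservations": {"R1": {"flight": "HAT001", "status": "booked"}}}

# Hypothetical golden final state (S*): R1 cancelled, R2 newly booked.
S_STAR = {
    "reservations": {
        "R1": {"flight": "HAT001", "status": "cancelled"},
        "R2": {"flight": "HAT002", "status": "booked"},
    }
}


def book_flight(db, reservation_id, flight):
    """Hypothetical tool: add a booked reservation."""
    db["reservations"][reservation_id] = {"flight": flight, "status": "booked"}


def cancel_reservation(db, reservation_id):
    """Hypothetical tool: mark an existing reservation as cancelled."""
    db["reservations"][reservation_id]["status"] = "cancelled"


TOOLS = {"book_flight": book_flight, "cancel_reservation": cancel_reservation}


def run_trajectory(initial_db, calls):
    """Apply (tool_name, kwargs) calls to a copy of the initial state."""
    db = copy.deepcopy(initial_db)
    for name, kwargs in calls:
        TOOLS[name](db, **kwargs)
    return db


calls = [
    ("cancel_reservation", {"reservation_id": "R1"}),
    ("book_flight", {"reservation_id": "R2", "flight": "HAT002"}),
]
final_state = run_trajectory(S_0, calls)
passed = final_state == S_STAR  # assumed success criterion: exact state match
print(passed)
```

Working on a deep copy keeps S_0 untouched, so the same initial state can be reused across repeated evaluation runs.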
| Property | Value |
|----------|-------|
| Number of tasks | 13 |
| Target pass rate | [0.2, 0.6] |
| Achieved avg pass rate | 0.354 |
| Calibration model | `fireworks_ai/accounts/vibrantlabs/deployments/bv8h7e5g` |
| Evaluation runs per task | 5 |
| Generation iterations run to collect the tasks | 50 |
| Collection rate (tasks kept / iterations) | 26.0% |
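The figures in this table can be cross-checked with a little arithmetic. The sketch below assumes that a generated task is kept when its pass rate falls inside the target band (bounds inclusive) and that the collection rate is the number of kept tasks divided by the number of iterations; the card does not spell out either rule, so treat both as assumptions. The function names are illustrative only.

```python
NUM_TASKS = 13
TOTAL_ITERATIONS = 50
RUNS_PER_TASK = 5
TARGET_BAND = (0.2, 0.6)


def pass_rate(passes, runs=RUNS_PER_TASK):
    """Fraction of evaluation runs that succeeded."""
    return passes / runs


def in_target_band(rate, band=TARGET_BAND, tol=1e-9):
    """Assumed acceptance rule: keep a task if its rate is inside the band."""
    low, high = band
    return low - tol <= rate <= high + tol


# Assumed definition: kept tasks divided by generation iterations.
collection_rate = NUM_TASKS / TOTAL_ITERATIONS

# With 5 runs per task, which pass counts land inside the band?
accepted_pass_counts = [
    k for k in range(RUNS_PER_TASK + 1) if in_target_band(pass_rate(k))
]

print(f"collection rate: {collection_rate:.1%}")
print("accepted pass counts:", accepted_pass_counts)
```

Under these assumptions, a task qualifies only when the calibration model passes it in 1, 2, or 3 of its 5 runs, which is why every listed pass rate is a multiple of 0.2.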
## Dataset Schema
| Column | Type | Description |
|--------|------|-------------|
| `task_id` | string | Unique task identifier |
| `task_description` | string | Natural language task the agent must complete |
| `tools` | JSON string | Tool specifications available to the agent |
| `database` | JSON string | Initial database state (S_0) |
| `golden_trajectory` | JSON string | Resolved DAG with oracle tool calls and expected outputs |
| `pass_rate` | float | Pass rate achieved by the calibration model (0.0 - 1.0) |
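Because three of the columns are stored as JSON strings, a small helper that decodes them and reports their top-level shape can be handy when first exploring a row. The sketch below makes no assumption about the keys inside `tools`, `database`, or `golden_trajectory`; the helper names `decode_task` and `summarize` and the example row are made up for illustration.

```python
import json

JSON_COLUMNS = ("tools", "database", "golden_trajectory")


def decode_task(row):
    """Return a copy of `row` with the JSON-string columns parsed."""
    decoded = dict(row)
    for column in JSON_COLUMNS:
        decoded[column] = json.loads(row[column])
    return decoded


def summarize(value):
    """Describe the top-level shape of a parsed JSON value."""
    if isinstance(value, dict):
        return f"object with keys {sorted(value)}"
    if isinstance(value, list):
        return f"array of {len(value)} items"
    return type(value).__name__


# Made-up row for illustration only; real rows come from the dataset.
example_row = {
    "task_id": "000",
    "task_description": "Example task",
    "tools": "[]",
    "database": '{"users": {}}',
    "golden_trajectory": "{}",
    "pass_rate": 0.4,
}

task = decode_task(example_row)
for column in JSON_COLUMNS:
    print(column, "->", summarize(task[column]))
```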
## Tasks
| Task ID | Pass Rate | Failure Mode |
|---------|-----------|-------------|
| 010 | 0.600 | |
| 015 | 0.200 | |
| 018 | 0.200 | |
| 019 | 0.400 | |
| 027 | 0.200 | |
| 031 | 0.400 | |
| 034 | 0.200 | |
| 039 | 0.400 | |
| 040 | 0.600 | |
| 041 | 0.200 | |
| 042 | 0.200 | |
| 044 | 0.600 | |
| 050 | 0.600 | |
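The pass rates above can be summarized with a few lines of Python. The values below are copied by hand from the table, and the conversion to pass counts assumes the 5 evaluation runs per task stated in the Overview.

```python
from collections import Counter

RUNS_PER_TASK = 5  # from the Overview table

# Pass rates copied from the task table above.
PASS_RATES = {
    "010": 0.6, "015": 0.2, "018": 0.2, "019": 0.4, "027": 0.2,
    "031": 0.4, "034": 0.2, "039": 0.4, "040": 0.6, "041": 0.2,
    "042": 0.2, "044": 0.6, "050": 0.6,
}

# Convert each rate back to a number of successful runs out of 5.
passes = {tid: round(rate * RUNS_PER_TASK) for tid, rate in PASS_RATES.items()}

# How many tasks were passed 1, 2, or 3 times?
distribution = Counter(passes.values())

# Unweighted mean of the listed pass rates.
mean_rate = sum(PASS_RATES.values()) / len(PASS_RATES)

for count in sorted(distribution):
    print(f"{count}/{RUNS_PER_TASK} runs passed: {distribution[count]} tasks")
print(f"mean pass rate over listed tasks: {mean_rate:.3f}")
```

The mean of the thirteen listed values is 4.8 / 13, or about 0.369, which does not match the 0.354 average reported in the Overview table; the card does not explain the difference.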
## Failure Mode Analysis
## Usage
```python
import json

from datasets import load_dataset

ds = load_dataset("vibrantlabsai/tau2-infinity", split="test")

for task in ds:
    print(task["task_id"], task["task_description"][:100])

    # Parse structured fields (stored as JSON strings)
    tools = json.loads(task["tools"])
    database = json.loads(task["database"])
    golden = json.loads(task["golden_trajectory"])
```
## License
Apache 2.0
Provider:
vibrantlabsai



