changdae/tau2-uq-artifacts
Source: Hugging Face | Updated 2026-04-10 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/changdae/tau2-uq-artifacts
Description:
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- uncertainty-quantification
- agent
pretty_name: tau2-bench UQ Artifacts
size_categories:
- 1K<n<10K
configs:
- config_name: trajectories
  data_files:
  - split: gpt41_kimik25_airline
    path: trajectories/gpt41_kimik25_airline.json
  - split: gpt41_kimik25_retail
    path: trajectories/gpt41_kimik25_retail.json
  - split: gpt41_kimik25_telecom
    path: trajectories/gpt41_kimik25_telecom.json
  - split: kimik25_airline
    path: trajectories/Kimi-K2.5_airline.json
  - split: kimik25_retail
    path: trajectories/Kimi-K2.5_retail.json
  - split: kimik25_telecom
    path: trajectories/Kimi-K2.5_telecom.json
- config_name: logprobs_gpt41_kimik25
  data_files:
  - split: airline
    path: logprobs/gpt41_kimik25/logprobs_airline_*.jsonl
  - split: retail
    path: logprobs/gpt41_kimik25/logprobs_retail_*.jsonl
  - split: telecom
    path: logprobs/gpt41_kimik25/logprobs_telecom_*.jsonl
- config_name: logprobs_kimik25
  data_files:
  - split: airline
    path: logprobs/Kimi-K2.5/logprobs_airline_*.jsonl
  - split: retail
    path: logprobs/Kimi-K2.5/logprobs_retail_*.jsonl
  - split: telecom
    path: logprobs/Kimi-K2.5/logprobs_telecom_*.jsonl
---
# tau2-bench UQ Artifacts
This repository provides interaction trajectories and token-level log-probability measurements from evaluations of conversational customer-service agents on [tau2-bench](https://github.com/sierra-research/tau2-bench), collected as part of an uncertainty quantification (UQ) pipeline. The artifacts were used for the analyses in the paper "[Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities](https://arxiv.org/pdf/2602.05073)" and accompany the [agentuq codebase](https://github.com/deeplearning-wisc/agentuq).
## Dataset Overview
This dataset contains two types of artifacts:
1. **Trajectories** -- Full multi-turn agent-user-tool interaction traces with embedded UQ summaries
2. **Token logprobs** -- Per-token log-probabilities and uncertainty measurements for every generation
### Configurations
| Config | Agent LLM | User Simulator LLM | Domains | Description |
|--------|-----------|---------------------|---------|-------------|
| `trajectories` | GPT-4.1 / Kimi-K2.5 | Kimi-K2.5 | airline, retail, telecom | Full simulation results with rewards and UQ summaries |
| `logprobs_gpt41_kimik25` | GPT-4.1 | Kimi-K2.5 | airline, retail, telecom | Token-level logprobs for each generation |
| `logprobs_kimik25` | Kimi-K2.5 | Kimi-K2.5 | airline, retail, telecom | Token-level logprobs for each generation |
### Statistics
| Config | Domain | Tasks | Trajectories | File Size |
|--------|--------|-------|-------------|-----------|
| trajectories / gpt41_kimik25 | airline | 50 | 50 | 3.2 MB |
| trajectories / gpt41_kimik25 | retail | 114 | 114 | 7.6 MB |
| trajectories / gpt41_kimik25 | telecom | 114 | 114 | 14 MB |
| trajectories / kimik25 | airline | 50 | 50 | 3.6 MB |
| trajectories / kimik25 | retail | 114 | 114 | 8.1 MB |
| trajectories / kimik25 | telecom | 114 | 114 | 15 MB |
| logprobs / gpt41_kimik25 | all | 278 | 278 | 783 MB |
| logprobs / kimik25 | all | 278 | 278 | 1.2 GB |
## Data Format
### Trajectories
Each trajectory JSON file contains:
```json
{
  "timestamp": "...",
  "info": {
    "num_trials": 1,
    "max_steps": 200,
    "user_info": {"implementation": "user_simulator", "llm": "azure/Kimi-K2.5", ...},
    "agent_info": {"implementation": "llm_agent", "llm": "azure/gpt-4.1", ...}
  },
  "tasks": [...],
  "simulations": [
    {
      "id": "...",
      "task_id": "0",
      "messages": [...],
      "reward_info": {"reward": 1.0, ...},
      "uq_summary": {
        "assistant": {
          "total_tokens": 304,
          "trajectory_nll": 50.02,
          "avg_token_nll": 0.165,
          "mean_topk_entropy": 0.317,
          "min_chosen_prob": 0.298
        },
        "user": {...},
        "combined": {...}
      }
    }
  ]
}
```
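The sample values above suggest that `avg_token_nll` equals `trajectory_nll` divided by `total_tokens` (50.02 / 304 ≈ 0.165). This relation is inferred from the example rather than documented, so the sketch below treats it as an assumption and checks it on a summary dictionary. The helper name `check_avg_nll` is hypothetical.

```python
import math


def check_avg_nll(summary: dict, rel_tol: float = 1e-2) -> bool:
    """Return True if avg_token_nll is close to trajectory_nll / total_tokens.

    Assumption: the average is a plain per-token mean of the trajectory NLL.
    """
    expected = summary["trajectory_nll"] / summary["total_tokens"]
    return math.isclose(summary["avg_token_nll"], expected, rel_tol=rel_tol)


# Values copied from the "assistant" summary in the example above.
example = {
    "total_tokens": 304,
    "trajectory_nll": 50.02,
    "avg_token_nll": 0.165,
}

print(check_avg_nll(example))
```

A loose relative tolerance is used because the displayed values are rounded.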
### Token Logprobs (JSONL)
Each line in a logprobs JSONL file is one token record:
```json
{
  "model": "gpt-4.1",
  "role": "assistant",
  "turn_idx": 0,
  "token_idx": 15,
  "token": " help",
  "chosen_logprob": -0.023,
  "chosen_prob": 0.977,
  "topk_tokens": ["help", "assist", ...],
  "topk_logprobs": [-0.023, -4.12, ...],
  "topk_entropy": 0.18,
  "topk_mass": 0.995,
  "domain": "airline",
  "task_id": "0",
  "trial": 0,
  "seed": 626729
}
```
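The per-token fields are related to one another: in the example record, `chosen_prob` matches `exp(chosen_logprob)`, which implies natural-log probabilities. The definitions of `topk_mass` and `topk_entropy` are not spelled out here, so the following sketch assumes that `topk_mass` is the summed probability of the top-k candidates and that `topk_entropy` is the Shannon entropy (in nats) of the top-k distribution after renormalization. The function name `topk_stats` and the two-entry list are illustrative only.

```python
import math


def topk_stats(topk_logprobs):
    """Return (mass, entropy) for a list of top-k natural-log probabilities.

    Assumed definitions (not confirmed by the dataset card):
      mass    = sum of the top-k probabilities
      entropy = Shannon entropy, in nats, of the renormalized top-k distribution
    """
    probs = [math.exp(lp) for lp in topk_logprobs]
    mass = sum(probs)
    normalized = [p / mass for p in probs]
    entropy = -sum(p * math.log(p) for p in normalized if p > 0.0)
    return mass, entropy


# Hypothetical truncated record: the real top-k list is longer.
record = {"chosen_logprob": -0.023, "chosen_prob": 0.977, "topk_logprobs": [-0.023, -4.12]}

# chosen_prob should agree with exp(chosen_logprob) up to rounding.
assert abs(math.exp(record["chosen_logprob"]) - record["chosen_prob"]) < 1e-3

mass, entropy = topk_stats(record["topk_logprobs"])
print(f"mass={mass:.3f}, entropy={entropy:.3f}")
```

Because the list is truncated, the printed values are not expected to reproduce the stored `topk_mass` and `topk_entropy`.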
## Usage
### Download specific files
```python
from huggingface_hub import hf_hub_download

# Download a trajectory file
path = hf_hub_download(
    repo_id="changdae/tau2-uq-artifacts",
    filename="trajectories/gpt41_kimik25_airline.json",
    repo_type="dataset",
)
```
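To fetch every trajectory file rather than a single one, the file names can be generated from the two agent prefixes and three domains listed in the `trajectories` config. The sketch below is one way to do this; the helper names `trajectory_filenames` and `download_all` are hypothetical, and `huggingface_hub` is imported lazily so the name list can be built without it installed.

```python
# Prefixes and domains taken from the `trajectories` config paths.
AGENT_PREFIXES = ("gpt41_kimik25", "Kimi-K2.5")
DOMAINS = ("airline", "retail", "telecom")


def trajectory_filenames():
    """List the six trajectory file paths inside the dataset repository."""
    return [
        f"trajectories/{prefix}_{domain}.json"
        for prefix in AGENT_PREFIXES
        for domain in DOMAINS
    ]


def download_all(repo_id="changdae/tau2-uq-artifacts"):
    """Download every trajectory file and map repo path -> local path."""
    from huggingface_hub import hf_hub_download  # requires network access

    return {
        name: hf_hub_download(repo_id=repo_id, filename=name, repo_type="dataset")
        for name in trajectory_filenames()
    }


print(trajectory_filenames())
```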
### Load logprobs as a dataset
```python
from datasets import load_dataset

# Load all airline logprobs for GPT-4.1 + Kimi-K2.5
ds = load_dataset(
    "changdae/tau2-uq-artifacts",
    name="logprobs_gpt41_kimik25",
    split="airline",
)
```
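Once token records are loaded, they can be aggregated into per-trajectory statistics. The sketch below accepts any iterable of dictionaries with the fields shown in the token record format and sums the negative log-probabilities per `(task_id, trial)` pair. Whether the result matches the embedded `uq_summary` exactly is an assumption, and `summarize_tokens` is a hypothetical helper name. The sample records are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def summarize_tokens(records, role="assistant"):
    """Aggregate token records into per-(task_id, trial) statistics."""
    acc = defaultdict(
        lambda: {"total_tokens": 0, "trajectory_nll": 0.0, "min_chosen_prob": 1.0}
    )
    for rec in records:
        if rec["role"] != role:
            continue  # keep only tokens generated by the requested role
        stats = acc[(rec["task_id"], rec["trial"])]
        stats["total_tokens"] += 1
        stats["trajectory_nll"] -= rec["chosen_logprob"]  # NLL = -log p
        stats["min_chosen_prob"] = min(stats["min_chosen_prob"], rec["chosen_prob"])
    for stats in acc.values():
        stats["avg_token_nll"] = stats["trajectory_nll"] / stats["total_tokens"]
    return dict(acc)


# Illustrative records only (not real data).
sample = [
    {"role": "assistant", "task_id": "0", "trial": 0, "chosen_logprob": -0.5, "chosen_prob": 0.61},
    {"role": "assistant", "task_id": "0", "trial": 0, "chosen_logprob": -1.5, "chosen_prob": 0.22},
    {"role": "user", "task_id": "0", "trial": 0, "chosen_logprob": -3.0, "chosen_prob": 0.05},
]

for key, stats in summarize_tokens(sample).items():
    print(key, stats)
```

Passing `role="user"` gives the same statistics for the user simulator's tokens.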
### Load trajectories
```python
import json

# `path` is the local file path returned by hf_hub_download
with open(path) as f:
    data = json.load(f)

# Print the reward and the agent's average token NLL for each simulation
for sim in data["simulations"]:
    task_id = sim["task_id"]
    reward = sim["reward_info"]["reward"]
    avg_nll = sim["uq_summary"]["assistant"]["avg_token_nll"]
    print(f"Task {task_id}: reward={reward}, avg_token_nll={avg_nll:.4f}")
```
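A common first analysis is to compare the agent's uncertainty on successful and failed tasks. The sketch below groups simulations by outcome and averages `avg_token_nll` in each group. Treating `reward >= 1.0` as success is an assumption, `mean_nll_by_outcome` is a hypothetical helper name, and the toy simulations are illustrative values rather than dataset contents.

```python
from statistics import mean


def mean_nll_by_outcome(simulations, threshold=1.0):
    """Average the assistant's avg_token_nll over successful and failed runs."""
    groups = {"success": [], "failure": []}
    for sim in simulations:
        nll = sim["uq_summary"]["assistant"]["avg_token_nll"]
        outcome = "success" if sim["reward_info"]["reward"] >= threshold else "failure"
        groups[outcome].append(nll)
    # An empty group yields None instead of raising StatisticsError.
    return {name: (mean(vals) if vals else None) for name, vals in groups.items()}


# Toy input with the same nesting as data["simulations"] (values are made up).
toy_simulations = [
    {"reward_info": {"reward": 1.0}, "uq_summary": {"assistant": {"avg_token_nll": 0.10}}},
    {"reward_info": {"reward": 1.0}, "uq_summary": {"assistant": {"avg_token_nll": 0.20}}},
    {"reward_info": {"reward": 0.0}, "uq_summary": {"assistant": {"avg_token_nll": 0.40}}},
]

print(mean_nll_by_outcome(toy_simulations))
```

With real data, `data["simulations"]` from a loaded trajectory file can be passed in place of `toy_simulations`.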
## Citation
If you use these artifacts, please consider citing our agent UQ position paper as well as the original tau2-bench paper:
```bibtex
@article{oh2026uncertainty,
title={Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities},
author={Oh, Changdae and Park, Seongheon and Kim, To Eun and Li, Jiatong and Li, Wendi and Yeh, Samuel and Du, Xuefeng and Hassani, Hamed and Bogdan, Paul and Song, Dawn and others},
journal={arXiv preprint arXiv:2602.05073},
year={2026}
}
```
```bibtex
@article{barres2025tau,
title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
author={Barres, Victor and Dong, Honghua and Ray, Soham and Si, Xujie and Narasimhan, Karthik},
journal={arXiv preprint arXiv:2506.07982},
year={2025}
}
```
## License
MIT
Provided by:
changdae



