
changdae/tau2-uq-artifacts

Source: Hugging Face. Updated 2026-04-10. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/changdae/tau2-uq-artifacts
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- uncertainty-quantification
- agent
pretty_name: tau2-bench UQ Artifacts
size_categories:
- 1K<n<10K
configs:
- config_name: trajectories
  data_files:
  - split: gpt41_kimik25_airline
    path: trajectories/gpt41_kimik25_airline.json
  - split: gpt41_kimik25_retail
    path: trajectories/gpt41_kimik25_retail.json
  - split: gpt41_kimik25_telecom
    path: trajectories/gpt41_kimik25_telecom.json
  - split: kimik25_airline
    path: trajectories/Kimi-K2.5_airline.json
  - split: kimik25_retail
    path: trajectories/Kimi-K2.5_retail.json
  - split: kimik25_telecom
    path: trajectories/Kimi-K2.5_telecom.json
- config_name: logprobs_gpt41_kimik25
  data_files:
  - split: airline
    path: logprobs/gpt41_kimik25/logprobs_airline_*.jsonl
  - split: retail
    path: logprobs/gpt41_kimik25/logprobs_retail_*.jsonl
  - split: telecom
    path: logprobs/gpt41_kimik25/logprobs_telecom_*.jsonl
- config_name: logprobs_kimik25
  data_files:
  - split: airline
    path: logprobs/Kimi-K2.5/logprobs_airline_*.jsonl
  - split: retail
    path: logprobs/Kimi-K2.5/logprobs_retail_*.jsonl
  - split: telecom
    path: logprobs/Kimi-K2.5/logprobs_telecom_*.jsonl
---

# tau2-bench UQ Artifacts

Interaction trajectories and token-level log-probability measurements from conversational customer service agent evaluations on [tau2-bench](https://github.com/sierra-research/tau2-bench), collected as part of the uncertainty quantification (UQ) pipeline.

Used for analyses in the paper "[Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities](https://arxiv.org/pdf/2602.05073)" under the [agentuq codebase](https://github.com/deeplearning-wisc/agentuq).

## Dataset Overview

This dataset contains two types of artifacts:

1. **Trajectories** -- Full multi-turn agent-user-tool interaction traces with embedded UQ summaries
2. **Token logprobs** -- Per-token log-probabilities and uncertainty measurements for every generation

### Configurations

| Config | Agent LLM | User Simulator LLM | Domains | Description |
|--------|-----------|---------------------|---------|-------------|
| `trajectories` | GPT-4.1 / Kimi-K2.5 | Kimi-K2.5 | airline, retail, telecom | Full simulation results with rewards and UQ summaries |
| `logprobs_gpt41_kimik25` | GPT-4.1 | Kimi-K2.5 | airline, retail, telecom | Token-level logprobs for each generation |
| `logprobs_kimik25` | Kimi-K2.5 | Kimi-K2.5 | airline, retail, telecom | Token-level logprobs for each generation |

### Statistics

| Config | Domain | Tasks | Trajectories | File Size |
|--------|--------|-------|-------------|-----------|
| trajectories / gpt41_kimik25 | airline | 50 | 50 | 3.2 MB |
| trajectories / gpt41_kimik25 | retail | 114 | 114 | 7.6 MB |
| trajectories / gpt41_kimik25 | telecom | 114 | 114 | 14 MB |
| trajectories / kimik25 | airline | 50 | 50 | 3.6 MB |
| trajectories / kimik25 | retail | 114 | 114 | 8.1 MB |
| trajectories / kimik25 | telecom | 114 | 114 | 15 MB |
| logprobs / gpt41_kimik25 | all | 278 | 278 | 783 MB |
| logprobs / kimik25 | all | 278 | 278 | 1.2 GB |

## Data Format

### Trajectories

Each trajectory JSON file contains:

```json
{
  "timestamp": "...",
  "info": {
    "num_trials": 1,
    "max_steps": 200,
    "user_info": {"implementation": "user_simulator", "llm": "azure/Kimi-K2.5", ...},
    "agent_info": {"implementation": "llm_agent", "llm": "azure/gpt-4.1", ...}
  },
  "tasks": [...],
  "simulations": [
    {
      "id": "...",
      "task_id": "0",
      "messages": [...],
      "reward_info": {"reward": 1.0, ...},
      "uq_summary": {
        "assistant": {
          "total_tokens": 304,
          "trajectory_nll": 50.02,
          "avg_token_nll": 0.165,
          "mean_topk_entropy": 0.317,
          "min_chosen_prob": 0.298
        },
        "user": {...},
        "combined": {...}
      }
    }
  ]
}
```

### Token Logprobs (JSONL)

Each line in a logprobs JSONL file is one token record:

```json
{
  "model": "gpt-4.1",
  "role": "assistant",
  "turn_idx": 0,
  "token_idx": 15,
  "token": " help",
  "chosen_logprob": -0.023,
  "chosen_prob": 0.977,
  "topk_tokens": ["help", "assist", ...],
  "topk_logprobs": [-0.023, -4.12, ...],
  "topk_entropy": 0.18,
  "topk_mass": 0.995,
  "domain": "airline",
  "task_id": "0",
  "trial": 0,
  "seed": 626729
}
```

## Usage

### Download specific files

```python
from huggingface_hub import hf_hub_download

# Download a trajectory file
path = hf_hub_download(
    repo_id="changdae/tau2-uq-artifacts",
    filename="trajectories/gpt41_kimik25_airline.json",
    repo_type="dataset",
)
```

### Load logprobs as a dataset

```python
from datasets import load_dataset

# Load all airline logprobs for GPT-4.1 + Kimi-K2.5
ds = load_dataset(
    "changdae/tau2-uq-artifacts",
    name="logprobs_gpt41_kimik25",
    split="airline",
)
```

### Load trajectories

```python
import json

# After downloading: `path` is the local file path returned by hf_hub_download
with open(path) as f:
    data = json.load(f)

for sim in data["simulations"]:
    task_id = sim["task_id"]
    reward = sim["reward_info"]["reward"]
    avg_nll = sim["uq_summary"]["assistant"]["avg_token_nll"]
    print(f"Task {task_id}: reward={reward}, avg_token_nll={avg_nll:.4f}")
```

## Citation

If you use these artifacts, please consider citing our agent UQ position paper as well as the original tau2-bench paper:

```bibtex
@article{oh2026uncertainty,
  title={Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities},
  author={Oh, Changdae and Park, Seongheon and Kim, To Eun and Li, Jiatong and Li, Wendi and Yeh, Samuel and Du, Xuefeng and Hassani, Hamed and Bogdan, Paul and Song, Dawn and others},
  journal={arXiv preprint arXiv:2602.05073},
  year={2026}
}
```

```bibtex
@article{barres2025tau,
  title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
  author={Barres, Victor and Dong, Honghua and Ray, Soham and Si, Xujie and Narasimhan, Karthik},
  journal={arXiv preprint arXiv:2506.07982},
  year={2025}
}
```

## License

MIT
Provider:
changdae
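
## Example: Recomputing UQ summaries from token records

The dataset card describes per-token records (`chosen_logprob`, `chosen_prob`, `topk_entropy`) and per-trajectory `uq_summary` fields (`total_tokens`, `trajectory_nll`, `avg_token_nll`, `mean_topk_entropy`, `min_chosen_prob`), but it does not spell out how the second is derived from the first. The sketch below shows one plausible aggregation, assuming that `trajectory_nll` is the sum of negative chosen log-probabilities over all tokens of one role in one trajectory and that the other fields are the corresponding mean and minimum. The sample values in the card fit this reading (50.02 / 304 ≈ 0.165), but the exact definitions are an assumption here, not something the card confirms. The function names `read_jsonl`, `summarize_tokens`, and `summarize_by_task` are hypothetical helpers, not part of the agentuq codebase.

```python
import json
import math
from collections import defaultdict


def read_jsonl(path):
    """Yield one token record (a dict) per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def summarize_tokens(records):
    """Aggregate token records into a dict shaped like one uq_summary entry.

    Assumed definitions (not confirmed by the dataset card):
      trajectory_nll    = -sum(chosen_logprob)
      avg_token_nll     = trajectory_nll / total_tokens
      mean_topk_entropy = mean(topk_entropy)
      min_chosen_prob   = min(chosen_prob)
    """
    records = list(records)
    if not records:
        raise ValueError("no token records to summarize")
    n = len(records)
    nll = -sum(r["chosen_logprob"] for r in records)
    return {
        "total_tokens": n,
        "trajectory_nll": nll,
        "avg_token_nll": nll / n,
        "mean_topk_entropy": sum(r["topk_entropy"] for r in records) / n,
        "min_chosen_prob": min(r["chosen_prob"] for r in records),
    }


def summarize_by_task(records, role="assistant"):
    """Group records of one role by (domain, task_id, trial) and summarize each group."""
    groups = defaultdict(list)
    for r in records:
        if r["role"] == role:
            groups[(r["domain"], r["task_id"], r["trial"])].append(r)
    return {key: summarize_tokens(group) for key, group in groups.items()}


# Toy records with made-up values (not taken from the dataset).
toy_records = [
    {"role": "assistant", "domain": "airline", "task_id": "0", "trial": 0,
     "chosen_logprob": math.log(0.5), "chosen_prob": 0.5, "topk_entropy": 0.6},
    {"role": "assistant", "domain": "airline", "task_id": "0", "trial": 0,
     "chosen_logprob": math.log(0.25), "chosen_prob": 0.25, "topk_entropy": 1.0},
    {"role": "user", "domain": "airline", "task_id": "0", "trial": 0,
     "chosen_logprob": math.log(0.9), "chosen_prob": 0.9, "topk_entropy": 0.2},
]

for key, summary in summarize_by_task(toy_records).items():
    print(key, summary)
```

On real data, `summarize_by_task(read_jsonl(local_path))` could be run on a downloaded logprobs file and compared against the `uq_summary` values stored in the matching trajectory file. Any mismatch would indicate that the assumed definitions differ from the ones the authors used.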