debdootmiitd/tau3-bench-qwen3.6-35b-a3b-v0
Hugging Face · Updated 2026-04-26 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/debdootmiitd/tau3-bench-qwen3.6-35b-a3b-v0
Resource summary:
---
license: apache-2.0
language:
- en
tags:
- tau-bench
- tau2-bench
- tau3-bench
- agent-evaluation
- customer-service
pretty_name: τ³-bench Phase 0 baseline — Qwen3.6-35B-A3B (V0)
size_categories:
- 1K<n<10K
---
# τ³-bench Phase 0 baseline — Qwen3.6-35B-A3B (V0)
Trajectory data and operational artifacts for the **Comvera Phase 0 baseline** of
`Qwen/Qwen3.6-35B-A3B` on
[`sierra-research/tau2-bench`](https://github.com/sierra-research/tau2-bench)
(commit `3b005ddb...`, equivalent to τ³-bench v1.0.0).
Companion to leaderboard PR
[sierra-research/tau2-bench#267](https://github.com/sierra-research/tau2-bench/pull/267).
## Dataset layout
```
trajectories/                    ← OFFICIAL SUBMISSION DATA (Config A, thinking on)
├── airline_results.json           50 tasks × 4 trials = 200 sims
├── retail_results.json            114 × 4 = 456 sims
└── telecom_results.json           114 × 4 = 456 sims
config_b_trajectories/             Config B (thinking DISABLED) for comparison
├── airline_results.json
├── retail_results.json
└── telecom_results.json
reviewed/                          LLM-judge auto-error-identification output
├── config_a_*_reviewed.json       (judge: gpt-4.1, identifies fault types per turn)
└── config_b_*_reviewed.json
logs/                              Run-time logs (vLLM serve, tau2 run, tau2 review,
                                   auto-resume retries, OpenAI quota incident)
reports/
├── phase0_results.md              Final report: methodology, all 18 baseline cells,
│                                  fault breakdown, reproducibility
└── progress.md                    Chronological journal of decisions, deviations,
                                   and incidents during the Phase 0 run
submission_package/
└── submission.json                Schema-valid leaderboard submission JSON
                                   (also lives at sierra-research/tau2-bench#267)
```
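The layout above maps directly onto relative file paths inside the dataset repository. The following is a minimal sketch of fetching one results file with `huggingface_hub`; the helper names `results_path` and `fetch` are hypothetical, and the repository ID is taken from the download link rather than verified against the Hub.

```python
# Sketch only: helper names are illustrative, not part of the dataset or tau2.
REPO_ID = "debdootmiitd/tau3-bench-qwen3.6-35b-a3b-v0"  # assumed from the download link
DOMAINS = ("airline", "retail", "telecom")
CONFIG_DIRS = {"a": "trajectories", "b": "config_b_trajectories"}


def results_path(domain: str, config: str = "a") -> str:
    """Return the repo-relative path of a results file, per the layout above."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    if config not in CONFIG_DIRS:
        raise ValueError(f"unknown config: {config!r}")
    return f"{CONFIG_DIRS[config]}/{domain}_results.json"


def fetch(domain: str, config: str = "a") -> str:
    """Download one results file and return its local cache path."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=results_path(domain, config),
        repo_type="dataset",
    )
```

For example, `fetch("airline")` would download the Config A airline trajectories, and `fetch("telecom", "b")` the Config B telecom ones.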
## Configuration (Config A — the headline submission)
| Field | Value |
|---|---|
| Model | `Qwen/Qwen3.6-35B-A3B` (HF SHA `995ad96e`) |
| Serving | vLLM 0.19.1, `--reasoning-parser qwen3 --tool-call-parser qwen3_xml --enable-auto-tool-choice` |
| Agent llm_args | `{"temperature": 0.6}` (default Qwen3.5 chat template, thinking-mode available) |
| User simulator | `gpt-4.1-2025-04-14`, temperature 0.0 |
| Trials per task | 4 |
| Task split | `base` (default) |
| Domains | airline (50), retail (114), telecom (114) |
| Banking_knowledge | not run (deferred) |
| Scaffold | default tau-bench scaffold; no fine-tuning, no prompt edits |
## Headline metrics
### Config A (thinking)
| Domain | Pass^1 | Pass^2 | Pass^3 | Pass^4 |
|---|---:|---:|---:|---:|
| airline | 0.810 | 0.743 | 0.705 | **0.680** |
| retail | 0.833 | 0.746 | 0.682 | **0.632** |
| telecom | 0.993 | 0.987 | 0.980 | **0.974** |
### Config B (no-thinking)
| Domain | Pass^1 | Pass^2 | Pass^3 | Pass^4 |
|---|---:|---:|---:|---:|
| airline | 0.685 | 0.570 | 0.495 | **0.440** |
| retail | 0.805 | 0.713 | 0.660 | **0.623** |
| telecom | 0.998 | 0.996 | 0.993 | **0.991** |
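To make the Pass^k columns above concrete, here is a minimal sketch of the estimator, assuming the tables use the standard τ-bench definition: for a task with `c` successes out of `n` trials, the per-task value is C(c, k) / C(n, k), averaged over tasks. The function name and input format are illustrative.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average over tasks of C(c, k) / C(n, k), where c is the task's success count."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    # comb(c, k) is 0 whenever c < k, so tasks with too few successes contribute 0.
    per_task = [comb(c, k) / comb(num_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)
```

With 4 trials per task, Pass^4 reduces to the fraction of tasks solved in all four trials, and Pass^1 is the plain per-simulation success rate. Under that reading, the airline Config A value of 0.680 corresponds to 34 of 50 tasks.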
## Telecom caveat (read this if comparing across submissions)
82% of telecom tasks (94/114) have `reward_basis=('ENV_ASSERTION',)` only: the
evaluation checks the device's end state but not whether the *agent* prescribed
the fix sequence. The strict subset (the 20 tasks where
`reward_basis=('ENV_ASSERTION', 'ACTION')`) gives a more conservative measure of
agent quality: **Config A pass^4 = 0.850**. The reported headline of 0.974
follows benchmark convention and covers all 114 tasks.
This is a property of τ³-bench v1.0's task definitions, not of our evaluation
pipeline. See `reports/phase0_results.md` for the full diagnostic.
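The strict-subset figure can be recomputed along the following lines. This is a sketch under assumptions: it presumes each results file exposes a `tasks` list (with `id` and `evaluation_criteria.reward_basis`) and a `simulations` list (with `task_id` and `reward_info.reward`). Those field names are assumptions about the tau2 results schema and should be checked against the actual JSON.

```python
from collections import defaultdict

STRICT_BASIS = {"ENV_ASSERTION", "ACTION"}


def strict_task_ids(tasks: list[dict]) -> set[str]:
    """IDs of tasks whose reward basis is exactly ENV_ASSERTION plus ACTION."""
    return {
        t["id"]
        for t in tasks
        if set(t["evaluation_criteria"]["reward_basis"]) == STRICT_BASIS
    }


def pass_all_trials(simulations: list[dict], task_ids: set[str], num_trials: int = 4) -> float:
    """Fraction of the given tasks that succeeded in every one of num_trials trials."""
    if not task_ids:
        raise ValueError("task_ids is empty")
    wins: dict[str, int] = defaultdict(int)
    for sim in simulations:
        # A reward of 1.0 (within tolerance) is treated as success.
        if sim["task_id"] in task_ids and sim["reward_info"]["reward"] >= 1.0 - 1e-6:
            wins[sim["task_id"]] += 1
    return sum(wins[t] == num_trials for t in task_ids) / len(task_ids)
```

Given a parsed results dict `data`, the strict pass^4 would be `pass_all_trials(data["simulations"], strict_task_ids(data["tasks"]))`.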
## Reproducing
The full code (vLLM launch, tau2 run wrappers, review wrappers, master script)
lives in the companion repository at https://github.com/debdootiitd/tau-bench-phase0.
To reproduce the headline airline run:
```bash
tau2 run --domain airline --agent-llm hosted_vllm/Qwen3.6-35B-A3B \
  --agent-llm-args '{"temperature": 0.6}' \
  --user-llm gpt-4.1 --user-llm-args '{"temperature": 0.0}' \
  --num-trials 4 --max-concurrency 4 --auto-resume --save-to phase0_config_a_airline
# Repeat for retail and telecom, then run `tau2 review` on each results.json for fault analysis.
```
Wall time: ~3 h on 1× H200 (3 domains in parallel, concurrency 4 each).
OpenAI spend (user simulator plus auto-review): ~$90 across all 12 cells.
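Since the same command is repeated per domain, the three invocations can be generated programmatically. The sketch below only assembles the argument lists using the flags shown in the command above; the `--save-to` names for retail and telecom are an assumption extrapolated from the airline example, and `tau2_run_argv` is a hypothetical helper.

```python
import json
import shlex

DOMAINS = ("airline", "retail", "telecom")


def tau2_run_argv(domain: str) -> list[str]:
    """Build the `tau2 run` argument list for one domain (Config A settings)."""
    return [
        "tau2", "run",
        "--domain", domain,
        "--agent-llm", "hosted_vllm/Qwen3.6-35B-A3B",
        "--agent-llm-args", json.dumps({"temperature": 0.6}),
        "--user-llm", "gpt-4.1",
        "--user-llm-args", json.dumps({"temperature": 0.0}),
        "--num-trials", "4",
        "--max-concurrency", "4",
        "--auto-resume",
        # Assumed naming pattern, following phase0_config_a_airline.
        "--save-to", f"phase0_config_a_{domain}",
    ]


# Print shell-quoted commands; pass the list to subprocess.run to execute instead.
for domain in DOMAINS:
    print(shlex.join(tau2_run_argv(domain)))
```

Building argument lists rather than shell strings avoids quoting problems with the JSON-valued flags.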
Provider:
debdootmiitd



