
debdootmiitd/tau3-bench-qwen3.6-35b-a3b-v0

Source: Hugging Face · Updated 2026-04-26 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/debdootmiitd/tau3-bench-qwen3.6-35b-a3b-v0
Description:
---
license: apache-2.0
language:
- en
tags:
- tau-bench
- tau2-bench
- tau3-bench
- agent-evaluation
- customer-service
pretty_name: τ³-bench Phase 0 baseline — Qwen3.6-35B-A3B (V0)
size_categories:
- 1K<n<10K
---

# τ³-bench Phase 0 baseline — Qwen3.6-35B-A3B (V0)

Trajectory data + operational artifacts for the **Comvera Phase 0 baseline** of `Qwen/Qwen3.6-35B-A3B` on [`sierra-research/tau2-bench`](https://github.com/sierra-research/tau2-bench) (commit `3b005ddb...`, equivalent to τ³-bench v1.0.0). Companion to leaderboard PR [sierra-research/tau2-bench#267](https://github.com/sierra-research/tau2-bench/pull/267).

## Dataset layout

```
trajectories/                     ← OFFICIAL SUBMISSION DATA (Config A, thinking on)
├── airline_results.json          50 tasks × 4 trials = 200 sims
├── retail_results.json           114 × 4 = 456 sims
└── telecom_results.json          114 × 4 = 456 sims
config_b_trajectories/            Config B (thinking DISABLED) for comparison
├── airline_results.json
├── retail_results.json
└── telecom_results.json
reviewed/                         LLM-judge auto-error-identification output
├── config_a_*_reviewed.json      (judge: gpt-4.1, identifies fault types per turn)
└── config_b_*_reviewed.json
logs/                             Run-time logs (vLLM serve, tau2 run, tau2 review, auto-resume retries, OpenAI quota incident)
reports/
├── phase0_results.md             Final report — methodology, all 18 baseline cells, fault breakdown, reproducibility
└── progress.md                   Chronological journal of decisions, deviations, and incidents during the Phase 0 run
submission_package/
└── submission.json               Schema-valid leaderboard submission JSON (also lives at sierra-research/tau2-bench#267)
```

## Configuration (Config A — the headline submission)

| Field | Value |
|---|---|
| Model | `Qwen/Qwen3.6-35B-A3B` (HF SHA `995ad96e`) |
| Serving | vLLM 0.19.1, `--reasoning-parser qwen3 --tool-call-parser qwen3_xml --enable-auto-tool-choice` |
| Agent llm_args | `{"temperature": 0.6}` (default Qwen3.5 chat template, thinking-mode available) |
| User simulator | `gpt-4.1-2025-04-14`, temperature 0.0 |
| Trials per task | 4 |
| Task split | `base` (default) |
| Domains | airline (50), retail (114), telecom (114) |
| Banking_knowledge | not run (deferred) |
| Scaffold | default tau-bench scaffold; no fine-tuning, no prompt edits |

## Headline metrics

### Config A (thinking)

| Domain | Pass^1 | Pass^2 | Pass^3 | Pass^4 |
|---|---:|---:|---:|---:|
| airline | 0.810 | 0.743 | 0.705 | **0.680** |
| retail | 0.833 | 0.746 | 0.682 | **0.632** |
| telecom | 0.993 | 0.987 | 0.980 | **0.974** |

### Config B (no-thinking)

| Domain | Pass^1 | Pass^2 | Pass^3 | Pass^4 |
|---|---:|---:|---:|---:|
| airline | 0.685 | 0.570 | 0.495 | **0.440** |
| retail | 0.805 | 0.713 | 0.660 | **0.623** |
| telecom | 0.998 | 0.996 | 0.993 | **0.991** |

## Telecom caveat (read this if comparing across submissions)

82 % of telecom tasks (94/114) have `reward_basis=('ENV_ASSERTION',)` only — the eval checks the device's end state but not whether the *agent* prescribed the fix sequence. The strict subset (20 tasks where `reward_basis=('ENV_ASSERTION', 'ACTION')`) gives a more conservative agent-quality measure: **Config A pass^4 = 0.850**. The reported headline 0.974 matches benchmark convention. This is a property of τ³-bench v1.0's task definitions, not of our evaluation pipeline. See `reports/phase0_results.md` for the full diagnostic.

## Reproducing

The full code (vLLM launch, tau2 run wrappers, review wrappers, master script) lives at https://github.com/debdootiitd/tau-bench-phase0 (companion repository). Quick reproduction headline:

```bash
tau2 run --domain airline --agent-llm hosted_vllm/Qwen3.6-35B-A3B \
  --agent-llm-args '{"temperature": 0.6}' \
  --user-llm gpt-4.1 --user-llm-args '{"temperature": 0.0}' \
  --num-trials 4 --max-concurrency 4 --auto-resume --save-to phase0_config_a_airline
# repeat for retail, telecom; then `tau2 review` each results.json for fault analysis.
```

Wall time ~3h on 1× H200 (3 domains in parallel, concurrency 4 each). OpenAI user-simulator + auto-review spend ~$90 across all 12 cells.
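
The Pass^k columns in the headline tables can be recomputed from per-task success counts. The sketch below assumes the estimator commonly used for τ-bench style pass^k: for each task with `c` successful trials out of `n`, take C(c, k) / C(n, k), then average over tasks. The function name `pass_hat_k` and the example counts are hypothetical and are not taken from the results files. Extracting per-task success counts from the `*_results.json` files is left out because their schema is not described here.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate pass^k: the chance that k trials of the same task all succeed.

    successes_per_task: one integer per task, the number of successful trials.
    num_trials: trials run per task (4 in this dataset).
    k: number of trials that must all succeed, with 1 <= k <= num_trials.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    ways_total = comb(num_trials, k)
    # comb(c, k) is 0 when c < k, so tasks with too few successes contribute 0.
    ways_all_pass = sum(comb(c, k) for c in successes_per_task)
    return ways_all_pass / (ways_total * len(successes_per_task))


if __name__ == "__main__":
    # Hypothetical counts for four tasks with 4 trials each (illustration only).
    example_counts = [4, 4, 3, 0]
    for k in range(1, 5):
        print(f"pass^{k} = {pass_hat_k(example_counts, 4, k):.3f}")
```

Under this estimator, Pass^1 is the plain per-simulation success rate, and Pass^4 with 4 trials is the fraction of tasks solved in every trial, which is why Pass^4 is the strictest column in each table.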
Provider:
debdootmiitd