jimmywang585/triage-bench

Name: jimmywang585/triage-bench
Creator: jimmywang585
Published: 2026-04-05 18:05:20
License: CC-BY-4.0

Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/jimmywang585/triage-bench


Resource description:

---
license: cc-by-4.0
language:
- en
task_categories:
- text-classification
size_categories:
- n<1K
pretty_name: "Who&When Intervention-Priority Benchmark"
tags:
- multi-agent
- error-analysis
- ranking
- llm-as-judge
configs:
- config_name: default
  data_files:
  - split: train
    path: data/traces.parquet
- config_name: pairwise_detail
  data_files:
  - split: train
    path: pairwise_detail/pairwise_detail.parquet
---

# Who&When Intervention-Priority Benchmark

This dataset adds intervention-priority annotations to a primary subset of the public [Who&When Hugging Face dataset](https://huggingface.co/datasets/Kevin355/Who_and_When).

The default view is trace-centric: one row is one failed trace, with the original Who&When trace content plus nested candidate rankings and pairwise comparison summaries.

## Source

- Who&When dataset: <https://huggingface.co/datasets/Kevin355/Who_and_When>
- Who&When paper: <https://arxiv.org/abs/2505.00212>

The public release contains 177 traces:

- 122 `Algorithm-Generated`
- 55 `Hand-Crafted`

## Dataset Format

### `default` config

One row is one trace. The key columns are:

- `trace_id`
- `subset`
- `question`
- `groundtruth_answer`
- `mistake_step`
- `mistake_agent`
- `n_steps`
- `history`
- `admissibility_label`
- `n_candidates`
- `top1_step_id`
- `top1_agent`
- `top1_bt_probability`
- `candidates`
- `pairwise_comparisons`

`history` follows the original Who&When trace format: a list of messages with `content`, `name`, and `role`. Step content is stored in full and is not truncated.

`candidates` is ordered by rank and stores one nested record per candidate repair point. Each nested record includes:

- `step_id`
- `agent`
- `bt_rank`
- `bt_score`
- `bt_ci_low`
- `bt_ci_high`
- `bt_top1_probability`
- `severity_tier`
- `plausibility_consensus`
- per-judge plausibility, recovery-anchor, rationale, and annotation-status fields using full annotator-id prefixes, e.g. `openai_gpt_5_4_plausibility`

`pairwise_comparisons` stores one nested record per canonical candidate pair with:

- `candidate_a_step`
- `candidate_b_step`
- `plurality_winner`
- `winner_consensus`
- `pairwise_agreement`
- per-judge pairwise vote fields using full annotator-id prefixes, e.g. `openai_gpt_5_4_vote`

### `pairwise_detail` config

This supplementary table has one row per canonical pairwise comparison. It preserves the full pairwise rationales and order-swap metadata for researchers who want to inspect judge disagreement or refit ranking models. Per-judge vote, winner, and order-status columns also use full annotator-id prefixes for consistency with the default config.

## Loading

```python
from datasets import load_dataset

traces = load_dataset("jimmywang585/triage-bench")["train"]
trace = traces[0]

print(trace["question"])
print(trace["history"][0]["content"])
print(trace["candidates"][0]["step_id"], trace["candidates"][0]["bt_rank"])

pairwise_detail = load_dataset("jimmywang585/triage-bench", "pairwise_detail")["train"]
print(pairwise_detail[0]["pair_id"])
```

## Citation

If you use these traces, please cite the original Who&When paper:

```bibtex
@article{zhang2025agent,
  title={Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems},
  author={Zhang, Shaokun and Yin, Ming and Zhang, Jieyu and Liu, Jiale and Han, Zhiguang and Zhang, Jingyang and Li, Beibin and Wang, Chi and Wang, Huazheng and Chen, Yiran and others},
  journal={arXiv preprint arXiv:2505.00212},
  year={2025}
}
```

## License

CC-BY-4.0
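
## Example: comparing the top-ranked candidate with the annotated mistake step

The nested `candidates` records described under the `default` config can be explored with a few plain-Python helpers. The sketch below is illustrative only and rests on assumptions the card does not confirm: that a smaller `bt_rank` means a higher priority, that `candidates` arrives as a list of dicts (as the loading snippet's indexing suggests), and that `step_id` and `mistake_step` refer to the same step numbering and can be compared after converting both to strings. The helper names and the example row are hypothetical, not part of the dataset.

```python
from typing import Any, Iterable, Mapping, Optional


def top_candidate(trace: Mapping[str, Any]) -> Optional[Mapping[str, Any]]:
    """Return the candidate with the smallest bt_rank, or None if there are none.

    Assumption: a smaller bt_rank means a higher intervention priority.
    """
    candidates = trace.get("candidates") or []
    if not candidates:
        return None
    return min(candidates, key=lambda c: c["bt_rank"])


def top1_matches_mistake(trace: Mapping[str, Any]) -> bool:
    """Check whether the top-ranked candidate step equals the annotated mistake step.

    Assumption: step_id and mistake_step share one numbering scheme; both are
    compared as stripped strings so that 2 and "2" count as equal.
    """
    top = top_candidate(trace)
    if top is None:
        return False
    return str(top["step_id"]).strip() == str(trace["mistake_step"]).strip()


def agreement_rate(traces: Iterable[Mapping[str, Any]]) -> float:
    """Fraction of traces whose top-ranked candidate is the annotated mistake step."""
    rows = list(traces)
    if not rows:
        return 0.0
    return sum(top1_matches_mistake(t) for t in rows) / len(rows)


# Hypothetical row that mimics the documented columns; values are made up.
example_trace = {
    "trace_id": "hypothetical-0",
    "mistake_step": "2",
    "candidates": [
        {"step_id": 2, "agent": "Planner", "bt_rank": 1},
        {"step_id": 5, "agent": "Executor", "bt_rank": 2},
    ],
}

print(top1_matches_mistake(example_trace))  # True
```

With the dataset loaded as in the Loading section, `agreement_rate(traces)` would give the share of traces where the two annotations coincide. A low value would not by itself indicate an error, since the step that is most worth repairing need not be the step where the mistake was first made.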

