jimmywang585/triage-bench
收藏Hugging Face2026-04-05 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/jimmywang585/triage-bench
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-4.0
language:
- en
task_categories:
- text-classification
size_categories:
- n<1K
pretty_name: "Who&When Intervention-Priority Benchmark"
tags:
- multi-agent
- error-analysis
- ranking
- llm-as-judge
configs:
- config_name: default
data_files:
- split: train
path: data/traces.parquet
- config_name: pairwise_detail
data_files:
- split: train
path: pairwise_detail/pairwise_detail.parquet
---
# Who&When Intervention-Priority Benchmark
This dataset adds intervention-priority annotations to a primary subset of the public [Who&When Hugging Face dataset](https://huggingface.co/datasets/Kevin355/Who_and_When). The default view is trace-centric: one row is one failed trace, with the original Who&When trace content plus nested candidate rankings and pairwise comparison summaries.
## Source
- Who&When dataset: <https://huggingface.co/datasets/Kevin355/Who_and_When>
- Who&When paper: <https://arxiv.org/abs/2505.00212>
The public release contains 177 traces:
- 122 `Algorithm-Generated`
- 55 `Hand-Crafted`
## Dataset Format
### `default` config
One row is one trace. The key columns are:
- `trace_id`
- `subset`
- `question`
- `groundtruth_answer`
- `mistake_step`
- `mistake_agent`
- `n_steps`
- `history`
- `admissibility_label`
- `n_candidates`
- `top1_step_id`
- `top1_agent`
- `top1_bt_probability`
- `candidates`
- `pairwise_comparisons`
`history` follows the original Who&When trace format: a list of messages with `content`, `name`, and `role`. Step content is stored in full and is not truncated.
`candidates` is ordered by rank and stores one nested record per candidate repair point. Each nested record includes:
- `step_id`
- `agent`
- `bt_rank`
- `bt_score`
- `bt_ci_low`
- `bt_ci_high`
- `bt_top1_probability`
- `severity_tier`
- `plausibility_consensus`
- per-judge plausibility, recovery-anchor, rationale, and annotation-status fields using full annotator-id prefixes, e.g. `openai_gpt_5_4_plausibility`
`pairwise_comparisons` stores one nested record per canonical candidate pair with:
- `candidate_a_step`
- `candidate_b_step`
- `plurality_winner`
- `winner_consensus`
- `pairwise_agreement`
- per-judge pairwise vote fields using full annotator-id suffixes, e.g. `openai_gpt_5_4_vote`
### `pairwise_detail` config
This supplementary table has one row per canonical pairwise comparison. It preserves the full pairwise rationales and order-swap metadata for researchers who want to inspect judge disagreement or refit ranking models. Per-judge vote, winner, and order-status columns also use full annotator-id prefixes for consistency with the default config.
## Loading
```python
from datasets import load_dataset
traces = load_dataset("jimmywang585/triage-bench")["train"]
trace = traces[0]
print(trace["question"])
print(trace["history"][0]["content"])
print(trace["candidates"][0]["step_id"], trace["candidates"][0]["bt_rank"])
pairwise_detail = load_dataset("jimmywang585/triage-bench", "pairwise_detail")["train"]
print(pairwise_detail[0]["pair_id"])
```
## Citation
If you use these traces, please cite the original Who&When paper:
```bibtex
@article{zhang2025agent,
title={Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems},
author={Zhang, Shaokun and Yin, Ming and Zhang, Jieyu and Liu, Jiale and Han, Zhiguang and Zhang, Jingyang and Li, Beibin and Wang, Chi and Wang, Huazheng and Chen, Yiran and others},
journal={arXiv preprint arXiv:2505.00212},
year={2025}
}
```
## License
CC-BY-4.0
提供机构:
jimmywang585



