harithoppil/terminal-bench-2-trajectories
Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/harithoppil/terminal-bench-2-trajectories
Resource description:
---
license: apache-2.0
pretty_name: Terminal Bench 2 Leaderboard Trajectories
tags:
- leaderboard
- benchmark
- code
- terminal-bench
size_categories:
- 1K<n<10K
task_categories:
- text-generation
language:
- en
configs:
- config_name: all
  data_files:
  - split: train
    path: data/leaderboard_trajectories.jsonl
- config_name: pass
  data_files:
  - split: train
    path: data/leaderboard_trajectories_pass.jsonl
- config_name: ml
  data_files:
  - split: train
    path: data/leaderboard_trajectories_ml.jsonl
- config_name: ml_pass
  data_files:
  - split: train
    path: data/leaderboard_trajectories_ml_pass.jsonl
dataset_info:
  features:
  - name: task_name
    dtype: string
  - name: model
    dtype: string
  - name: agent
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: reward
    dtype: float64
  - name: elapsed_seconds
    dtype: float64
  splits:
  - name: train
    num_examples: 3723
---
# Terminal-Bench 2.0 Leaderboard Trajectories
Agent trajectories extracted from [Terminal-Bench 2.0](https://terminal-bench.org) leaderboard submissions. Each row pairs a task instruction (`prompt`) with the agent's `response` and a binary `reward` (1.0 = pass, 0.0 = fail), along with the task name, model, agent framework, and elapsed time.
## Models Included
| Model | Trials | Passed |
|-------|--------|--------|
| Claude-Opus-4.6 | 2,213 | 1,537 (69%) |
| Gemini-3.1-Pro-Preview | 445 | 333 (75%) |
| GLM-5 | 445 | 231 (52%) |
| Kimi-k2.5 | 442 | 189 (43%) |
| Claude-Opus-4.5 | 178 | 98 (55%) |
## Splits
| Config | Description | Rows |
|--------|-------------|------|
| `all` | All trajectories | 3,723 |
| `pass` | Only passed (reward=1.0) | 2,388 |
| `ml` | ML/training-related tasks (16 tasks) | 672 |
| `ml_pass` | ML tasks, passed only | 310 |
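A minimal sketch of loading one of these configs with the Hugging Face `datasets` library. It assumes the repository id `harithoppil/terminal-bench-2-trajectories` (taken from the page title) and that `datasets` is installed; `CONFIG_FILES`, `REPO_ID`, and `load_config` are illustrative names, not part of the dataset.

```python
# Config names and data files as declared in the card's YAML header.
CONFIG_FILES = {
    "all": "data/leaderboard_trajectories.jsonl",
    "pass": "data/leaderboard_trajectories_pass.jsonl",
    "ml": "data/leaderboard_trajectories_ml.jsonl",
    "ml_pass": "data/leaderboard_trajectories_ml_pass.jsonl",
}

# Assumed repository id, taken from the dataset page title.
REPO_ID = "harithoppil/terminal-bench-2-trajectories"


def load_config(name: str = "all"):
    """Load the single `train` split of one config (hypothetical helper)."""
    if name not in CONFIG_FILES:
        raise ValueError(
            f"unknown config {name!r}; expected one of {sorted(CONFIG_FILES)}"
        )
    # Imported lazily so the constants above can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split="train")
```

Calling `load_config("ml_pass")` needs network access and should return a `Dataset` whose columns match the Schema section; per the table above, that config is expected to hold 310 rows.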
## ML Tasks
caffe-cifar-10, distribution-search, gpt2-codegolf, hf-model-inference, llm-inference-batching-scheduler, model-extraction-relu-logits, mteb-leaderboard, mteb-retrieve, pytorch-model-cli, pytorch-model-recovery, reshard-c4-data, sam-cell-seg, torch-pipeline-parallelism, torch-tensor-parallelism, train-fasttext, tune-mjcf
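
The `pass`, `ml`, and `ml_pass` configs read as filters over `all`. The sketch below reproduces that filtering on in-memory rows, assuming (this is not confirmed by the card) that `ml` is exactly the rows whose `task_name` is in the list above and that "passed" means `reward == 1.0`. `split_rows` and the sample rows are made up for illustration.

```python
from typing import Iterable

# The 16 ML task names listed in the card.
ML_TASKS = frozenset({
    "caffe-cifar-10", "distribution-search", "gpt2-codegolf",
    "hf-model-inference", "llm-inference-batching-scheduler",
    "model-extraction-relu-logits", "mteb-leaderboard", "mteb-retrieve",
    "pytorch-model-cli", "pytorch-model-recovery", "reshard-c4-data",
    "sam-cell-seg", "torch-pipeline-parallelism", "torch-tensor-parallelism",
    "train-fasttext", "tune-mjcf",
})


def split_rows(rows: Iterable[dict]) -> dict[str, list[dict]]:
    """Partition rows into the four config-like subsets (illustrative helper)."""
    rows = list(rows)
    passed = [r for r in rows if r["reward"] == 1.0]
    ml = [r for r in rows if r["task_name"] in ML_TASKS]
    ml_pass = [r for r in ml if r["reward"] == 1.0]
    return {"all": rows, "pass": passed, "ml": ml, "ml_pass": ml_pass}


# Tiny made-up sample, for illustration only; these are not real dataset rows.
sample = [
    {"task_name": "train-fasttext", "reward": 1.0},
    {"task_name": "train-fasttext", "reward": 0.0},
    {"task_name": "hypothetical-non-ml-task", "reward": 1.0},
]
subsets = split_rows(sample)
print({name: len(rows) for name, rows in subsets.items()})
# prints {'all': 3, 'pass': 2, 'ml': 2, 'ml_pass': 1}
```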
## Schema
- `task_name`: Task identifier (89 unique tasks)
- `model`: Model used (e.g. Claude-Opus-4.6, GLM-5)
- `agent`: Agent framework (Terminus2, Mux, OpenCode, etc.)
- `prompt`: Full task instruction sent to the agent
- `response`: Agent's output (from trajectory.json or stdout.txt)
- `reward`: 1.0 = passed, 0.0 = failed
- `elapsed_seconds`: Time taken to complete the trial, in seconds
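
The schema above can be mirrored as a typed record. The following sketch parses JSON Lines text into such records and computes a per-model pass rate, assuming each line carries exactly the seven fields listed (extra keys would raise a `TypeError`). `Trajectory`, `parse_jsonl`, `pass_rate_by_model`, and the sample rows are hypothetical and do not come from the dataset.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """One row, with fields named as in the Schema section."""
    task_name: str
    model: str
    agent: str
    prompt: str
    response: str
    reward: float
    elapsed_seconds: float


def parse_jsonl(text: str) -> list[Trajectory]:
    """Parse JSON Lines text, one trajectory per non-empty line."""
    return [
        Trajectory(**json.loads(line))
        for line in text.splitlines()
        if line.strip()
    ]


def pass_rate_by_model(rows: list[Trajectory]) -> dict[str, float]:
    """Fraction of rows with reward == 1.0, grouped by model."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for row in rows:
        total[row.model] += 1
        passed[row.model] += int(row.reward == 1.0)
    return {model: passed[model] / total[model] for model in total}


# Made-up rows for illustration only; prompts and responses are placeholders.
base = {"task_name": "train-fasttext", "agent": "Terminus2",
        "prompt": "<task instruction>", "response": "<agent output>"}
sample_jsonl = "\n".join(json.dumps(r) for r in [
    {**base, "model": "GLM-5", "reward": 1.0, "elapsed_seconds": 120.5},
    {**base, "model": "GLM-5", "reward": 0.0, "elapsed_seconds": 300.0},
    {**base, "model": "Kimi-k2.5", "reward": 1.0, "elapsed_seconds": 95.0},
])
rows = parse_jsonl(sample_jsonl)
print(pass_rate_by_model(rows))
# prints {'GLM-5': 0.5, 'Kimi-k2.5': 1.0}
```

Run over the full `all` config, the same aggregation should line up with the "Passed" column in the Models Included table, provided the rows match that table.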
## Citation
```bibtex
@article{merrill2026terminal,
title={Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces},
author={Merrill, Mike A and Shaw, Alexander G and Carlini, Nicholas and others},
journal={arXiv preprint arXiv:2601.11868},
year={2026}
}
```
Provided by:
harithoppil



