harithoppil/terminal-bench-2-trajectories

Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/harithoppil/terminal-bench-2-trajectories
Description:
---
license: apache-2.0
pretty_name: Terminal Bench 2 Leaderboard Trajectories
tags:
- leaderboard
- benchmark
- code
- terminal-bench
size_categories:
- 1K<n<10K
task_categories:
- text-generation
language:
- en
configs:
- config_name: all
  data_files:
  - split: train
    path: data/leaderboard_trajectories.jsonl
- config_name: pass
  data_files:
  - split: train
    path: data/leaderboard_trajectories_pass.jsonl
- config_name: ml
  data_files:
  - split: train
    path: data/leaderboard_trajectories_ml.jsonl
- config_name: ml_pass
  data_files:
  - split: train
    path: data/leaderboard_trajectories_ml_pass.jsonl
dataset_info:
  features:
  - name: task_name
    dtype: string
  - name: model
    dtype: string
  - name: agent
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: reward
    dtype: float64
  - name: elapsed_seconds
    dtype: float64
  splits:
  - name: train
    num_examples: 3723
---

# Terminal-Bench 2.0 Leaderboard Trajectories

Agent trajectories extracted from [Terminal-Bench 2.0](https://terminal-bench.org) leaderboard submissions. Each row contains a prompt (task instruction), the agent's response, and the reward (pass/fail).

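The four configs declared in the card's metadata can probably be loaded with the Hugging Face `datasets` library. The following is a minimal sketch, not an official loader: `CONFIG_FILES`, `REPO_ID`, and `load_config` are hypothetical names introduced here for illustration, and the sketch assumes `datasets` is installed and the repository is reachable.

```python
# Config names and data files as listed in the dataset card's YAML metadata.
CONFIG_FILES = {
    "all": "data/leaderboard_trajectories.jsonl",
    "pass": "data/leaderboard_trajectories_pass.jsonl",
    "ml": "data/leaderboard_trajectories_ml.jsonl",
    "ml_pass": "data/leaderboard_trajectories_ml_pass.jsonl",
}

REPO_ID = "harithoppil/terminal-bench-2-trajectories"


def load_config(config="all"):
    """Load one config's single `train` split (hypothetical helper)."""
    if config not in CONFIG_FILES:
        raise ValueError(
            f"unknown config {config!r}; expected one of {sorted(CONFIG_FILES)}"
        )
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split="train")
```

Calling `load_config("ml_pass")` would then return only the passed ML trajectories; this requires network access the first time it runs.
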

## Models Included

| Model | Trials | Passed |
|-------|--------|--------|
| Claude-Opus-4.6 | 2,213 | 1,537 (69%) |
| Gemini-3.1-Pro-Preview | 445 | 333 (75%) |
| GLM-5 | 445 | 231 (52%) |
| Kimi-k2.5 | 442 | 189 (43%) |
| Claude-Opus-4.5 | 178 | 98 (55%) |

## Splits

| Config | Description | Rows |
|--------|-------------|------|
| `all` | All trajectories | 3,723 |
| `pass` | Only passed (reward=1.0) | 2,388 |
| `ml` | ML/training-related tasks (16 tasks) | 672 |
| `ml_pass` | ML tasks, passed only | 310 |

## ML Tasks

caffe-cifar-10, distribution-search, gpt2-codegolf, hf-model-inference, llm-inference-batching-scheduler, model-extraction-relu-logits, mteb-leaderboard, mteb-retrieve, pytorch-model-cli, pytorch-model-recovery, reshard-c4-data, sam-cell-seg, torch-pipeline-parallelism, torch-tensor-parallelism, train-fasttext, tune-mjcf

## Schema

- `task_name`: Task identifier (89 unique tasks)
- `model`: Model used (e.g. Claude-Opus-4.6, GLM-5)
- `agent`: Agent framework (Terminus2, Mux, OpenCode, etc.)
- `prompt`: Full task instruction sent to the agent
- `response`: Agent's output (from trajectory.json or stdout.txt)
- `reward`: 1.0 = passed, 0.0 = failed
- `elapsed_seconds`: Time to complete

## Citation

```bibtex
@article{merrill2026terminal,
  title={Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces},
  author={Merrill, Mike A and Shaw, Alexander G and Carlini, Nicholas and others},
  journal={arXiv preprint arXiv:2601.11868},
  year={2026}
}
```
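
## Example: per-model pass rates

The per-model figures in "Models Included" follow from the `model` and `reward` fields described in the schema. The sketch below shows one way such a tally could be computed. `pass_rates` is a hypothetical helper, and the `sample` rows are made-up placeholders, not rows from the dataset.

```python
from collections import defaultdict


def pass_rates(rows):
    """Return {model: (trials, passed, pass_fraction)} for rows with `model` and `reward`."""
    trials = defaultdict(int)
    passed = defaultdict(int)
    for row in rows:
        trials[row["model"]] += 1
        # Per the schema, reward is 1.0 for a pass and 0.0 for a fail.
        if row["reward"] == 1.0:
            passed[row["model"]] += 1
    return {m: (trials[m], passed[m], passed[m] / trials[m]) for m in trials}


# Made-up placeholder rows using the card's field names (not real data).
sample = [
    {"task_name": "task-x", "model": "model-a", "reward": 1.0},
    {"task_name": "task-y", "model": "model-a", "reward": 0.0},
    {"task_name": "task-x", "model": "model-b", "reward": 1.0},
]

for model, (n, k, frac) in sorted(pass_rates(sample).items()):
    print(f"{model}: {k}/{n} passed ({frac:.0%})")
```

Any iterable of dict-like rows works, so rows read from one of the JSONL files or loaded with the `datasets` library could be passed in directly.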
Provider:
harithoppil