XUO/terminal-bench
Source: Hugging Face · Updated 2025-12-19 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/XUO/terminal-bench
Description:
This dataset contains tasks from [Terminal-Bench](https://github.com/laude-institute/terminal-bench), a benchmark for evaluating AI agents in real terminal environments. Each task is packaged as a complete, self-contained archive that preserves the exact directory structure, binary files, Docker configurations, and test scripts needed for faithful reproduction.

The dataset enables researchers to:

- **Reproduce official results** with byte-perfect fidelity
- **Build custom evaluation harnesses** while maintaining compatibility
- **Analyze task characteristics** without extracting archives
- **Benchmark new agents** against the established leaderboard

Each row has the following fields:

- `task_id`: the task ID
- `archive`: gzipped tarball containing the complete task directory
- `task_yaml`: raw YAML content with the task description and configuration
- `difficulty`: easy, medium, or hard
- `tags`: list of strings for categorization
- `author_email`: contact email of the task author
- `category`: primary task category
- `base_description`: main task instruction text
- `max_agent_timeout_sec`: agent execution timeout in seconds
- `max_test_timeout_sec`: test validation timeout in seconds
- `tar_sha256`: SHA-256 hash for integrity verification
- `archive_bytes`: size of the compressed archive in bytes
- `n_files`: number of files in the task directory
- `generated_at`: ISO-8601 timestamp of when the archive was created
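The integrity and inspection fields described above could be used roughly as follows. This is a minimal sketch, not the dataset's official tooling. It assumes that `archive` arrives as raw `bytes`, that `tar_sha256` is the hex digest of the compressed archive bytes (it might instead be computed over the uncompressed tar), and that file counts cover regular files only. The helper names `verify_archive`, `list_task_files`, and `make_demo_row`, along with the synthetic demo task, are hypothetical.

```python
import hashlib
import io
import tarfile


def verify_archive(row: dict) -> bool:
    """Check a row's archive against its recorded hash and size."""
    data = row["archive"]
    # Assumption: tar_sha256 is the hex digest of the compressed bytes.
    digest_ok = hashlib.sha256(data).hexdigest() == row["tar_sha256"]
    size_ok = len(data) == row["archive_bytes"]
    return digest_ok and size_ok


def list_task_files(row: dict) -> list[str]:
    """Return regular-file paths in the archive, read entirely from memory."""
    with tarfile.open(fileobj=io.BytesIO(row["archive"]), mode="r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


def make_demo_row() -> dict:
    """Build a tiny synthetic row (hypothetical contents) for demonstration."""
    files = [
        ("demo-task/task.yaml", b"difficulty: easy\n"),
        ("demo-task/run-tests.sh", b"#!/bin/bash\n"),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, body in files:
            info = tarfile.TarInfo(name)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
    data = buf.getvalue()
    return {
        "task_id": "demo-task",
        "archive": data,
        "tar_sha256": hashlib.sha256(data).hexdigest(),
        "archive_bytes": len(data),
        "n_files": len(files),
    }


if __name__ == "__main__":
    row = make_demo_row()
    print("integrity ok:", verify_archive(row))
    for path in list_task_files(row):
        print(path)
```

Rows loaded from the real dataset (for example with the Hugging Face `datasets` library) could be passed to the same helpers, provided the field types match the assumptions above. Reading the tarball through `io.BytesIO` keeps the task contents in memory, which fits the stated goal of analyzing tasks without extracting archives to disk.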
Provider:
XUO



