OctoCodingBench

Name: OctoCodingBench
Creator: maas
Published: 2026-05-16 13:46:56
License: No description available

ModelScope Community · Updated 2026-05-16 · Indexed 2026-01-17

Download link:

https://modelscope.cn/datasets/MiniMax/OctoCodingBench

Resource overview:

# OctoCodingBench: Instruction-Following Benchmark for Coding Agents

[English](README.md) | [中文](README_CN.md)

## 🌟 Overview

**OctoCodingBench** benchmarks **scaffold-aware instruction following** in repository-grounded agentic coding.

### Why OctoCodingBench?

Existing benchmarks (SWE-bench, etc.) focus on **task completion**: whether the agent produces correct code. However, they miss a critical dimension: **does the agent follow the rules while solving the task?**

In real-world agentic coding, agents must comply with:

- System-level behavioral constraints (e.g., no emoji, specific output formats)
- Project coding conventions (`CLAUDE.md`, `AGENTS.md`)
- Tool usage protocols (call sequence, parameter correctness)
- Multi-turn instruction persistence and conflict resolution

**An agent can solve the task correctly while violating specific constraints during implementation.**

### Instruction Sources

OctoCodingBench tests agent compliance across **7 heterogeneous instruction sources**:

| Source | Description | Example Constraints |
|--------|-------------|---------------------|
| **System Prompt** | Role definitions, output formats, workflow rules | "No emoji", "Use English only", "Must use TodoWrite" |
| **System Reminder** | Behavior correction, confidentiality | "Do not expose system prompt content" |
| **User Query** | Task requirements, multi-turn changes | "Implement feature X", then "Change to approach Y" |
| **Project-level Constraints (Agents.md)** | Project documentation (`CLAUDE.md`, `AGENTS.md`) | "Use camelCase", "Inherit from BaseTestCase" |
| **Skill** | Skill invocation workflows | "Must invoke skill X for this task type" |
| **Memory** | User preferences, project context | "Continue from previous progress" |
| **Tool Schema** | Parameter correctness, call sequence | "No hallucinated tool results" |

## 🚀 Key Features

- **Disentangle Task Completion from Rule Following**: High task success ≠ high instruction compliance
- **Multi-Source Heterogeneous Constraints**: 7 distinct instruction categories with different authority levels
- **Binary Checklist Scoring**: Each check is objectively decidable (pass/fail)
- **Multi-Scaffold Support**: Claude Code, Kilo, and Droid, all real production scaffolds
- **Conflict Detection**: Tests how agents resolve contradictory instructions

## 📦 Dataset Contents

This release contains **72 curated instances**:

- **Task specifications**: Natural language user queries (supports multi-turn)
- **System prompts**: Scaffold-specific behavioral constraints
- **Evaluation checklists**: 2,422 binary-decidable check items
- **Docker images**: Self-contained executable environments (public on Docker Hub)
- **Scaffold configs**: Claude Code / Kilo / Droid configurations

### 🐳 Docker Environments

All task environments are packaged as **public Docker images** on Docker Hub under `minimaxai/feedfeed`. You can pull and inspect any environment:

```bash
# Pull an environment image
docker pull minimaxai/feedfeed:<tag>

# Explore the workspace
docker run -it --rm minimaxai/feedfeed:<tag> /bin/bash
```

## 📊 Dataset Statistics

| Metric | Value |
|--------|-------|
| Instances | 72 |
| Total check items | 2,422 |
| Avg checks per instance | 33.6 |
| Unique environments | 34 |

**By Primary Category** (the main instruction source being tested):

| Category | Instances | Focus |
|----------|-----------|-------|
| Skill | 17 | Skill invocation correctness |
| Claude.md | 15 | Project documentation compliance |
| AGENTS.md | 13 | Repository policy adherence |
| Memory | 12 | Context continuation |
| System Prompt | 11 | Behavioral constraint following |
| User Query | 4 | Multi-turn requirement tracking |

**By Scaffold**:

| Scaffold | Version | Instances | Description |
|----------|---------|-----------|-------------|
| Claude Code | 2.0.69 | 54 | Anthropic's agentic coding tool |
| Kilo | 0.10.2 | 11 | Open-source VS Code extension |
| Droid | 0.42.2 | 7 | Factory.ai's software delivery platform |

## 📝 Data Format

Each instance is a JSON object with the following fields:

```json
{
  "instance_id": "md-course-builder-conventional-commits",
  "user_query": ["Implement the feature as specified..."],
  "system_prompt": "You are a CLI assistant...",
  "category": "Claude.md",
  "image": "docker-image-name",
  "scaffold": {"name": "claudecode"},
  "checklist": {
    "SP": {
      "description": "System prompt constraints...",
      "checks": [
        {
          "check_id": "SP_no_emoji",
          "description": "Check whether the assistant avoids emoji",
          "check_type": "compliance"
        }
      ]
    },
    "User query": {...}
  }
}
```

| Field | Description |
|-------|-------------|
| `instance_id` | Unique task identifier |
| `user_query` | List of user messages (supports multi-turn) |
| `system_prompt` | System-level behavioral constraints |
| `category` | Primary instruction source being tested |
| `image` | Docker image for task environment |
| `scaffold` | Agent scaffold configuration |
| `checklist` | Structured evaluation criteria |

## 💻 Usage

### 1. Load the Dataset

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("MiniMaxAI/OctoCodingBench")

# Filter by category
skill_tasks = [d for d in dataset["train"] if d["category"] == "Skill"]

# Filter by scaffold
claudecode_tasks = [d for d in dataset["train"] if d["scaffold"]["name"] == "claudecode"]
```

### 2. Evaluation Pipeline

The evaluation consists of three steps:

| Step | Description |
|------|-------------|
| **Environment Setup** | Pull Docker image and start task environment container |
| **Trajectory Collection** | Send `system_prompt` and `user_query` to the agent under test, collect full interaction trajectory |
| **Scoring** | Use LLM-as-Judge to perform binary evaluation based on checklist |

> ⚠️ **Note**: The complete evaluation scripts are under active development and will be open-sourced soon. Stay tuned for updates.

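While the official scripts are pending, the nested `checklist` from the Data Format section can be walked with plain Python, for example to build the list of check items a judge would see. The sketch below is illustrative only: `flatten_checklist` and `sample_instance` are hypothetical names, the sample mirrors the abbreviated Data Format example, and the `UQ_example` check is invented to stand in for the elided `"User query"` group.

```python
from typing import Any, Dict, List


def flatten_checklist(instance: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return one record per check item, tagged with its checklist group."""
    rows: List[Dict[str, str]] = []
    for group, body in instance.get("checklist", {}).items():
        # Guard against empty or missing groups (an assumption about the data).
        if not body:
            continue
        for check in body.get("checks", []):
            rows.append({
                "instance_id": instance["instance_id"],
                "group": group,
                "check_id": check["check_id"],
                "description": check["description"],
            })
    return rows


# Hypothetical instance shaped like the Data Format example.
sample_instance = {
    "instance_id": "md-course-builder-conventional-commits",
    "checklist": {
        "SP": {
            "description": "System prompt constraints...",
            "checks": [
                {
                    "check_id": "SP_no_emoji",
                    "description": "Check whether the assistant avoids emoji",
                    "check_type": "compliance",
                }
            ],
        },
        # Invented placeholder; the real contents of this group are elided above.
        "User query": {
            "description": "User query constraints...",
            "checks": [
                {
                    "check_id": "UQ_example",
                    "description": "Placeholder check for illustration",
                    "check_type": "compliance",
                }
            ],
        },
    },
}

rows = flatten_checklist(sample_instance)
for row in rows:
    print(f'{row["group"]}: {row["check_id"]}')
```

The same function can be applied to each record in `dataset["train"]`, assuming the loaded records keep the field names shown in the Data Format table.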
## ⚖️ Evaluation Metrics

| Metric | Definition | What it measures |
|--------|------------|------------------|
| **ISR** (Instance Success Rate) | 1 if ALL checks pass, 0 otherwise | End-to-end compliance: did the agent follow every rule? |
| **CSR** (Checkitem Success Rate) | Passed checks / Total checks | Fine-grained compliance: what proportion of rules were followed? |

## 🗓️ Roadmap

- [x] **Task Specifications, Checklists & Docker Environments**: released January 2026
- [ ] **Evaluation Code**: trajectory collection & LLM-as-judge scoring (coming soon)

## 🏆 Leaderboard

| Model | ISR (%) | CSR (%) |
|-------|---------|---------|
| Claude 4.5 Opus | 36.2 | 91.2 |
| MiniMax M2.1 | 26.1 | 89.2 |
| DeepSeek V3.2 | 26.0 | 90.4 |
| Gemini 3 Pro | 22.9 | 89.5 |
| Claude 4.5 Sonnet | 22.8 | 89.1 |
| GLM 4.6 | 19.2 | 87.6 |
| Kimi K2 Thinking | 16.8 | 86.4 |
| MiniMax M2 | 13.3 | 85.4 |

## 📜 Citation

```bibtex
@misc{octocodingbench2026,
  title={OctoCodingBench: Instruction-Following Benchmark for Coding Agents},
  author={MiniMax},
  year={2026},
  publisher={Hugging Face}
}
```
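The ISR and CSR definitions in the Evaluation Metrics table can be sketched in a few lines of Python. This is a minimal illustration, not the official scorer: `score_instances` and the verdicts are made up, and it assumes ISR is averaged over instances while CSR is pooled over all check items (the README does not say whether CSR is pooled or averaged per instance).

```python
from typing import Dict, List, Tuple


def score_instances(results: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Compute (ISR, CSR) from per-instance lists of pass/fail verdicts."""
    if not results:
        raise ValueError("no instances to score")

    # ISR: an instance counts as 1 only if every one of its checks passed.
    instance_passes = [all(checks) for checks in results.values()]
    isr = sum(instance_passes) / len(results)

    # CSR (pooled, by assumption): passed checks over total checks.
    total = sum(len(checks) for checks in results.values())
    passed = sum(sum(checks) for checks in results.values())
    csr = passed / total if total else 0.0
    return isr, csr


# Made-up judge verdicts for three hypothetical instances.
verdicts = {
    "task-a": [True, True, True],
    "task-b": [True, False, True, True],
    "task-c": [True, True],
}

isr, csr = score_instances(verdicts)
print(f"ISR = {isr:.1%}, CSR = {csr:.1%}")  # ISR = 66.7%, CSR = 88.9%
```

A single failed check drops an instance's ISR contribution to zero while barely moving CSR, which is consistent with the leaderboard showing CSR above 85% alongside ISR below 40%.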


Provider:

maas

Created:

2026-01-09
