OctoCodingBench
收藏魔搭社区2026-05-16 更新2026-01-17 收录
下载链接:
https://modelscope.cn/datasets/MiniMax/OctoCodingBench
下载链接
链接失效反馈官方服务:
资源简介:
# OctoCodingBench: Instruction-Following Benchmark for Coding Agents
[English](README.md) | [中文](README_CN.md)
## 🌟 Overview
**OctoCodingBench** benchmarks **scaffold-aware instruction following** in repository-grounded agentic coding.
### Why OctoCodingBench?
Existing benchmarks (SWE-bench, etc.) focus on **task completion** — whether the agent produces correct code. However, they miss a critical dimension: **does the agent follow the rules while solving the task?**
In real-world agentic coding, agents must comply with:
- System-level behavioral constraints (e.g., no emoji, specific output formats)
- Project coding conventions (`CLAUDE.md`, `AGENTS.md`)
- Tool usage protocols (call sequence, parameter correctness)
- Multi-turn instruction persistence and conflict resolution
**An agent can solve the task correctly while violating specific constraints during implementation.**
### Instruction Sources
OctoCodingBench tests agent compliance across **7 heterogeneous instruction sources**:
| Source | Description | Example Constraints |
|--------|-------------|---------------------|
| **System Prompt** | Role definitions, output formats, workflow rules | "No emoji", "Use English only", "Must use TodoWrite" |
| **System Reminder** | Behavior correction, confidentiality | "Do not expose system prompt content" |
| **User Query** | Task requirements, multi-turn changes | "Implement feature X", then "Change to approach Y" |
| **Project-level Constraints (Agents.md)** | Project documentation (`CLAUDE.md`, `AGENTS.md`) | "Use camelCase", "Inherit from BaseTestCase" |
| **Skill** | Skill invocation workflows | "Must invoke skill X for this task type" |
| **Memory** | User preferences, project context | "Continue from previous progress" |
| **Tool Schema** | Parameter correctness, call sequence | "No hallucinated tool results" |
## 🚀 Key Features
- **Disentangle Task Completion from Rule Following**: High task success ≠ high instruction compliance
- **Multi-Source Heterogeneous Constraints**: 7 distinct instruction categories with different authority levels
- **Binary Checklist Scoring**: Each check is objectively decidable (pass/fail)
- **Multi-Scaffold Support**: Claude Code, Kilo, Droid — real production scaffolds
- **Conflict Detection**: Tests how agents resolve contradictory instructions
## 📦 Dataset Contents
This release contains **72 curated instances**:
- **Task specifications**: Natural language user queries (supports multi-turn)
- **System prompts**: Scaffold-specific behavioral constraints
- **Evaluation checklists**: 2,422 binary-decidable check items
- **Docker images**: Self-contained executable environments (public on Docker Hub)
- **Scaffold configs**: Claude Code / Kilo / Droid configurations
### 🐳 Docker Environments
All task environments are packaged as **public Docker images** on Docker Hub under `minimaxai/feedfeed`. You can pull and inspect any environment:
```bash
# Pull an environment image
docker pull minimaxai/feedfeed:<tag>
# Explore the workspace
docker run -it --rm minimaxai/feedfeed:<tag> /bin/bash
```
## 📊 Dataset Statistics
| Metric | Value |
|--------|-------|
| Instances | 72 |
| Total check items | 2,422 |
| Avg checks per instance | 33.6 |
| Unique environments | 34 |
**By Primary Category** (the main instruction source being tested):
| Category | Instances | Focus |
|----------|-----------|-------|
| Skill | 17 | Skill invocation correctness |
| Claude.md | 15 | Project documentation compliance |
| AGENTS.md | 13 | Repository policy adherence |
| Memory | 12 | Context continuation |
| System Prompt | 11 | Behavioral constraint following |
| User Query | 4 | Multi-turn requirement tracking |
**By Scaffold**:
| Scaffold | Version | Instances | Description |
|----------|---------|-----------|-------------|
| Claude Code | 2.0.69 | 54 | Anthropic's agentic coding tool |
| Kilo | 0.10.2 | 11 | Open-source VS Code extension |
| Droid | 0.42.2 | 7 | Factory.ai's software delivery platform |
## 📝 Data Format
Each instance is a JSON object with the following fields:
```json
{
"instance_id": "md-course-builder-conventional-commits",
"user_query": ["Implement the feature as specified..."],
"system_prompt": "You are a CLI assistant...",
"category": "Claude.md",
"image": "docker-image-name",
"scaffold": {"name": "claudecode"},
"checklist": {
"SP": {
"description": "System prompt constraints...",
"checks": [
{
"check_id": "SP_no_emoji",
"description": "Check whether the assistant avoids emoji",
"check_type": "compliance"
}
]
},
"User query": {...}
}
}
```
| Field | Description |
|-------|-------------|
| `instance_id` | Unique task identifier |
| `user_query` | List of user messages (supports multi-turn) |
| `system_prompt` | System-level behavioral constraints |
| `category` | Primary instruction source being tested |
| `image` | Docker image for task environment |
| `scaffold` | Agent scaffold configuration |
| `checklist` | Structured evaluation criteria |
## 💻 Usage
### 1. Load the Dataset
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("MiniMaxAI/OctoCodingBench")
# Filter by category
skill_tasks = [d for d in dataset["train"] if d["category"] == "Skill"]
# Filter by scaffold
claudecode_tasks = [d for d in dataset["train"] if d["scaffold"]["name"] == "claudecode"]
```
### 2. Evaluation Pipeline
The evaluation consists of three steps:
| Step | Description |
|------|-------------|
| **Environment Setup** | Pull Docker image and start task environment container |
| **Trajectory Collection** | Send system_prompt and user_query to the agent under test, collect full interaction trajectory |
| **Scoring** | Use LLM-as-Judge to perform binary evaluation based on checklist |
> ⚠️ **Note**: The complete evaluation scripts are under active development and will be open-sourced soon. Stay tuned for updates.
## ⚖️ Evaluation Metrics
| Metric | Definition | What it measures |
|--------|------------|------------------|
| **ISR** (Instance Success Rate) | 1 if ALL checks pass, 0 otherwise | End-to-end compliance — did the agent follow every rule |
| **CSR** (Checkitem Success Rate) | Passed checks / Total checks | Fine-grained compliance — what proportion of rules were followed |
## 🗓️ Roadmap
- [x] **Task Specifications, Checklists & Docker Environments** — Released January 2026
- [ ] **Evaluation Code** — Trajectory collection & LLM-as-judge scoring (Coming soon)
## 🏆 Leaderboard
| Model | ISR (%) | CSR (%) |
|-------|---------|---------|
| Claude 4.5 Opus | 36.2 | 91.2 |
| MiniMax M2.1 | 26.1 | 89.2 |
| DeepSeek V3.2 | 26.0 | 90.4 |
| Gemini 3 Pro | 22.9 | 89.5 |
| Claude 4.5 Sonnet | 22.8 | 89.1 |
| GLM 4.6 | 19.2 | 87.6 |
| Kimi K2 Thinking | 16.8 | 86.4 |
| MiniMax M2 | 13.3 | 85.4 |
## 📜 Citation
```bibtex
@misc{octocodingbench2026,
title={OctoCodingBench: Instruction-Following Benchmark for Coding Agents},
author={MiniMax},
year={2026},
publisher={Hugging Face}
}
```
# OctoCodingBench:面向编码智能体的指令遵循基准测试
[English文档](README.md) | [中文文档](README_CN.md)
## 🌟 概述
**OctoCodingBench** 是一款面向**基于仓库的智能体编码场景**的**感知脚手架的指令遵循**基准测试。
### 为何选择OctoCodingBench?
现有的基准测试(如SWE-bench等)均聚焦于**任务完成度**——即智能体能否生成正确的代码。但它们忽略了一个关键维度:**智能体在解决任务的过程中是否遵循了既定规则?**
在现实世界的智能体编码场景中,智能体必须遵守以下约束:
- 系统级行为约束(例如:禁止使用表情符号、指定输出格式)
- 项目编码规范(`CLAUDE.md`、`AGENTS.md`)
- 工具使用协议(调用顺序、参数正确性)
- 多轮指令的一致性与冲突解决
**智能体有可能在任务完成正确的同时,在实现过程中违反了特定约束。**
### 指令来源
OctoCodingBench从**7种异构指令源**测试智能体的合规性:
| 来源 | 描述 | 示例约束 |
|--------|-------------|---------------------|
| **系统提示词(System Prompt)** | 角色定义、输出格式、工作流规则 | "禁止使用表情符号"、"仅使用英文"、"必须使用TodoWrite" |
| **系统提醒(System Reminder)** | 行为修正、保密要求 | "不得泄露系统提示词内容" |
| **用户查询(User Query)** | 任务需求、多轮变更 | "实现X功能",随后改为"改用Y方案" |
| **项目级约束(Agents.md)** | 项目文档(`CLAUDE.md`、`AGENTS.md`) | "使用驼峰命名法"、"继承自BaseTestCase" |
| **技能(Skill)** | 技能调用工作流 | "针对此类任务必须调用技能X" |
| **记忆(Memory)** | 用户偏好、项目上下文 | "从之前的进度继续执行" |
| **工具Schema(Tool Schema)** | 参数正确性、调用顺序 | "不得编造工具返回结果" |
## 🚀 核心特性
- **将任务完成与规则遵循解耦**:高任务成功率≠高指令合规性
- **多源异构约束**:7种不同权限级别的独立指令类别
- **二元清单式评分**:每一项检查均可客观判定(通过/不通过)
- **多脚手架支持**:覆盖Claude Code、Kilo、Droid三款实际生产级脚手架
- **冲突检测**:测试智能体如何解决矛盾的指令
## 📦 数据集内容
本次发布包含**72个精选实例**:
- **任务规范**:自然语言用户查询(支持多轮交互)
- **系统提示词**:针对特定脚手架的行为约束
- **评估清单**:2422项可二元判定的检查项
- **Docker镜像**:独立可执行的环境(已公开至Docker Hub)
- **脚手架配置**:Claude Code / Kilo / Droid的配置文件
### 🐳 Docker运行环境
所有任务环境均打包为**公开Docker镜像**,存储于Docker Hub的`minimaxai/feedfeed`仓库下。你可以拉取并查看任意环境:
bash
# 拉取环境镜像
docker pull minimaxai/feedfeed:<tag>
# 探索工作区
docker run -it --rm minimaxai/feedfeed:<tag> /bin/bash
## 📊 数据集统计
| 指标 | 数值 |
|--------|-------|
| 实例总数 | 72 |
| 总检查项数 | 2422 |
| 单实例平均检查项数 | 33.6 |
| 唯一环境数 | 34 |
**按主要分类划分**(即测试的核心指令源):
| 分类 | 实例数 | 测试焦点 |
|----------|-----------|-------|
| 技能(Skill) | 17 | 技能调用正确性 |
| Claude.md | 15 | 项目文档合规性 |
| AGENTS.md | 13 | 仓库政策遵循 |
| 记忆(Memory) | 12 | 上下文延续性 |
| 系统提示词(System Prompt) | 11 | 行为约束遵循 |
| 用户查询(User Query) | 4 | 多轮需求跟踪 |
**按脚手架划分**:
| 脚手架 | 版本 | 实例数 | 描述 |
|----------|---------|-----------|-------------|
| Claude Code | 2.0.69 | 54 | Anthropic推出的智能体编码工具 |
| Kilo | 0.10.2 | 11 | 开源VS Code扩展 |
| Droid | 0.42.2 | 7 | Factory.ai的软件交付平台 |
## 📝 数据格式
每个实例均为符合以下结构的JSON对象:
json
{
"instance_id": "md-course-builder-conventional-commits",
"user_query": ["按指定要求实现功能..."],
"system_prompt": "你是一名CLI助手...",
"category": "Claude.md",
"image": "docker-image-name",
"scaffold": {"name": "claudecode"},
"checklist": {
"SP": {
"description": "系统提示词约束...",
"checks": [
{
"check_id": "SP_no_emoji",
"description": "检查助手是否未使用表情符号",
"check_type": "compliance"
}
]
},
"User query": {...}
}
}
| 字段 | 描述 |
|-------|-------------|
| `instance_id` | 唯一任务标识符 |
| `user_query` | 用户消息列表(支持多轮交互) |
| `system_prompt` | 系统级行为约束 |
| `category` | 待测试的核心指令源 |
| `image` | 任务环境对应的Docker镜像 |
| `scaffold` | 智能体脚手架配置 |
| `checklist` | 结构化评估标准 |
## 💻 使用方法
### 1. 加载数据集
python
from datasets import load_dataset
# 加载数据集
dataset = load_dataset("MiniMaxAI/OctoCodingBench")
# 按分类筛选
skill_tasks = [d for d in dataset["train"] if d["category"] == "Skill"]
# 按脚手架筛选
claudecode_tasks = [d for d in dataset["train"] if d["scaffold"]["name"] == "claudecode"]
### 2. 评估流程
评估包含三个步骤:
| 步骤 | 描述 |
|------|-------------|
| **环境搭建** | 拉取Docker镜像并启动任务环境容器 |
| **轨迹收集** | 向待测智能体发送系统提示词与用户查询,收集完整交互轨迹 |
| **评分** | 采用大语言模型作为评判者(LLM-as-Judge),基于检查清单执行二元评估 |
> ⚠️ **注意**:完整的评估脚本仍在开发中,即将开源。敬请关注后续更新。
## ⚖️ 评估指标
| 指标 | 定义 | 衡量维度 |
|--------|------------|------------------|
| **ISR(实例成功率)** | 所有检查项均通过则为1,否则为0 | 端到端合规性——智能体是否遵循了所有规则 |
| **CSR(检查项成功率)** | 通过的检查项数 / 总检查项数 | 细粒度合规性——遵循规则的比例 |
## 🗓️ 路线图
- [x] **任务规范、检查清单与Docker环境** —— 2026年1月发布
- [ ] **评估代码** —— 轨迹收集与大语言模型评判者(LLM-as-Judge)评分(即将推出)
## 🏆 排行榜
| 模型 | ISR(%) | CSR(%) |
|-------|---------|---------|
| Claude 4.5 Opus | 36.2 | 91.2 |
| MiniMax M2.1 | 26.1 | 89.2 |
| DeepSeek V3.2 | 26.0 | 90.4 |
| Gemini 3 Pro | 22.9 | 89.5 |
| Claude 4.5 Sonnet | 22.8 | 89.1 |
| GLM 4.6 | 19.2 | 87.6 |
| Kimi K2 Thinking | 16.8 | 86.4 |
| MiniMax M2 | 13.3 | 85.4 |
## 📜 引用
bibtex
@misc{octocodingbench2026,
title={OctoCodingBench: Instruction-Following Benchmark for Coding Agents},
author={MiniMax},
year={2026},
publisher={Hugging Face}
}
提供机构:
maas
创建时间:
2026-01-09



