AmanPriyanshu/tool-reasoning-sft-CODING-allenai-SERA-data-cleaned-rectified
Source: Hugging Face. Updated 2026-03-10; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/AmanPriyanshu/tool-reasoning-sft-CODING-allenai-SERA-data-cleaned-rectified
---
license: odc-by
task_categories:
- text-generation
language:
- en
tags:
- reasoning
- tool-calling
- agentic
- multi-turn
- coding
- swe-bench
- software-engineering
size_categories:
- 100K<n<1M
---
# SERA — Consolidated & Rectified
211,360 multi-turn SWE-agent coding trajectories from the SERA (Soft-Verified Efficient Repository Agents) project, consolidated from 4 source datasets into a single file with strict reasoning + tool-call format and validated FSM transitions.
## Origin
Derived from Allen AI's Open Coding Agents release:
| Source Dataset | Rows | Teacher | Scale | Rollout |
|---|---|---|---|---|
| [allenai/Sera-4.5A-Full-T1](https://huggingface.co/datasets/allenai/Sera-4.5A-Full-T1) | 72,118 | GLM-4.5-Air | full | T1 |
| [allenai/Sera-4.5A-Full-T2](https://huggingface.co/datasets/allenai/Sera-4.5A-Full-T2) | 66,337 | GLM-4.5-Air | full | T2 |
| [allenai/Sera-4.6-Lite-T1](https://huggingface.co/datasets/allenai/Sera-4.6-Lite-T1) | 36,825 | GLM-4.6 | lite | T1 |
| [allenai/Sera-4.6-Lite-T2](https://huggingface.co/datasets/allenai/Sera-4.6-Lite-T2) | 36,083 | GLM-4.6 | lite | T2 |
SERA uses **Soft Verified Generation (SVG)**, a two-rollout pipeline where a teacher model first makes a change to a codebase (T1), then attempts to reproduce that change from only a PR description (T2). Patches are compared using line-level recall for quality scoring — no test execution required.
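To make the scoring idea concrete, here is a minimal sketch of a line-level recall between two unified diffs. It assumes recall means the fraction of changed lines in the T1 patch that also appear in the T2 patch; the exact definition used by SERA may differ. `changed_lines` and `line_level_recall` are hypothetical helper names, and the two patches are made-up examples.

```python
from collections import Counter


def changed_lines(patch: str) -> Counter:
    """Collect added/removed lines from a unified diff, ignoring file headers."""
    lines = Counter()
    for line in patch.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header, not a content change
        if line.startswith(("+", "-")):
            # Keep the +/- sign so an addition never matches a removal.
            lines[line[0] + line[1:].strip()] += 1
    return lines


def line_level_recall(reference_patch: str, candidate_patch: str) -> float:
    """Fraction of the reference patch's changed lines found in the candidate."""
    ref = changed_lines(reference_patch)
    cand = changed_lines(candidate_patch)
    total = sum(ref.values())
    if total == 0:
        return 0.0  # assumption: an empty reference scores zero
    matched = sum((ref & cand).values())  # multiset intersection
    return matched / total


# Hypothetical T1 (reference) and T2 (reproduction) patches.
t1_patch = """--- a/calc.py
+++ b/calc.py
@@ -1,2 +1,3 @@
-def add(a, b): return a - b
+def add(a, b):
+    return a + b
"""
t2_patch = """--- a/calc.py
+++ b/calc.py
@@ -1,2 +1,2 @@
-def add(a, b): return a - b
+def add(a, b): return a + b
"""
print(line_level_recall(t1_patch, t2_patch))
```

Here only the removed line matches, so one of the three reference lines is recovered. No tests are run at any point, which is what makes the check "soft".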
**SERA-32B** (49.5% on SWE-bench Verified at 32K context) was trained on a 25,000-row subset of `Sera-4.6-Lite-T2` using standard SFT. Total training cost: ~$2,000.
📄 **Paper:** [SERA: Soft-Verified Efficient Repository Agents](https://arxiv.org/abs/2601.20789)
🔗 **Code:** [github.com/allenai/SERA](https://github.com/allenai/SERA)
## Format
Each row contains a structured multi-turn coding agent trajectory with native reasoning traces and validated tool calls.
### Message Roles
| Role | Content |
|---|---|
| `system` | Tool-use protocol + JSON tool schemas + SWE-agent instructions |
| `user` | Repository description + PR description + task instructions |
| `reasoning` | `<think>…</think>` — model's step-by-step reasoning (native, not synthesized) |
| `tool_call` | `<tool_call>{"name": "...", "arguments": {...}}</tool_call>` — function invocation |
| `tool_output` | `<tool_response>…</tool_response>` — tool execution result |
| `answer` | `<answer>…</answer>` — final submission |
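The wrapper tags in the table can be stripped with a small helper. This is a sketch that assumes each `content` string holds exactly one wrapped payload; `ROLE_TAGS` and `unwrap` are hypothetical names, and the `command` argument in the example is illustrative rather than taken from the dataset.

```python
import json
import re

# Wrapper tag per role, taken from the Message Roles table.
ROLE_TAGS = {
    "reasoning": "think",
    "tool_call": "tool_call",
    "tool_output": "tool_response",
    "answer": "answer",
}


def unwrap(message: dict):
    """Return the payload inside a message's wrapper tag.

    tool_call payloads are decoded to a dict; other tagged roles return text.
    system and user messages have no wrapper and are returned unchanged.
    """
    role, content = message["role"], message["content"]
    tag = ROLE_TAGS.get(role)
    if tag is None:
        return content
    match = re.fullmatch(rf"\s*<{tag}>(.*)</{tag}>\s*", content, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"{role} message is not wrapped in <{tag}>")
    payload = match.group(1).strip()
    return json.loads(payload) if role == "tool_call" else payload


# Illustrative message; the argument key is an assumption.
example = {
    "role": "tool_call",
    "content": '<tool_call>{"name": "bash", "arguments": {"command": "ls"}}</tool_call>',
}
call = unwrap(example)
print(call["name"], call["arguments"])
```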
### Trajectory Structure
```
system → user → reasoning → [tool_call → tool_output → reasoning →]* answer
```
Trajectories range from 43 to 340 turns (avg 130.6), with 13–112 tool calls per row (avg 42.2).
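The grammar above can be checked mechanically. The following sketch encodes it as a transition table and validates a role sequence; `TRANSITIONS`, `is_valid_trajectory`, and `expected_turns` are hypothetical names, not part of any released converter.

```python
# Allowed next roles for each role, read off the trajectory grammar:
# system → user → reasoning → [tool_call → tool_output → reasoning →]* answer
TRANSITIONS = {
    "system": {"user"},
    "user": {"reasoning"},
    "reasoning": {"tool_call", "answer"},
    "tool_call": {"tool_output"},
    "tool_output": {"reasoning"},
    "answer": set(),  # terminal state
}


def is_valid_trajectory(roles: list[str]) -> bool:
    """Check that a role sequence starts at system, ends at answer,
    and only uses transitions allowed by the grammar."""
    if not roles or roles[0] != "system" or roles[-1] != "answer":
        return False
    return all(
        nxt in TRANSITIONS.get(cur, set()) for cur, nxt in zip(roles, roles[1:])
    )


def expected_turns(tool_calls: int) -> int:
    """Each tool cycle adds 3 turns to the 4 fixed ones
    (system, user, first reasoning, answer)."""
    return 4 + 3 * tool_calls


roles = ["system", "user", "reasoning", "tool_call", "tool_output", "reasoning", "answer"]
print(is_valid_trajectory(roles))
print(expected_turns(13), expected_turns(112))
```

The reported statistics agree with this count: 13 tool calls give 43 turns, 112 give 340, and the average of 42.2 gives 130.6.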
## Schema
Single Parquet file with zstd compression.
| Column | Type | Description |
|---|---|---|
| `messages` | string | Converted trajectory (JSON list of `{role, content}`) |
| `instance_id` | string | Original trajectory ID |
| `teacher` | string | Teacher model: `GLM-4.5-Air` or `GLM-4.6` |
| `scale` | string | Generation scale: `full` (3 runs/function) or `lite` (1 run/function) |
| `rollout` | string | SVG stage: `T1` (initial change) or `T2` (reproduce from PR) |
| `func_name` | string | Function sampled from codebase to start the pipeline |
| `func_path` | string | File path to the sampled function |
| `line_level_recall` | float64 | Soft verification score (T2 only, null for T1) |
## Data Distribution
| Teacher | Scale | Rollout | Rows |
|---|---|---|---|
| GLM-4.5-Air | full | T1 | 72,118 |
| GLM-4.5-Air | full | T2 | 66,337 |
| GLM-4.6 | lite | T1 | 36,824 |
| GLM-4.6 | lite | T2 | 36,081 |
| **Total** | | | **211,360** |
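The per-split counts here differ slightly from the source counts in the Origin table. A quick arithmetic check, using only the numbers from the two tables, shows where the three dropped rows come from (the dictionary keys are shorthand labels chosen for this sketch):

```python
# Row counts from the Origin table (source) and the Data Distribution table (kept).
source = {"4.5A-Full-T1": 72_118, "4.5A-Full-T2": 66_337,
          "4.6-Lite-T1": 36_825, "4.6-Lite-T2": 36_083}
kept = {"4.5A-Full-T1": 72_118, "4.5A-Full-T2": 66_337,
        "4.6-Lite-T1": 36_824, "4.6-Lite-T2": 36_081}

# Rows lost per split during conversion.
dropped = {name: source[name] - kept[name] for name in source}
total_source, total_kept = sum(source.values()), sum(kept.values())

print(dropped)
print(total_source, total_kept, f"{total_kept / total_source:.3%}")
```

One row is missing from the GLM-4.6 lite T1 split and two from the lite T2 split; the GLM-4.5-Air splits are unchanged.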
## Tools
Three SWE-agent tools are available in every trajectory:
- **str_replace_editor** — file viewer/editor (view, create, str_replace, undo_edit)
- **bash** — terminal command execution
- **submit** — solution submission (converted to `answer` in canonical format)
## Conversion Details
- **Native reasoning preserved**: the `thought` field on assistant messages (containing `<think>...</think>` blocks) is used as the authoritative source for reasoning content. The `content` field contains the same text, so it is not copied a second time.
- **OpenAI-style `tool_calls`** with JSON-string arguments parsed into canonical `{"name", "arguments": dict}` format.
- **Submit actions** converted to `reasoning → answer` pairs rather than tool_call/tool_output cycles.
- **Trailing duplicate submits** trimmed: models sometimes call `submit` 2–5 times at the end of a trajectory — only the first is kept.
- **Mid-trajectory text responses** merged: when the model emits a text-only response (reasoning → answer) then continues with more tool calls, the answer is folded back into reasoning to maintain valid FSM transitions.
- **Empty reasoning** filled from a pool of 12 template variations when the source message had no thought content.
- **99.999% conversion rate** (211,360 / 211,363 source rows, 3 dropped due to deeply nested mid-trajectory answer patterns).
- All 4 source datasets use identical message structure — one converter handles all of them.
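Two of the steps above, argument parsing and trailing-submit trimming, can be sketched as follows. This is not the converter used to build the dataset: `to_canonical` and `trim_trailing_submits` are hypothetical helpers, the input follows the standard OpenAI `function` / `arguments` layout, and the example assumes the tool calls have already been collected into a flat list.

```python
import json


def to_canonical(tool_call: dict) -> dict:
    """Turn an OpenAI-style tool call (arguments as a JSON string)
    into {"name": ..., "arguments": dict}."""
    fn = tool_call["function"]
    args = fn["arguments"]
    if isinstance(args, str):
        # Assumption: an empty argument string means "no arguments".
        args = json.loads(args) if args.strip() else {}
    return {"name": fn["name"], "arguments": args}


def trim_trailing_submits(calls: list[dict]) -> list[dict]:
    """Keep only the first submit of a trailing run of submit calls."""
    end = len(calls)
    while (
        end >= 2
        and calls[end - 1]["name"] == "submit"
        and calls[end - 2]["name"] == "submit"
    ):
        end -= 1
    return calls[:end]


# Illustrative input; the bash command is made up.
raw = [
    {"function": {"name": "bash", "arguments": '{"command": "pytest -q"}'}},
    {"function": {"name": "submit", "arguments": "{}"}},
    {"function": {"name": "submit", "arguments": "{}"}},
]
calls = trim_trailing_submits([to_canonical(c) for c in raw])
print(json.dumps(calls))
```

After trimming, the remaining `submit` would then be rewritten as the final `answer` turn, as described in the list above.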
## Filtering Guide
The metadata columns enable targeted filtering:
```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

t = pq.read_table("data.parquet")
# Only T2 trajectories (SERA-32B was trained on a subset of the GLM-4.6 lite T2 rows)
t2 = t.filter(pc.field("rollout") == "T2")
# Only high-quality verified trajectories; T1 rows have a null score and are dropped
verified = t.filter(pc.field("line_level_recall") > 0.75)
# Only GLM-4.6 teacher (stronger model)
glm46 = t.filter(pc.field("teacher") == "GLM-4.6")
```
## Usage
```py
import json, random
from datasets import load_dataset
ds = load_dataset("AmanPriyanshu/tool-reasoning-sft-CODING-allenai-SERA-data-cleaned-rectified", split="train")
print(f"Loaded: {len(ds):,} rows\n")
idx = random.randint(0, len(ds) - 1)
row = ds[idx]
msgs = json.loads(row["messages"])
print(f"Row {idx} | teacher={row['teacher']} | scale={row['scale']} | rollout={row['rollout']} | {len(msgs)} turns")
print(f"instance_id: {row['instance_id']}")
print(f"func_name: {row['func_name']}")
print(f"func_path: {row['func_path']}")
print(f"line_recall: {row['line_level_recall']}")
print(f"Roles: {' -> '.join(m['role'] for m in msgs[:20])}{'...' if len(msgs)>20 else ''}\n")
for m in msgs:
    content = m["content"]
    if m["role"] == "system":
        content = content[:200] + "..."
    elif len(content) > 300:
        content = content[:300] + "..."
    print(f"[{m['role']}]\n{content}\n")
```
## License
This dataset is licensed under the **Open Data Commons Attribution License v1.0 (ODC-By)**, consistent with the source datasets. It is intended for research and educational use and may be used commercially with attribution.
## Citation
```bibtex
@misc{shen2026sera,
title={SERA: Soft-Verified Efficient Repository Agents},
author={Ethan Shen and Danny Tormoen and Saurabh Shah and Ali Farhadi and Tim Dettmers},
year={2026},
eprint={2601.20789},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.20789},
}
```
Provider: AmanPriyanshu



