
AmanPriyanshu/tool-reasoning-sft-CODING-allenai-SERA-data-cleaned-rectified

Hugging Face | Updated 2026-03-10 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/AmanPriyanshu/tool-reasoning-sft-CODING-allenai-SERA-data-cleaned-rectified
Resource description:
---
license: odc-by
task_categories:
- text-generation
language:
- en
tags:
- reasoning
- tool-calling
- agentic
- multi-turn
- coding
- swe-bench
- software-engineering
size_categories:
- 100K<n<1M
---

# SERA: Consolidated & Rectified

211,360 multi-turn SWE-agent coding trajectories from the SERA (Soft-Verified Efficient Repository Agents) project, consolidated from 4 source datasets into a single file with strict reasoning + tool-call format and validated FSM transitions.

## Origin

Derived from Allen AI's Open Coding Agents release:

| Source Dataset | Rows | Teacher | Scale | Rollout |
|---|---|---|---|---|
| [allenai/Sera-4.5A-Full-T1](https://huggingface.co/datasets/allenai/Sera-4.5A-Full-T1) | 72,118 | GLM-4.5-Air | full | T1 |
| [allenai/Sera-4.5A-Full-T2](https://huggingface.co/datasets/allenai/Sera-4.5A-Full-T2) | 66,337 | GLM-4.5-Air | full | T2 |
| [allenai/Sera-4.6-Lite-T1](https://huggingface.co/datasets/allenai/Sera-4.6-Lite-T1) | 36,825 | GLM-4.6 | lite | T1 |
| [allenai/Sera-4.6-Lite-T2](https://huggingface.co/datasets/allenai/Sera-4.6-Lite-T2) | 36,083 | GLM-4.6 | lite | T2 |

SERA uses **Soft Verified Generation (SVG)**, a two-rollout pipeline where a teacher model first makes a change to a codebase (T1), then attempts to reproduce that change from only a PR description (T2). Patches are compared using line-level recall for quality scoring, so no test execution is required.

**SERA-32B** (49.5% on SWE-bench Verified at 32K context) was trained on a 25,000-row subset of `Sera-4.6-Lite-T2` using standard SFT. Total training cost: ~$2,000.

📄 **Paper:** [SERA: Soft-Verified Efficient Repository Agents](https://arxiv.org/abs/2601.20789)

🔗 **Code:** [github.com/allenai/SERA](https://github.com/allenai/SERA)

## Format

Each row contains a structured multi-turn coding agent trajectory with native reasoning traces and validated tool calls.

### Message Roles

| Role | Content |
|---|---|
| `system` | Tool-use protocol + JSON tool schemas + SWE-agent instructions |
| `user` | Repository description + PR description + task instructions |
| `reasoning` | `<think>…</think>`: model's step-by-step reasoning (native, not synthesized) |
| `tool_call` | `<tool_call>{"name": "...", "arguments": {...}}</tool_call>`: function invocation |
| `tool_output` | `<tool_response>…</tool_response>`: tool execution result |
| `answer` | `<answer>…</answer>`: final submission |

### Trajectory Structure

```
system → user → reasoning → [tool_call → tool_output → reasoning →]* answer
```

Trajectories range from 43 to 340 turns (avg 130.6), with 13–112 tool calls per row (avg 42.2).

## Schema

Single Parquet file with zstd compression.

| Column | Type | Description |
|---|---|---|
| `messages` | string | Converted trajectory (JSON list of `{role, content}`) |
| `instance_id` | string | Original trajectory ID |
| `teacher` | string | Teacher model: `GLM-4.5-Air` or `GLM-4.6` |
| `scale` | string | Generation scale: `full` (3 runs/function) or `lite` (1 run/function) |
| `rollout` | string | SVG stage: `T1` (initial change) or `T2` (reproduce from PR) |
| `func_name` | string | Function sampled from codebase to start the pipeline |
| `func_path` | string | File path to the sampled function |
| `line_level_recall` | float64 | Soft verification score (T2 only, null for T1) |

## Data Distribution

| Teacher | Scale | Rollout | Rows |
|---|---|---|---|
| GLM-4.5-Air | full | T1 | 72,118 |
| GLM-4.5-Air | full | T2 | 66,337 |
| GLM-4.6 | lite | T1 | 36,824 |
| GLM-4.6 | lite | T2 | 36,081 |
| **Total** | | | **211,360** |

## Tools

3 SWE-agent tools available in every trajectory:

- **str_replace_editor**: file viewer/editor (view, create, str_replace, undo_edit)
- **bash**: terminal command execution
- **submit**: solution submission (converted to `answer` in canonical format)

## Conversion Details

- **Native reasoning preserved**: the `thought` field on assistant messages (containing `<think>...</think>` blocks) is used as the authoritative source for reasoning content. No duplication from the `content` field which contains the same text.
- **OpenAI-style `tool_calls`** with JSON-string arguments parsed into canonical `{"name", "arguments": dict}` format.
- **Submit actions** converted to `reasoning → answer` pairs rather than tool_call/tool_output cycles.
- **Trailing duplicate submits** trimmed: models sometimes call `submit` 2–5 times at the end of a trajectory; only the first is kept.
- **Mid-trajectory text responses** merged: when the model emits a text-only response (reasoning → answer) then continues with more tool calls, the answer is folded back into reasoning to maintain valid FSM transitions.
- **Empty reasoning** filled from a pool of 12 template variations when the source message had no thought content.
- **99.999% conversion rate** (211,360 / 211,363 source rows, 3 dropped due to deeply nested mid-trajectory answer patterns).
- All 4 source datasets use identical message structure, so one converter handles all of them.
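
The FSM transitions mentioned above follow the pattern given under Trajectory Structure. As a rough illustration, the check could look like the sketch below. It assumes the allowed transitions are exactly those implied by that pattern; the `TRANSITIONS` table and the `is_valid_trajectory` helper are hypothetical names for this example, not part of the dataset's actual converter.

```python
# Allowed role transitions, transcribed from the documented pattern:
# system → user → reasoning → [tool_call → tool_output → reasoning →]* answer
# (Assumption: no other transitions are permitted.)
TRANSITIONS = {
    "system": {"user"},
    "user": {"reasoning"},
    "reasoning": {"tool_call", "answer"},
    "tool_call": {"tool_output"},
    "tool_output": {"reasoning"},
    "answer": set(),  # terminal state
}


def is_valid_trajectory(messages):
    """Return True if the role sequence starts at `system`, ends at
    `answer`, and every consecutive pair is an allowed transition."""
    roles = [m["role"] for m in messages]
    if not roles or roles[0] != "system" or roles[-1] != "answer":
        return False
    return all(
        nxt in TRANSITIONS.get(cur, set())
        for cur, nxt in zip(roles, roles[1:])
    )


# Toy trajectory with a single tool-call cycle (contents left empty).
toy_roles = ["system", "user", "reasoning", "tool_call",
             "tool_output", "reasoning", "answer"]
toy = [{"role": r, "content": ""} for r in toy_roles]
print(is_valid_trajectory(toy))
```

For a real row, the `messages` column is a JSON string, so it would need to be decoded with `json.loads` before being passed to a helper like this.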

## Filtering Guide

The metadata columns enable targeted filtering:

```python
import pyarrow.compute as pc  # provides pc.field for filter expressions
import pyarrow.parquet as pq

t = pq.read_table("data.parquet")

# Only T2 trajectories (SERA-32B was trained on a subset of the GLM-4.6 lite T2 rows)
t2 = t.filter(pc.field("rollout") == "T2")

# Only high-quality verified trajectories (T1 rows have a null score and are excluded)
verified = t.filter(pc.field("line_level_recall") > 0.75)

# Only GLM-4.6 teacher (stronger model)
glm46 = t.filter(pc.field("teacher") == "GLM-4.6")
```

## Usage

```py
import json, random
from datasets import load_dataset

# Repository ID as given in the dataset title and download link
ds = load_dataset("AmanPriyanshu/tool-reasoning-sft-CODING-allenai-SERA-data-cleaned-rectified", split="train")
print(f"Loaded: {len(ds):,} rows\n")

# Pick a random row and decode its JSON-encoded message list
idx = random.randint(0, len(ds) - 1)
row = ds[idx]
msgs = json.loads(row["messages"])

print(f"Row {idx} | teacher={row['teacher']} | scale={row['scale']} | rollout={row['rollout']} | {len(msgs)} turns")
print(f"instance_id: {row['instance_id']}")
print(f"func_name: {row['func_name']}")
print(f"func_path: {row['func_path']}")
print(f"line_recall: {row['line_level_recall']}")
print(f"Roles: {' -> '.join(m['role'] for m in msgs[:20])}{'...' if len(msgs)>20 else ''}\n")

# Print each message, truncating long content for readability
for m in msgs:
    content = m["content"]
    if m["role"] == "system":
        content = content[:200] + "..."
    elif len(content) > 300:
        content = content[:300] + "..."
    print(f"[{m['role']}]\n{content}\n")
```

## License

This dataset is licensed under the **Open Data Commons Attribution License v1.0 (ODC-By)**, consistent with the source datasets. It is intended for research and educational use and may be used commercially with attribution.

## Citation

```bibtex
@misc{shen2026sera,
  title={SERA: Soft-Verified Efficient Repository Agents},
  author={Ethan Shen and Danny Tormoen and Saurabh Shah and Ali Farhadi and Tim Dettmers},
  year={2026},
  eprint={2601.20789},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.20789},
}
```

Provider:
AmanPriyanshu