AmanPriyanshu/tool-reasoning-sft-RESEARCH-OpenHands-CodeScout_Training_Rollouts
收藏Hugging Face2026-03-24 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/AmanPriyanshu/tool-reasoning-sft-RESEARCH-OpenHands-CodeScout_Training_Rollouts
下载链接
链接失效反馈官方服务:
资源简介:
---
task_categories:
- text-generation
language:
- en
tags:
- reasoning
- tool-calling
- agentic
- multi-turn
- code-localization
- bash-tools
- mid-training
- reinforcement-learning
- software-engineering
- SWE-Bench
size_categories:
- 10K<n<100K
---
# CodeScout Training Rollouts — Cleaned & Rectified
~40K multi-turn code localization agent trajectories converted into a strict reasoning + tool-call format with validated FSM transitions. Supports coupled (parallel) tool calls.
> **⚠️ Mid-training dataset.** This dataset contains synthesized reasoning templates (not native chain-of-thought). It is suitable for **mid-training** to teach tool-use mechanics, FSM structure, and bash exploration patterns. It is **not recommended as a final SFT dataset** for reasoning capabilities — use datasets with native `<think>` blocks (e.g., REDSearcher, OpenSeeker) for that purpose.
## Origin
Derived from [OpenHands/CodeScout_Training_Rollouts](https://huggingface.co/datasets/OpenHands/CodeScout_Training_Rollouts) (both `CodeScout_14B` and `CodeScout_4B` configs).
### About CodeScout
CodeScout is a family of open-source RL-trained code search agents that achieve **state-of-the-art repository-level code localization** on SWE-Bench using nothing more than a standard Unix terminal — no static analysis, no repository graphs, no language-specific tooling. The models are trained with **GSPO (Group Sequence Policy Optimization)** using multi-level F1 rewards at the file, module, and function level.
**Key results (SWE-Bench Verified):**
| Model | File F1 | Function F1 |
|---|---|---|
| CodeScout-14B | 68.57 | 40.32 |
| CodeScout-4B | 68.52 | 36.78 |
| CodeScout-1.7B | 55.46 | 28.22 |
CodeScout-14B outperforms 2–18× larger base and post-trained LLMs across all benchmarks, surpasses GPT-5 and approaches Claude Sonnet 4.5 using RepoNavigator, and achieves 8–33% higher function-level F1 than Qwen3-32B (Thinking).
The training rollouts in this dataset come from 9,600 SWE-Smith instances across 128 repositories, with 4 rollouts per instance and up to 4 turns per episode.
📄 **Paper:** [CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents](https://arxiv.org/abs/2603.17829)
💻 **Code:** [OpenHands](https://github.com/All-Hands-AI/OpenHands)
🤗 **Models:** [CodeScout-14B](https://huggingface.co/OpenHands/CodeScout-14B) · [CodeScout-4B](https://huggingface.co/OpenHands/CodeScout-4B) · [CodeScout-1.7B](https://huggingface.co/OpenHands/CodeScout-1.7B)
## Format
Each row contains a structured multi-turn conversation with synthesized reasoning traces and validated tool calls.
### Message Roles
| Role | Content |
|---|---|
| `system` | Tool-use protocol + JSON tool schemas + code localization instructions |
| `user` | Issue description + repository path |
| `reasoning` | `<think>…</think>` — synthesized reasoning (template-based, not native) |
| `tool_call` | `<tool_call>{"name": "...", "arguments": {...}}</tool_call>` — one or more per turn |
| `tool_output` | `<tool_response>…</tool_response>` — one per tool_call, in matching order |
| `answer` | `<answer>…</answer>` — final confirmation |
### Coupled Tool Calls
This dataset introduces **coupled (parallel) tool calls**: a single `tool_call` turn may contain multiple `<tool_call>` blocks, with corresponding multiple `<tool_response>` blocks in the following `tool_output` turn:
```
tool_call:
<tool_call>{"name": "terminal", "arguments": {"command": "rg 'pattern1'"}}</tool_call>
<tool_call>{"name": "terminal", "arguments": {"command": "find . -name '*.py'"}}</tool_call>
tool_output:
<tool_response>...results for pattern1...</tool_response>
<tool_response>...results for find...</tool_response>
```
Roughly 60% of multi-call assistant messages are kept coupled; 40% are split into sequential turns with bridge reasoning.
### Trajectory Structure
```
system → user → reasoning → [tool_call → tool_output → reasoning →]* answer
```
## Schema
Single Parquet file with zstd compression.
| Column | Type | Description |
|---|---|---|
| `messages` | string | Converted conversation (JSON list of `{role, content}`) |
| `config` | string | Source config: `CodeScout_14B` or `CodeScout_4B` |
## Tools
2 tools available per trajectory:
| Tool | Description |
|---|---|
| `terminal` | Execute bash commands (`rg`, `grep`, `find`, `cat`, `sed`, `head`, `tail`, `wc`) in a persistent shell session |
| `localization_finish` | Submit structured code localization results (`file`, `class_name`, `function_name`) |
## Filtering
From the original 54,845 rows (39,040 from CodeScout-14B + 15,805 from CodeScout-4B):
- Only rows with **total reward > 0** retained (~72-77% of each config)
- Zero-reward rollouts (completely failed localization attempts) dropped
- Reward is multi-level F1 across file, module, entity, and multiturn components
## Conversion Details
- Source uses OpenAI function-calling format with `content` as list of `{type, text}` blocks — flattened to plain strings
- Assistant messages have `content: null` + `tool_calls` — **all reasoning is synthesized** from 4 domain-appropriate template pools (12 variations each): initial exploration, bridge reasoning, final reasoning, tail answers
- Parallel tool calls (up to 5 per assistant message) randomly coupled (60%) or split (40%) into sequential turns
- `localization_finish` tool calls treated as regular tool_call → tool_output cycles
- `tool → tool` source transitions (from parallel responses) consumed into coupled `tool_output` turns
- Conversations ending on `tool` role (299/300 in source) get tail reasoning + answer appended
- Two validation layers: FSM transition check + content-tag non-empty check
## Why Mid-Training Only
This dataset teaches:
- ✅ Correct FSM structure and tool-use syntax
- ✅ Bash tool patterns for code exploration (`rg`, `sed`, `find`, `cat`)
- ✅ Multi-tool orchestration and coupled parallel execution
- ✅ Structured output submission patterns
- ✅ Repository navigation strategies
But lacks:
- ❌ Native chain-of-thought reasoning (all `<think>` blocks are templates)
- ❌ Genuine analysis of tool outputs
- ❌ Hypothesis formation and evaluation
## Usage
```python
import json, random, re
from datasets import load_dataset
VALID_NEXT = {
"system": {"user"}, "user": {"reasoning"},
"reasoning": {"tool_call", "answer"}, "tool_call": {"tool_output"},
"tool_output": {"reasoning"}, "answer": {"user"},
}
ds = load_dataset("AmanPriyanshu/tool-reasoning-sft-RESEARCH-OpenHands-CodeScout_Training_Rollouts", split="train")
print(f"Loaded: {len(ds):,} rows\n")
idx = random.randint(0, len(ds) - 1)
row = ds[idx]
msgs = json.loads(row["messages"])
cfg = row["config"]
roles = [m["role"] for m in msgs]
tc = sum(1 for r in roles if r == "tool_call")
print(f"Row {idx} | config={cfg} | {len(msgs)} turns | {tc} tool_calls")
print(f"Roles: {' -> '.join(roles[:20])}{'...' if len(roles)>20 else ''}\n")
# ── Validation 1: FSM transitions
bad = [(j, roles[j], roles[j+1]) for j in range(len(roles)-1)
if roles[j+1] not in VALID_NEXT.get(roles[j], set())]
if bad:
print(f"!! FSM VIOLATIONS: {len(bad)}")
for pos, a, b in bad[:5]:
print(f" [{pos}] {a} -> {b}")
else:
print("✓ FSM transitions: all valid")
# ── Validation 2: content tags
tag_errors = []
for i, t in enumerate(msgs):
r, c = t["role"], t["content"]
if r == "reasoning":
if not re.search(r'<think>.+</think>', c, re.DOTALL):
tag_errors.append((i, r, "empty <think>"))
elif r == "tool_call":
if not re.search(r'<tool_call>.+</tool_call>', c, re.DOTALL):
tag_errors.append((i, r, "empty <tool_call>"))
else:
for m in re.finditer(r'<tool_call>\s*(\{.*?\})\s*</tool_call>', c, re.DOTALL):
try:
obj = json.loads(m.group(1))
if "name" not in obj or "arguments" not in obj:
tag_errors.append((i, r, "missing name/arguments"))
except json.JSONDecodeError as e:
tag_errors.append((i, r, f"invalid JSON: {e}"))
elif r == "answer":
if not re.search(r'<answer>.+</answer>', c, re.DOTALL):
tag_errors.append((i, r, "empty <answer>"))
elif r == "tool_output":
if not re.search(r'<tool_response>.+</tool_response>', c, re.DOTALL):
tag_errors.append((i, r, "empty <tool_response>"))
if tag_errors:
print(f"!! TAG ERRORS: {len(tag_errors)}")
for pos, role, err in tag_errors[:5]:
print(f" [{pos}] {role}: {err}")
else:
print("✓ Content tags: all valid")
# ── Validation 3: structure checks
checks = []
if roles[0] != "system":
checks.append("first role is not system")
if roles[1] != "user":
checks.append("second role is not user")
if roles[-1] != "answer":
checks.append(f"last role is {roles[-1]}, expected answer")
if any(roles[i] == roles[i+1] for i in range(len(roles)-1)):
dupes = [(i, roles[i]) for i in range(len(roles)-1) if roles[i] == roles[i+1]]
checks.append(f"consecutive same-role at {dupes[0]}")
if checks:
print(f"!! STRUCTURE ISSUES: {len(checks)}")
for c in checks:
print(f" {c}")
else:
print("✓ Structure: system→user→...→answer, no consecutive duplicates")
# ── Coupled tool call stats
coupled = sum(1 for t in msgs if t["role"] == "tool_call" and t["content"].count("<tool_call>") > 1)
single = sum(1 for t in msgs if t["role"] == "tool_call" and t["content"].count("<tool_call>") == 1)
print(f"\nTool call turns: {single} single, {coupled} coupled")
# ── Print turns
print(f"\n{'='*70}")
print(f"FULL CONVERSATION ({len(msgs)} turns)")
print(f"{'='*70}\n")
for i, m in enumerate(msgs):
content = m["content"]
if m["role"] == "system":
content = content[:200] + "..."
elif len(content) > 300:
content = content[:300] + "..."
print(f"[{i}] {m['role']}:\n{content}\n")
```
Use this for **structure learning** in mid-training, then fine-tune with reasoning-rich datasets for final SFT. ss
提供机构:
AmanPriyanshu



