AmanPriyanshu/tool-reasoning-sft-RESEARCH-OpenHands-CodeScout_Training_Rollouts

Name: AmanPriyanshu/tool-reasoning-sft-RESEARCH-OpenHands-CodeScout_Training_Rollouts
Creator: AmanPriyanshu
Published: 2026-03-24 20:44:57
License: not provided

Hugging Face · Updated 2026-03-24 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/AmanPriyanshu/tool-reasoning-sft-RESEARCH-OpenHands-CodeScout_Training_Rollouts

Description:

---
task_categories:
- text-generation
language:
- en
tags:
- reasoning
- tool-calling
- agentic
- multi-turn
- code-localization
- bash-tools
- mid-training
- reinforcement-learning
- software-engineering
- SWE-Bench
size_categories:
- 10K<n<100K
---

# CodeScout Training Rollouts: Cleaned & Rectified

~40K multi-turn code localization agent trajectories converted into a strict reasoning + tool-call format with validated FSM transitions. Supports coupled (parallel) tool calls.

> **⚠️ Mid-training dataset.** This dataset contains synthesized reasoning templates (not native chain-of-thought). It is suitable for **mid-training** to teach tool-use mechanics, FSM structure, and bash exploration patterns. It is **not recommended as a final SFT dataset** for reasoning capabilities; use datasets with native `<think>` blocks (e.g., REDSearcher, OpenSeeker) for that purpose.

## Origin

Derived from [OpenHands/CodeScout_Training_Rollouts](https://huggingface.co/datasets/OpenHands/CodeScout_Training_Rollouts) (both `CodeScout_14B` and `CodeScout_4B` configs).

### About CodeScout

CodeScout is a family of open-source RL-trained code search agents that achieve **state-of-the-art repository-level code localization** on SWE-Bench using nothing more than a standard Unix terminal: no static analysis, no repository graphs, no language-specific tooling. The models are trained with **GSPO (Group Sequence Policy Optimization)** using multi-level F1 rewards at the file, module, and function level.

**Key results (SWE-Bench Verified):**

| Model | File F1 | Function F1 |
|---|---|---|
| CodeScout-14B | 68.57 | 40.32 |
| CodeScout-4B | 68.52 | 36.78 |
| CodeScout-1.7B | 55.46 | 28.22 |

CodeScout-14B outperforms 2–18× larger base and post-trained LLMs across all benchmarks, surpasses GPT-5 and approaches Claude Sonnet 4.5 using RepoNavigator, and achieves 8–33% higher function-level F1 than Qwen3-32B (Thinking).
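For readers unfamiliar with F1 as a localization score, the following is a minimal sketch of a set-based F1 between predicted and gold locations. It is an illustration only: `set_f1` is a hypothetical helper, the zero score for empty sets is an assumption, and how CodeScout weights or combines the file, module, and function levels is not described here.

```python
def set_f1(predicted, gold):
    """Set-based F1 between predicted and gold locations (hypothetical helper)."""
    predicted, gold = set(predicted), set(gold)
    # Assumption: an empty prediction or empty gold set scores 0.
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# File level: locations are file paths.
file_f1 = set_f1(["src/a.py", "src/b.py"], ["src/a.py"])

# Function level: locations are (file, function) pairs.
function_f1 = set_f1(
    [("src/a.py", "parse"), ("src/a.py", "load")],
    [("src/a.py", "parse")],
)

print(f"file F1={file_f1:.3f}, function F1={function_f1:.3f}")
```

Function-level scores are typically lower than file-level ones because the predicted pair must match both the file and the function name, which is consistent with the gap in the table above.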
The training rollouts in this dataset come from 9,600 SWE-Smith instances across 128 repositories, with 4 rollouts per instance and up to 4 turns per episode.

📄 **Paper:** [CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents](https://arxiv.org/abs/2603.17829)

💻 **Code:** [OpenHands](https://github.com/All-Hands-AI/OpenHands)

🤗 **Models:** [CodeScout-14B](https://huggingface.co/OpenHands/CodeScout-14B) · [CodeScout-4B](https://huggingface.co/OpenHands/CodeScout-4B) · [CodeScout-1.7B](https://huggingface.co/OpenHands/CodeScout-1.7B)

## Format

Each row contains a structured multi-turn conversation with synthesized reasoning traces and validated tool calls.

### Message Roles

| Role | Content |
|---|---|
| `system` | Tool-use protocol + JSON tool schemas + code localization instructions |
| `user` | Issue description + repository path |
| `reasoning` | `<think>…</think>`: synthesized reasoning (template-based, not native) |
| `tool_call` | `<tool_call>{"name": "...", "arguments": {...}}</tool_call>`: one or more per turn |
| `tool_output` | `<tool_response>…</tool_response>`: one per tool_call, in matching order |
| `answer` | `<answer>…</answer>`: final confirmation |

### Coupled Tool Calls

This dataset introduces **coupled (parallel) tool calls**: a single `tool_call` turn may contain multiple `<tool_call>` blocks, with corresponding multiple `<tool_response>` blocks in the following `tool_output` turn:

```
tool_call:
<tool_call>{"name": "terminal", "arguments": {"command": "rg 'pattern1'"}}</tool_call>
<tool_call>{"name": "terminal", "arguments": {"command": "find . -name '*.py'"}}</tool_call>

tool_output:
<tool_response>...results for pattern1...</tool_response>
<tool_response>...results for find...</tool_response>
```

Roughly 60% of multi-call assistant messages are kept coupled; 40% are split into sequential turns with bridge reasoning.
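Because responses are matched to calls by position, a consumer can pair them with two regular expressions. The sketch below assumes that positional matching and that response bodies never contain a literal `</tool_response>`; `pair_coupled_turn` is a hypothetical helper, not part of the dataset tooling.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
TOOL_RESPONSE_RE = re.compile(r"<tool_response>(.*?)</tool_response>", re.DOTALL)


def pair_coupled_turn(tool_call_content, tool_output_content):
    """Pair each <tool_call> with the <tool_response> at the same position."""
    calls = [json.loads(m.group(1)) for m in TOOL_CALL_RE.finditer(tool_call_content)]
    responses = [m.group(1) for m in TOOL_RESPONSE_RE.finditer(tool_output_content)]
    if len(calls) != len(responses):
        raise ValueError(f"{len(calls)} calls but {len(responses)} responses")
    return list(zip(calls, responses))


# Build a coupled turn shaped like the example above (response text is made up).
demo_calls = [
    {"name": "terminal", "arguments": {"command": "rg 'pattern1'"}},
    {"name": "terminal", "arguments": {"command": "find . -name '*.py'"}},
]
example_call = "\n".join(f"<tool_call>{json.dumps(c)}</tool_call>" for c in demo_calls)
example_output = (
    "<tool_response>match in a.py</tool_response>\n"
    "<tool_response>./a.py</tool_response>"
)

pairs = pair_coupled_turn(example_call, example_output)
for call, response in pairs:
    print(call["arguments"]["command"], "->", response)
```

Raising on a count mismatch, rather than truncating with `zip`, makes malformed turns visible instead of silently dropping a response.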
### Trajectory Structure

```
system → user → reasoning → [tool_call → tool_output → reasoning →]* answer
```

## Schema

Single Parquet file with zstd compression.

| Column | Type | Description |
|---|---|---|
| `messages` | string | Converted conversation (JSON list of `{role, content}`) |
| `config` | string | Source config: `CodeScout_14B` or `CodeScout_4B` |

## Tools

2 tools available per trajectory:

| Tool | Description |
|---|---|
| `terminal` | Execute bash commands (`rg`, `grep`, `find`, `cat`, `sed`, `head`, `tail`, `wc`) in a persistent shell session |
| `localization_finish` | Submit structured code localization results (`file`, `class_name`, `function_name`) |

## Filtering

From the original 54,845 rows (39,040 from CodeScout-14B + 15,805 from CodeScout-4B):

- Only rows with **total reward > 0** retained (~72-77% of each config)
- Zero-reward rollouts (completely failed localization attempts) dropped
- Reward is multi-level F1 across file, module, entity, and multiturn components

## Conversion Details

- Source uses OpenAI function-calling format with `content` as list of `{type, text}` blocks, flattened to plain strings
- Assistant messages have `content: null` + `tool_calls`, so **all reasoning is synthesized** from 4 domain-appropriate template pools (12 variations each): initial exploration, bridge reasoning, final reasoning, tail answers
- Parallel tool calls (up to 5 per assistant message) randomly coupled (60%) or split (40%) into sequential turns
- `localization_finish` tool calls treated as regular tool_call → tool_output cycles
- `tool → tool` source transitions (from parallel responses) consumed into coupled `tool_output` turns
- Conversations ending on `tool` role (299/300 in source) get tail reasoning + answer appended
- Two validation layers: FSM transition check + content-tag non-empty check

## Why Mid-Training Only

This dataset teaches:

- ✅ Correct FSM structure and tool-use syntax
- ✅ Bash tool patterns for code exploration (`rg`, `sed`, `find`, `cat`)
- ✅ Multi-tool orchestration and coupled parallel execution
- ✅ Structured output submission patterns
- ✅ Repository navigation strategies

But lacks:

- ❌ Native chain-of-thought reasoning (all `<think>` blocks are templates)
- ❌ Genuine analysis of tool outputs
- ❌ Hypothesis formation and evaluation

## Usage

```python
import json, random, re
from datasets import load_dataset

# Allowed role transitions (the FSM from "Trajectory Structure")
VALID_NEXT = {
    "system": {"user"},
    "user": {"reasoning"},
    "reasoning": {"tool_call", "answer"},
    "tool_call": {"tool_output"},
    "tool_output": {"reasoning"},
    "answer": {"user"},
}

ds = load_dataset("AmanPriyanshu/tool-reasoning-sft-RESEARCH-OpenHands-CodeScout_Training_Rollouts", split="train")
print(f"Loaded: {len(ds):,} rows\n")

# Pick one random row and decode its JSON-encoded conversation
idx = random.randint(0, len(ds) - 1)
row = ds[idx]
msgs = json.loads(row["messages"])
cfg = row["config"]
roles = [m["role"] for m in msgs]
tc = sum(1 for r in roles if r == "tool_call")
print(f"Row {idx} | config={cfg} | {len(msgs)} turns | {tc} tool_calls")
print(f"Roles: {' -> '.join(roles[:20])}{'...' if len(roles)>20 else ''}\n")

# ── Validation 1: FSM transitions
bad = [(j, roles[j], roles[j+1]) for j in range(len(roles)-1)
       if roles[j+1] not in VALID_NEXT.get(roles[j], set())]
if bad:
    print(f"!! FSM VIOLATIONS: {len(bad)}")
    for pos, a, b in bad[:5]:
        print(f"  [{pos}] {a} -> {b}")
else:
    print("✓ FSM transitions: all valid")

# ── Validation 2: content tags
tag_errors = []
for i, t in enumerate(msgs):
    r, c = t["role"], t["content"]
    if r == "reasoning":
        if not re.search(r'<think>.+</think>', c, re.DOTALL):
            tag_errors.append((i, r, "empty <think>"))
    elif r == "tool_call":
        if not re.search(r'<tool_call>.+</tool_call>', c, re.DOTALL):
            tag_errors.append((i, r, "empty <tool_call>"))
        else:
            # Every <tool_call> block must hold JSON with "name" and "arguments"
            for m in re.finditer(r'<tool_call>\s*(\{.*?\})\s*</tool_call>', c, re.DOTALL):
                try:
                    obj = json.loads(m.group(1))
                    if "name" not in obj or "arguments" not in obj:
                        tag_errors.append((i, r, "missing name/arguments"))
                except json.JSONDecodeError as e:
                    tag_errors.append((i, r, f"invalid JSON: {e}"))
    elif r == "answer":
        if not re.search(r'<answer>.+</answer>', c, re.DOTALL):
            tag_errors.append((i, r, "empty <answer>"))
    elif r == "tool_output":
        if not re.search(r'<tool_response>.+</tool_response>', c, re.DOTALL):
            tag_errors.append((i, r, "empty <tool_response>"))
if tag_errors:
    print(f"!! TAG ERRORS: {len(tag_errors)}")
    for pos, role, err in tag_errors[:5]:
        print(f"  [{pos}] {role}: {err}")
else:
    print("✓ Content tags: all valid")

# ── Validation 3: structure checks
checks = []
if roles[0] != "system":
    checks.append("first role is not system")
if roles[1] != "user":
    checks.append("second role is not user")
if roles[-1] != "answer":
    checks.append(f"last role is {roles[-1]}, expected answer")
if any(roles[i] == roles[i+1] for i in range(len(roles)-1)):
    dupes = [(i, roles[i]) for i in range(len(roles)-1) if roles[i] == roles[i+1]]
    checks.append(f"consecutive same-role at {dupes[0]}")
if checks:
    print(f"!! STRUCTURE ISSUES: {len(checks)}")
    for c in checks:
        print(f"  {c}")
else:
    print("✓ Structure: system→user→...→answer, no consecutive duplicates")

# ── Coupled tool call stats
coupled = sum(1 for t in msgs if t["role"] == "tool_call" and t["content"].count("<tool_call>") > 1)
single = sum(1 for t in msgs if t["role"] == "tool_call" and t["content"].count("<tool_call>") == 1)
print(f"\nTool call turns: {single} single, {coupled} coupled")

# ── Print turns (system prompt and long turns are truncated)
print(f"\n{'='*70}")
print(f"FULL CONVERSATION ({len(msgs)} turns)")
print(f"{'='*70}\n")
for i, m in enumerate(msgs):
    content = m["content"]
    if m["role"] == "system":
        content = content[:200] + "..."
    elif len(content) > 300:
        content = content[:300] + "..."
    print(f"[{i}] {m['role']}:\n{content}\n")
```

Use this for **structure learning** in mid-training, then fine-tune with reasoning-rich datasets for final SFT.
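The couple-or-split step listed under Conversion Details can be sketched as follows. This is an illustration reconstructed from the description alone, not the author's conversion script: `couple_or_split`, `BRIDGE_TEMPLATES`, and the `p_couple` parameter are hypothetical names, and the two template strings stand in for the real 12-variation pool.

```python
import json
import random

# Hypothetical stand-ins for the bridge-reasoning template pool.
BRIDGE_TEMPLATES = [
    "The previous command gave partial context; running the next search.",
    "Continuing the exploration with a follow-up command.",
]


def wrap_call(call):
    return f"<tool_call>{json.dumps(call)}</tool_call>"


def wrap_response(text):
    return f"<tool_response>{text}</tool_response>"


def couple_or_split(calls, responses, rng, p_couple=0.6):
    """Convert one parallel assistant message into dataset-style turns."""
    if len(calls) != len(responses):
        raise ValueError("each tool call needs exactly one response")
    # Coupled: one tool_call turn, one tool_output turn, same ordering.
    if len(calls) == 1 or rng.random() < p_couple:
        return [
            {"role": "tool_call", "content": "\n".join(map(wrap_call, calls))},
            {"role": "tool_output", "content": "\n".join(map(wrap_response, responses))},
        ]
    # Split: sequential cycles joined by bridge reasoning so the FSM stays valid.
    turns = []
    for i, (call, response) in enumerate(zip(calls, responses)):
        if i > 0:
            bridge = rng.choice(BRIDGE_TEMPLATES)
            turns.append({"role": "reasoning", "content": f"<think>{bridge}</think>"})
        turns.append({"role": "tool_call", "content": wrap_call(call)})
        turns.append({"role": "tool_output", "content": wrap_response(response)})
    return turns


rng = random.Random(0)
parallel_calls = [
    {"name": "terminal", "arguments": {"command": "rg 'pattern1'"}},
    {"name": "terminal", "arguments": {"command": "find . -name '*.py'"}},
]
turns = couple_or_split(parallel_calls, ["hit in a.py", "./a.py"], rng)
print(" -> ".join(t["role"] for t in turns))
```

Either branch ends on `tool_output`, so the next turn in the trajectory is always `reasoning`, matching the `VALID_NEXT` table in the Usage script.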

Provider:

AmanPriyanshu
