five

AmanPriyanshu/tool-reasoning-sft-CODING-CoderForge-Preview-data-cleaned-rectified

收藏
Hugging Face2026-03-03 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/AmanPriyanshu/tool-reasoning-sft-CODING-CoderForge-Preview-data-cleaned-rectified
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: odc-by task_categories: - text-generation language: - en tags: - code - agent - tool-use - reasoning - sft - multi-turn - software-engineering size_categories: - 100K<n<1M --- # CoderForge-Preview — Cleaned & Rectified Cleaned and restructured version of [togethercomputer/CoderForge-Preview](https://huggingface.co/datasets/togethercomputer/CoderForge-Preview). The original dataset contains ~413K long-horizon coding agent trajectories generated by OpenHands across open-source repositories. This version converts the OpenAI function-call format into a strict multi-turn conversation structure with explicit reasoning traces, validated JSON tool calls, and proper role transitions. Original Dataset: [togethercomputer/CoderForge-Preview](https://huggingface.co/datasets/togethercomputer/CoderForge-Preview) Format Inspiration: [SupritiVijay/dr-tulu-sft-deep-research-agent-data-cleaned-rectified](https://huggingface.co/datasets/SupritiVijay/dr-tulu-sft-deep-research-agent-data-cleaned-rectified) ## What Changed ### Original Format (OpenAI Function Calls) ``` - system: [OpenHands agent prompt] - user: [task description] - assistant: [text + tool_calls: [{function: {name: "think", arguments: ...}}, {function: {name: "execute_bash", arguments: ...}}]] - tool: [tool response for execute_bash] - tool: [tool response for think] - assistant: [text + tool_calls: [{function: {name: "str_replace_editor", arguments: ...}}]] - tool: [tool response] - ... - assistant: [tool_calls: [{function: {name: "finish", arguments: {message: "..."}}}]] ``` ### New Format (Multi-Turn with Reasoning) ``` - system: Updated system prompt with JSON tool schemas + protocol - user: Task description - reasoning: <think>...</think> - tool_call: <tool_call>{"name": "execute_bash", "arguments": {...}}</tool_call> - tool_output: <tool_response>...</tool_response> - reasoning: <think>...</think> - tool_call: <tool_call>{"name": "str_replace_editor", "arguments": {...}}</tool_call> - tool_output: <tool_response>...</tool_response> - reasoning: <think>...</think> - answer: <answer>...</answer> ``` ## Distribution | Split | Rows | % | |---|---|---| | filtered_reward1 | 147,256 | 39.07% | | SWE_Smith | 137,470 | 36.47% | | SWE_Rebench | 61,719 | 16.37% | | R2E_Gym | 30,493 | 8.09% | ## Filtering - **License filtering**: Dropped `BSD-4-Clause`, bare `BSD`, and `MIT-CMU` licenses (~2.6% of rows) - **Transition validation**: Rows with invalid role transitions dropped - **Ending validation**: Rows not ending with `answer` role dropped (e.g., timed-out trajectories) - **JSON validation**: Rows with malformed tool call JSON dropped - All four original splits (`SWE_Rebench`, `SWE_Smith`, `R2E_Gym`, `filtered_reward1`) merged; original split preserved in `split_name` column ## License | License | Rows | % | |---|---|---| | MIT | 179,754 | 47.69% | | BSD-3-Clause | 97,433 | 25.85% | | Apache-2.0 | 55,884 | 14.83% | | BSD-2-Clause | 28,617 | 7.59% | | HPND | 8,111 | 2.15% | | MIT AND Apache-2.0 | 3,310 | 0.88% | | BSD-3-Clause AND MIT | 2,028 | 0.54% | | ISC | 827 | 0.22% | | PSF-2.0 | 743 | 0.20% | | CC0-1.0 | 167 | 0.04% | | PostgreSQL | 32 | 0.01% | | MIT-0 | 32 | 0.01% | ## Usage ``` import json, random from collections import Counter from datasets import load_dataset from tqdm import tqdm REPO = "AmanPriyanshu/CoderForge-Preview-data-cleaned-rectified" print(f"Loading {REPO}...") ds = load_dataset(REPO, split="train") total = len(ds) print(f"Total rows: {total:,}\n") idx = random.randint(0, total - 1) row = ds[idx] print(f"{'='*80}") print(f"RANDOM ROW (idx={idx})") print(f"{'='*80}") for k, v in row.items(): if k == "messages": msgs = json.loads(v) print(f"\nmessages: ({len(msgs)} turns)") print(json.dumps(msgs, indent=4, ensure_ascii=False)) else: s = str(v) print(f"\n{k}: {s[:500]}{'...' if len(s) > 500 else ''}") print(f"\n{'='*80}\n") ``` ## Citation ```bibtex @misc{priyanshu2026coderforgecleaned, title={{CoderForge-Preview: Cleaned \& Rectified}}, author={Priyanshu, Aman}, year={2026}, howpublished={\url{https://huggingface.co/datasets/AmanPriyanshu/CoderForge-Preview-data-cleaned-rectified}} } ```
提供机构:
AmanPriyanshu
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作