five

AmanPriyanshu/tool-reasoning-sft-CODING-Nemotron-Terminal-Corpus-data-cleaned-rectified

收藏
Hugging Face2026-03-03 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/AmanPriyanshu/tool-reasoning-sft-CODING-Nemotron-Terminal-Corpus-data-cleaned-rectified
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-4.0 task_categories: - text-generation language: - en tags: - terminal - agent - tool-use - reasoning - sft - multi-turn - code - math - software-engineering size_categories: - 100K<n<1M --- # Nemotron-Terminal-Corpus — Cleaned & Rectified Cleaned and restructured version of [nvidia/Nemotron-Terminal-Corpus](https://huggingface.co/datasets/nvidia/Nemotron-Terminal-Corpus). The original dataset contains ~366K terminal agent trajectories built by NVIDIA using the Terminal-Task-Gen pipeline across math, code, SWE, and synthetic skill-based domains. This version converts the JSON-action format into a strict multi-turn conversation structure with explicit reasoning traces, validated JSON tool calls, and proper role transitions. Original Dataset: [nvidia/Nemotron-Terminal-Corpus](https://huggingface.co/datasets/nvidia/Nemotron-Terminal-Corpus) ## What Changed ### Original Format (JSON Actions) ``` - user: [system prompt + task description + terminal state] - assistant: <think>...</think> {"analysis": "...", "plan": "...", "commands": [...], "task_complete": false} - user: [terminal output] - assistant: <think>...</think> {"analysis": "...", "plan": "...", "commands": [...], "task_complete": true} ``` ### New Format (Multi-Turn with Reasoning) ``` - system: System prompt with tool-use protocol + execute_commands schema - user: Task description + terminal state - reasoning: <think>analysis + plan + thinking</think> - tool_call: <tool_call>{"name": "execute_commands", "arguments": {"commands": [...]}}</tool_call> - tool_output: <tool_response>terminal output</tool_response> - reasoning: <think>...</think> - ... - answer: <answer>final summary</answer> ``` ## Files | File | Contents | Split Values | |---|---|---| | `dataset_adapters.parquet` | Math, Code, SWE adapter trajectories | `dataset_adapters` | | `skill.parquet` | Synthetic skill-based tasks | `easy`, `medium`, `mixed` | ## Message Roles | Role | Content | |---|---| | `system` | Terminal agent instructions + tool-use protocol + execute_commands schema | | `user` | Task description + initial terminal state | | `reasoning` | `<think>…</think>` — analysis, plan, and chain-of-thought | | `tool_call` | `<tool_call>{"name": "execute_commands", "arguments": {"commands": [...]}}</tool_call>` | | `tool_output` | `<tool_response>…</tool_response>` — terminal output | | `answer` | `<answer>…</answer>` — final task summary | ## License CC-BY-4.0 (same as original dataset). ## Citation ```bibtex @misc{pi2026dataengineeringscalingllm, title={On Data Engineering for Scaling LLM Terminal Capabilities}, author={Renjie Pi and Grace Lam and Mohammad Shoeybi and Pooya Jannaty and Bryan Catanzaro and Wei Ping}, year={2026}, eprint={2602.21193}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2602.21193}, } ```
提供机构:
AmanPriyanshu
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作