CathleenTico/Nemotron-Terminal-Corpus2

Name: CathleenTico/Nemotron-Terminal-Corpus2
Creator: CathleenTico
Published: 2026-03-21 13:40:11
License: cc-by-4.0

Source: Hugging Face
Updated: 2026-03-21
Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/CathleenTico/Nemotron-Terminal-Corpus2

Description:

---
license: cc-by-4.0
task_categories:
- question-answering
language:
- en
tags:
- code
size_categories:
- 100K<n<1M
configs:
- config_name: dataset_adapters
  data_files:
  - split: train
    path: "dataset_adapters/*.parquet"
- config_name: skill_based_easy
  data_files:
  - split: train
    path: "synthetic_tasks/skill_based/easy/*/data_filtered.parquet"
- config_name: skill_based_medium
  data_files:
  - split: train
    path: "synthetic_tasks/skill_based/medium/*/data_filtered.parquet"
- config_name: skill_based_mixed
  data_files:
  - split: train
    path: "synthetic_tasks/skill_based/mixed/*/data_filtered.parquet"
---

# Terminal-Corpus: Large-Scale SFT Dataset for Terminal Agents

Terminal-Corpus is a large-scale Supervised Fine-Tuning (SFT) dataset designed to scale the terminal interaction capabilities of Large Language Models (LLMs). Developed by NVIDIA, this dataset was built using the **Terminal-Task-Gen** pipeline, which combines dataset adaptation with synthetic task generation across diverse domains.

## 🚀 Key Results & Performance

The high-quality trajectories in Terminal-Corpus enable models of various sizes to achieve performance that rivals or exceeds much larger frontier models on the **Terminal-Bench 2.0** benchmark.

### 1. Overall Performance Comparison

Training on Terminal-Corpus yields substantial gains across the Qwen3 model family:

| Model Size | Base Model (Qwen3) Accuracy | Nemotron-Terminal Accuracy | Improvement |
| :--- | :---: | :---: | :---: |
| **8B** | 2.5% ± 0.5 | **13.0% ± 2.2** | ~5.2x |
| **14B** | 4.0% ± 1.3 | **20.2% ± 2.7** | ~5.0x |
| **32B** | 3.4% ± 1.6 | **27.4% ± 2.4** | ~8.0x |

The **Nemotron-Terminal-32B** (27.4%) outperforms the 480B-parameter **Qwen3-Coder** (23.9%) and **Gemini 2.5 Flash** (16.9%). **Nemotron-Terminal-14B** (20.2%) achieves higher accuracy than the 120B **GPT-OSS (high)** (18.7%).

### 2. Domain-Specific Breakthroughs

The dataset unlocks functional utility in complex domains where base models previously showed near-zero capability:

| Category | Qwen3-32B (Base) | Nemotron-Terminal-32B |
| :--- | :---: | :---: |
| **Data Querying** | 0.0% | **60.0%** |
| **Model Training** | 0.0% | **50.0%** |
| **Data Processing** | 5.0% | **50.0%** |
| **Debugging** | 0.0% | **33.3%** |
| **Software Engineering** | 5.0% | **31.7%** |

## 📂 Dataset Composition

The released dataset contains approximately 366k high-quality execution trajectories split into two major streams:

* **Dataset Adapters (~226k samples)**: Transformations of high-quality Math, Code, and Software Engineering (SWE) datasets into terminal-based formats.
* **Skill-based Synthetic Tasks (~140k samples)**: Novel tasks generated from a structured taxonomy of primitive terminal skills.

## 📜 Citation

If you use this dataset in your research, please cite the following work:

```bibtex
@misc{pi2026dataengineeringscalingllm,
  title={On Data Engineering for Scaling LLM Terminal Capabilities},
  author={Renjie Pi and Grace Lam and Mohammad Shoeybi and Pooya Jannaty and Bryan Catanzaro and Wei Ping},
  year={2026},
  eprint={2602.21193},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.21193},
}
```
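The `configs` entries in the dataset card's front matter map four config names to Parquet file globs. The following is a minimal sketch of how those configs might be used, not an official loader. It assumes the Hugging Face `datasets` library is installed and that the repository id shown on this page is still reachable. The helper names `config_for_path` and `load_config` are hypothetical, and the file paths in the usage note are made up for illustration.

```python
from fnmatch import fnmatchcase

# Config name -> data file glob, copied from the dataset card front matter.
CONFIG_GLOBS = {
    "dataset_adapters": "dataset_adapters/*.parquet",
    "skill_based_easy": "synthetic_tasks/skill_based/easy/*/data_filtered.parquet",
    "skill_based_medium": "synthetic_tasks/skill_based/medium/*/data_filtered.parquet",
    "skill_based_mixed": "synthetic_tasks/skill_based/mixed/*/data_filtered.parquet",
}

# Repository id as listed on this page (assumption: still available).
REPO_ID = "CathleenTico/Nemotron-Terminal-Corpus2"


def config_for_path(path):
    """Return the name of the first config whose glob matches `path`, or None.

    fnmatchcase lets `*` match `/` as well, which is looser than the Hub's
    own glob resolution, so treat the result as an approximation.
    """
    for name, pattern in CONFIG_GLOBS.items():
        if fnmatchcase(path, pattern):
            return name
    return None


def load_config(name, streaming=True):
    """Load the train split of one config (requires network access)."""
    if name not in CONFIG_GLOBS:
        raise ValueError(f"unknown config: {name!r}")
    # Third-party dependency, imported lazily so the helpers above
    # work without it.
    from datasets import load_dataset
    return load_dataset(REPO_ID, name, split="train", streaming=streaming)
```

For example, `config_for_path("dataset_adapters/part-0.parquet")` returns `"dataset_adapters"`, and `load_config("skill_based_easy")` would stream the easy synthetic tasks without downloading every Parquet file first. Streaming is chosen as the default here only because the card reports roughly 366k trajectories in total.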

