DCAgent2/swebench_verified_NVIDIA_Nemotron_3_Nano_30B_A3B_BF16_20260427_232134-traces

Name: DCAgent2/swebench_verified_NVIDIA_Nemotron_3_Nano_30B_A3B_BF16_20260427_232134-traces
Creator: DCAgent2
Published: 2026-04-30 07:48:18
License: not specified

Source: Hugging Face | Updated: 2026-04-30 | Indexed: 2026-05-03

Download link:

https://hf-mirror.com/datasets/DCAgent2/swebench_verified_NVIDIA_Nemotron_3_Nano_30B_A3B_BF16_20260427_232134-traces


Resource overview:

---
dataset_info:
  features:
  - name: conversations
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: agent
    dtype: string
  - name: model
    dtype: string
  - name: model_provider
    dtype: string
  - name: date
    dtype: string
  - name: task
    dtype: string
  - name: episode
    dtype: string
  - name: run_id
    dtype: string
  - name: trial_name
    dtype: string
  - name: tool_definitions
    list:
    - name: function
      struct:
      - name: description
        dtype: string
      - name: name
        dtype: string
      - name: parameters
        struct:
        - name: additionalProperties
          dtype: bool
        - name: properties
          struct:
          - name: code
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: command
            struct:
            - name: description
              dtype: string
            - name: enum
              list: string
            - name: type
              dtype: string
          - name: file_text
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: insert_line
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: is_input
            struct:
            - name: description
              dtype: string
            - name: enum
              list: string
            - name: type
              dtype: string
          - name: message
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: new_str
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: old_str
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: path
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: security_risk
            struct:
            - name: description
              dtype: string
            - name: enum
              list: string
            - name: type
              dtype: string
          - name: task_list
            struct:
            - name: description
              dtype: string
            - name: items
              struct:
              - name: additionalProperties
                dtype: bool
              - name: properties
                struct:
                - name: id
                  struct:
                  - name: description
                    dtype: string
                  - name: type
                    dtype: string
                - name: notes
                  struct:
                  - name: description
                    dtype: string
                  - name: type
                    dtype: string
                - name: status
                  struct:
                  - name: description
                    dtype: string
                  - name: enum
                    list: string
                  - name: type
                    dtype: string
                - name: title
                  struct:
                  - name: description
                    dtype: string
                  - name: type
                    dtype: string
              - name: required
                list: string
              - name: type
                dtype: string
            - name: type
              dtype: string
          - name: thought
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: timeout
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: view_range
            struct:
            - name: description
              dtype: string
            - name: items
              struct:
              - name: type
                dtype: string
            - name: type
              dtype: string
        - name: required
          list: string
        - name: type
          dtype: string
    - name: type
      dtype: string
  - name: result
    dtype: string
  - name: verifier_output
    dtype: string
  splits:
  - name: train
    num_bytes: 573177392
    num_examples: 1469
  download_size: 513649416
  dataset_size: 573177392
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

Provider:

DCAgent2

Collected summary

Dataset introduction

Construction

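The dataset is built on the SWE-bench Verified benchmark and records the interaction trajectories of the NVIDIA Nemotron 3 Nano 30B A3B BF16 model on software engineering tasks. The data was collected through a reinforcement learning framework that simulates the multi-turn dialogue between an agent and a code environment under a given task instruction. Each sample contains the full dialogue record (conversations), the task description (task), the agent identifier (agent), the model version (model), and run metadata (run_id, trial_name, and so on). The tool definitions (tool_definitions) describe in detail the code-editing, file-operation, and task-management interfaces the model can call, which keeps the interaction structured and reproducible.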
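The top-level fields listed in dataset_info can be written down as a typed record, which is handy for catching missing columns before any analysis. This is a minimal sketch: the names `Turn`, `TraceRecord`, `EXPECTED_FIELDS`, and `missing_fields` are illustrative and not part of the dataset, and `tool_definitions` is left loosely typed because its nested struct is large.

```python
from typing import Any, TypedDict


class Turn(TypedDict):
    """One message in the conversations list."""
    content: str
    role: str


class TraceRecord(TypedDict):
    """Top-level fields of one example, as declared in dataset_info."""
    conversations: list[Turn]
    agent: str
    model: str
    model_provider: str
    date: str
    task: str
    episode: str
    run_id: str
    trial_name: str
    tool_definitions: list[dict[str, Any]]  # nested struct, kept loose here
    result: str
    verifier_output: str


# Hypothetical helper names; only the field names come from the dataset card.
EXPECTED_FIELDS = frozenset(TraceRecord.__annotations__)


def missing_fields(record: dict[str, Any]) -> set[str]:
    """Return the declared top-level fields that are absent from a record."""
    return set(EXPECTED_FIELDS) - set(record)
```

Calling `missing_fields(row)` on a loaded row should return an empty set if the row matches the declared schema.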

Usage

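The dataset is suited to fine-tuning code-generation agents or evaluating model performance on complex software engineering tasks. Load the train split with the HuggingFace Datasets library, parse the conversations field to obtain the multi-turn dialogue history, and use tool_definitions to reconstruct the model's decision space. Researchers can filter successful and failed cases by the result field, or analyze the model's error patterns with verifier_output. Using the agent and model fields as grouping variables is recommended for cross-strategy or cross-model comparisons.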
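The workflow above (load the train split, filter by result, group by agent and model) might look roughly like the sketch below. The listing does not document which strings the result field contains, so the success test is passed in by the caller rather than hard-coded. The helper names `load_traces`, `split_by_result`, and `group_by` are illustrative, not part of any library.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable

DATASET_ID = (
    "DCAgent2/swebench_verified_NVIDIA_Nemotron_3_Nano_30B_A3B_BF16_"
    "20260427_232134-traces"
)


def load_traces():
    """Load the train split (needs the `datasets` package and network access)."""
    from datasets import load_dataset  # lazy import so the helpers work offline
    return load_dataset(DATASET_ID, split="train")


def split_by_result(
    records: Iterable[dict[str, Any]],
    is_success: Callable[[str], bool],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Partition records into (successes, failures) using a caller-supplied test.

    The encoding of `result` is an assumption left to the caller.
    """
    successes, failures = [], []
    for record in records:
        (successes if is_success(record["result"]) else failures).append(record)
    return successes, failures


def group_by(
    records: Iterable[dict[str, Any]],
    keys: tuple[str, ...] = ("agent", "model"),
) -> dict[tuple, list[dict[str, Any]]]:
    """Group records by the given fields, e.g. for cross-model comparison."""
    groups: dict[tuple, list[dict[str, Any]]] = defaultdict(list)
    for record in records:
        groups[tuple(record[key] for key in keys)].append(record)
    return dict(groups)
```

A typical call would be `ok, bad = split_by_result(load_traces(), my_predicate)`, where `my_predicate` is written after inspecting a few actual result values.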

Background and challenges

Background overview

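As large language models show increasingly strong capabilities in code generation and software engineering tasks, systematically evaluating and verifying their performance in real software engineering scenarios has become a key challenge. The dataset was published by DCAgent2 in April 2026. Its core research question is how well the NVIDIA Nemotron 3 Nano 30B A3B BF16 model performs multi-turn interactive code repair on SWE-bench Verified. By recording the complete dialogue trajectories, tool-call sequences, and final results across 1,469 examples (the num_examples of the train split) drawn from verified software engineering tasks, it provides a resource for studying agent-based code repair. Its value lies in pairing a sparse mixture-of-experts model with a standardized software engineering benchmark, which gives later model comparisons and reproducibility studies a common base.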

Current challenges

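The core domain challenge the dataset addresses is that traditional code-generation benchmarks often ignore the context dependence and tool-use strategy of multi-step repair, whereas real defect repair requires a model to track code state, execute commands, and make precise modifications over dozens of interaction turns. From a construction perspective, the main challenges are: first, designing a unified and complete tool definition scheme that covers more than ten atomic operations such as code editing, file viewing, and shell execution, so that model behavior can be represented and reproduced precisely; second, making result verification reliable by recording automated verification output in the verifier_output field, which avoids the subjective bias of manual judgment; third, with 1,469 recorded examples, ensuring that injecting trajectories generated by different models does not cause training-data leakage or evaluation bias, so that the evaluation stays fair and generalizable.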
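To see which tools and parameters a given trace exposes, the tool_definitions list can be flattened into a small inventory. The sketch below follows the nesting declared in dataset_info (function, then parameters, then properties). It assumes that, because every tool shares one struct schema, properties a tool does not use are stored as null. That assumption should be checked against real rows. `tool_inventory` is an illustrative name.

```python
from typing import Any


def tool_inventory(tool_definitions: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Map each tool name to the sorted parameter names it actually defines."""
    inventory: dict[str, list[str]] = {}
    for tool in tool_definitions or []:
        function = tool.get("function") or {}
        parameters = function.get("parameters") or {}
        properties = parameters.get("properties") or {}
        # Assumption: a None entry is a placeholder for a parameter that only
        # other tools use, so it is skipped here.
        inventory[function.get("name", "")] = sorted(
            name for name, spec in properties.items() if spec is not None
        )
    return inventory
```

Running this on `row["tool_definitions"]` for a few rows gives a quick check of whether every trace was collected with the same tool set.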

Common scenarios

Typical use cases

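At the intersection of software engineering and artificial intelligence, the dataset focuses on real-world software engineering tasks carried out by agents driven by large language models. Its typical use cases include automated code generation, bug fixing, test writing, and codebase maintenance. Using the recorded dialogue trajectories and tool-call sequences, researchers can reproduce and evaluate agent behavior in complex software projects. The data is especially suited to testing long-context understanding, multi-step reasoning, and accurate calls to external tools.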
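A simple first look at trajectory length, which bears on the long-context and multi-step aspects mentioned above, is to count turns per role and total characters in a conversation. This sketch relies only on the role and content fields declared in the schema. The role names used in any example input are placeholders, and `turn_stats` is an illustrative name.

```python
from collections import Counter
from typing import Any


def turn_stats(conversations: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarize one conversation: turn count, turns per role, total characters."""
    by_role = Counter(turn["role"] for turn in conversations)
    # Treat a missing (None) content as empty rather than failing.
    total_chars = sum(len(turn["content"] or "") for turn in conversations)
    return {
        "turns": len(conversations),
        "by_role": dict(by_role),
        "total_chars": total_chars,
    }
```

Character counts are only a rough proxy for context length; token counts would need the model's own tokenizer.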

Academic problems addressed

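The dataset addresses the lack of high-quality, structured interaction trajectories in automated software engineering research. In earlier work, agents were often evaluated in simplified environments that do not reflect challenges found in real projects, such as dependency management and version conflicts. By providing complete dialogues with detailed tool definitions and execution results, it lets researchers analyze an agent's decision logic during code editing, file operations, and command execution, and supports work on multi-agent collaboration, instruction following, and security-risk avoidance. This improves the reproducibility of research results and provides a data foundation for exploring more robust agent architectures.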

Practical applications

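In practice, the dataset can be used to train and fine-tune AI assistants that cover the full software development workflow. For example, a company could use the data to build an intelligent code review tool that automatically detects security risks and logic flaws in patches, or a bot that autonomously fixes bugs in open-source projects. The task lists and status-tracking information in the data can also support automated task decomposition and progress tracking in project management, improving team efficiency. From these recorded interactions, an AI system can learn to understand requirements, locate problems, and apply changes the way an experienced engineer would.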

Recent research on the dataset