DCAgent2/swebench_verified_random_100_folders_Skywork_SWE_32B_20260424_231641

Name: DCAgent2/swebench_verified_random_100_folders_Skywork_SWE_32B_20260424_231641
Creator: DCAgent2
Published: 2026-04-25 01:26:38
License: not specified

Source: Hugging Face. Updated 2026-04-25; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/DCAgent2/swebench_verified_random_100_folders_Skywork_SWE_32B_20260424_231641

Resource description:

---
dataset_info:
  features:
  - name: conversations
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: agent
    dtype: string
  - name: model
    dtype: string
  - name: model_provider
    dtype: string
  - name: date
    dtype: string
  - name: task
    dtype: string
  - name: episode
    dtype: string
  - name: run_id
    dtype: string
  - name: trial_name
    dtype: string
  - name: tool_definitions
    list:
    - name: function
      struct:
      - name: description
        dtype: string
      - name: name
        dtype: string
      - name: parameters
        struct:
        - name: additionalProperties
          dtype: bool
        - name: properties
          struct:
          - name: code
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: command
            struct:
            - name: description
              dtype: string
            - name: enum
              list: string
            - name: type
              dtype: string
          - name: file_text
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: insert_line
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: is_input
            struct:
            - name: description
              dtype: string
            - name: enum
              list: string
            - name: type
              dtype: string
          - name: message
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: new_str
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: old_str
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: path
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: security_risk
            struct:
            - name: description
              dtype: string
            - name: enum
              list: string
            - name: type
              dtype: string
          - name: task_list
            struct:
            - name: description
              dtype: string
            - name: items
              struct:
              - name: additionalProperties
                dtype: bool
              - name: properties
                struct:
                - name: id
                  struct:
                  - name: description
                    dtype: string
                  - name: type
                    dtype: string
                - name: notes
                  struct:
                  - name: description
                    dtype: string
                  - name: type
                    dtype: string
                - name: status
                  struct:
                  - name: description
                    dtype: string
                  - name: enum
                    list: string
                  - name: type
                    dtype: string
                - name: title
                  struct:
                  - name: description
                    dtype: string
                  - name: type
                    dtype: string
              - name: required
                list: string
              - name: type
                dtype: string
            - name: type
              dtype: string
          - name: thought
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: timeout
            struct:
            - name: description
              dtype: string
            - name: type
              dtype: string
          - name: view_range
            struct:
            - name: description
              dtype: string
            - name: items
              struct:
              - name: type
                dtype: string
            - name: type
              dtype: string
        - name: required
          list: string
        - name: type
          dtype: string
    - name: type
      dtype: string
  - name: result
    dtype: string
  - name: verifier_output
    dtype: string
  splits:
  - name: train
    num_bytes: 67386072
    num_examples: 297
  download_size: 44950056
  dataset_size: 67386072
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

Provider:

DCAgent2

Summary

Dataset Introduction

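Construction

At the intersection of software engineering and artificial intelligence, automated code repair places strict demands on dataset quality. The swebench_verified_random_100_folders_Skywork_SWE_32B_20260424_231641 dataset is derived from SWE-bench Verified: instances come from 100 randomly sampled folders, and the interaction trajectories were generated automatically by the Skywork-SWE-32B model. Each record contains the full agent conversation history, the tool definitions and their parameters (for example code edits, file operations, and task lists), and the final result together with the verifier output, forming structured multi-turn interaction data.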
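The record layout described above can be checked with a small sanity test. This is a minimal sketch, not part of the dataset's tooling: the field names are taken from the dataset_info metadata on this page, while the helper names `missing_fields` and `malformed_turns` are hypothetical and introduced only for illustration. It assumes each record is a plain Python dict and each conversation turn is a dict with `role` and `content` keys.

```python
from typing import Any

# Top-level field names copied from the dataset_info metadata on this page.
EXPECTED_FIELDS = (
    "conversations", "agent", "model", "model_provider", "date", "task",
    "episode", "run_id", "trial_name", "tool_definitions", "result",
    "verifier_output",
)


def missing_fields(record: dict[str, Any]) -> list[str]:
    """Return the expected top-level fields that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def malformed_turns(record: dict[str, Any]) -> list[int]:
    """Return indices of conversation turns whose role or content is not a string."""
    bad = []
    for index, turn in enumerate(record.get("conversations") or []):
        role_ok = isinstance(turn.get("role"), str)
        content_ok = isinstance(turn.get("content"), str)
        if not (role_ok and content_ok):
            bad.append(index)
    return bad
```

A turn flagged by `malformed_turns` is not necessarily an error in the data; null content may be legitimate for some turns, so the output is best treated as a list of rows to inspect by hand.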

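Usage

Researchers can use this dataset to train or evaluate code generation models and software engineering agents. Parsing the multi-turn dialogue in the conversations field together with the function definitions in tool_definitions makes it possible to reconstruct the model's full workflow. The result and verifier_output fields, both stored as strings, record the final outcome and the automated verification signal, so they can serve as labels for binary classification or as signals for reward modeling in supervised settings. The data is described as standard JSON, and the train split (297 examples) can be loaded with the Hugging Face Datasets library for direct integration into existing experiment pipelines.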
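The loading and parsing steps described above might look like the following sketch. It assumes the `datasets` package is installed and that the repository ID shown on this page is reachable; `load_train_split`, `role_counts`, and `tool_names` are hypothetical helper names. It also assumes each row exposes `conversations` and `tool_definitions` as lists of dicts, which is how the `list:` features in the metadata are expected to decode.

```python
from collections import Counter
from typing import Any

# Repository ID as shown on this page.
DATASET_ID = (
    "DCAgent2/swebench_verified_random_100_folders_Skywork_SWE_32B_20260424_231641"
)


def load_train_split():
    """Load the train split (needs `pip install datasets` and network access)."""
    # Imported lazily so the pure-Python helpers below work without the package.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="train")


def role_counts(record: dict[str, Any]) -> Counter:
    """Count conversation turns per role (for example system, user, assistant)."""
    return Counter(turn["role"] for turn in record["conversations"])


def tool_names(record: dict[str, Any]) -> list[str]:
    """List the function names declared in a record's tool_definitions."""
    return [
        tool["function"]["name"]
        for tool in (record.get("tool_definitions") or [])
    ]
```

In use, one would call `load_train_split()` once and then apply `role_counts` and `tool_names` to individual rows, for example to find the longest trajectories or to confirm that every row declares the same tool set.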

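Background and Challenges

Background Overview

Evaluation of large language models for software engineering has long been limited by a shortage of realistic, reproducible programming benchmarks. The swebench_verified_random_100_folders_Skywork_SWE_32B_20260424_231641 dataset targets this gap. According to the page metadata, it was published by DCAgent2 in April 2026, and it records how the 32B-parameter Skywork-SWE-32B model attempts to resolve software engineering issues at the repository level. The central question is how well the model performs as an agent on tasks from 100 randomly sampled folders drawn from real GitHub repositories, covering scenarios such as debugging and feature completion. The dataset offers a quantitative evaluation resource for work on automated code repair and intelligent programming assistants.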

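Current Challenges

The dataset faces two main challenges. On the domain side, traditional code evaluation focuses on function-level completion or unit tests, which says little about a model's ability to locate, understand, and modify code with multi-file dependencies in a large repository. This dataset instead requires the model to handle cross-folder context and tool calls, which makes the tasks considerably more realistic. On the construction side, tasks have to be sampled at random from SWE-bench Verified so that the sample distribution is unbiased. Each task also needs standardized tool definitions (for example file editing, command execution, and security checks) and a complete record of the conversation history and verification results, which in turn calls for precise routing strategies and exception handling to keep the data reliable and reproducible.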

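Common Scenarios

Typical Use Cases

At the intersection of software engineering and artificial intelligence, the swebench_verified_random_100_folders_Skywork_SWE_32B_20260424_231641 dataset stands out for its conversation structure and tool-call records, which make it a useful reference for assessing how well large language models repair code and carry out tasks. It contains complete interaction logs between the model and the development environment, covering the problem description, command execution, file edits, and result verification, and so gives researchers rich data that mirrors real software maintenance. The typical use case is performance evaluation on automated bug fixing and feature enhancement: comparing the model's proposed solution with the expected outcome quantifies how well the model understands a complex codebase, locates the root cause of a defect, and produces a correct patch.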
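The comparison of outcomes described above can be sketched as a per-task success rate. This is an illustration under stated assumptions: the page does not document which values the `result` string takes, so the caller must supply the success predicate, and the function names `overall_success_rate` and `success_rate_by_task` are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable

Record = dict[str, Any]


def overall_success_rate(
    records: Iterable[Record], is_success: Callable[[Record], bool]
) -> float:
    """Fraction of records the predicate accepts; 0.0 for an empty input."""
    rows = list(records)
    if not rows:
        return 0.0
    return sum(1 for row in rows if is_success(row)) / len(rows)


def success_rate_by_task(
    records: Iterable[Record], is_success: Callable[[Record], bool]
) -> dict[str, float]:
    """Group records by their `task` field and compute a success rate per task."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for row in records:
        task = row["task"]
        totals[task] += 1
        if is_success(row):
            wins[task] += 1
    return {task: wins[task] / totals[task] for task in totals}


# Example predicate. The literal "resolved" is an ASSUMED label, not a value
# confirmed by this page; inspect the real `result` strings before relying on it.
# is_success = lambda row: row["result"] == "resolved"
```

Grouping by `task` matters here because the train split has 297 rows, more than the number of sampled folders, so some tasks presumably appear in more than one trial.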

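Academic Problems Addressed

The core value of the dataset lies in a long-standing academic problem in software engineering: how to evaluate language models on real-world code repair tasks systematically and repeatably. Traditional evaluation relies mostly on static question answering or generation tasks, which fail to capture the dynamic decisions and multi-step reasoning involved in modifying code. By recording the full history of tool use and dialogue, this dataset lets researchers analyze the model's behavior during exploratory testing, fault localization, and repair planning, and thus supports objective measurement of its reasoning and code understanding. It also provides a data foundation for standardized evaluation of software agents and helps the research community understand and improve AI-assisted programming.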

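Practical Applications

In industrial practice, the dataset is a training and testing resource for building and validating intelligent code assistants. From these interaction records, developers can train text generation models to read error reports, search for relevant code, and apply precise edits. The range of tools covered approximates the typical workflow of an integrated development environment, so the data can support the development of automated continuous integration pipelines, intelligent troubleshooting systems, and code review aids. Such applications aim to shorten issue response times, reduce the cognitive load of manual debugging, and improve the efficiency and reliability of code maintenance.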

Recent Research on the Dataset