daVinci-Dev

Name: daVinci-Dev
Creator: maas
Published: 2026-05-08 17:48:39
License: No description available

ModelScope community; updated 2026-05-08, indexed 2026-05-10

Download link:

https://modelscope.cn/datasets/GAIR/daVinci-Dev


Resource description:

<div style="display: flex; justify-content: center; align-items: center; gap: 20px; margin-bottom: 10px">
<img src="assets/sii.png" alt="SII" width="100px">
<img src="assets/GAIR_Logo2.png" alt="GAIR" width="100px">
</div>

<div align="center">

[![Paper](https://img.shields.io/badge/Paper-PDF-1f6feb.svg)](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/daVinci-Dev.pdf)
[![arXiv](https://img.shields.io/badge/arXiv-2601.18418-b31b1b.svg)](https://arxiv.org/pdf/2601.18418)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-green)](https://github.com/GAIR-NLP/daVinci-Dev)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/GAIR/daVinci-Dev)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https://huggingface.co/GAIR/daVinci-Dev-72B)

</div>

<h1 align="center">daVinci-Dev Dataset: Agent-native Mid-training for Software Engineering</h1>

<div align="center">
<img src="assets/teaser.png" width="100%" />
</div>

This dataset release contains the **agent-native trajectories** used in *daVinci-Dev: Agent-native Mid-training for Software Engineering*.

## Table of Contents

- [Dataset at a glance](#dataset-at-a-glance)
- [Dataset files](#dataset-files)
- [Model Zoo](#model-zoo)
- [Pipeline](#pipeline)
- [Converting Datasets into LLM-trainable text](#converting-datasets-into-llm-trainable-text)
- [LLM enhancement details](#llm-enhancement-details)
- [Intended uses](#intended-uses)
- [License](#license)
- [Citation](#citation)

## Dataset at a glance

The release includes two complementary data sources:

1. **Contextually-native trajectories $\mathcal{D}^{\text{ctx}}_{\text{py}}$ (PR-derived, Python Variant)**
   - Constructed from GitHub pull requests.
   - The open source release includes only PRs from repositories with a **permissive license**.
   - This is ~**60%** of the full PR-derived corpus, totaling ~**4.1M PRs**.
   - PR content is additionally summarized / enhanced with an LLM (details below).
   - The data is stored in structured parquet format. To convert it into LLM-trainable text, see the instructions below.
2. **Environmentally-native trajectories $\mathcal{D}^{\text{env}}_{\text{pass}}$ (executable rollouts, test-passing subset)**
   - Collected by rolling out [**SWE-Agent**](https://github.com/SWE-agent/SWE-agent) with [**GLM-4.6**](https://huggingface.co/zai-org/GLM-4.6) in real repositories from the [**SWE-rebench**](https://huggingface.co/datasets/nebius/SWE-rebench) dataset.
   - The source dataset is **CC-BY-4.0**: https://huggingface.co/datasets/nebius/SWE-rebench

## Dataset files

### Contextually-native $\mathcal{D}^{\text{ctx}}_{\text{py}}$ (PR-derived)

These parquet shards store a structured representation of PRs.

- Repository metadata (including detected license):
  - `./ctx-native/filtered_repos/part-0000.parquet`

  contains one row per filtered repository with fields like `repo_id`, `full_name`, `description`, `language`, stars, and `license_key` (schema: [`models.PublicRepo`](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/pipeline/models/models.go#L4)).
- PR metadata (small file containing basic info about each PR):
  - `./ctx-native/filtered_prs/part-0000.parquet`
  - `./ctx-native/filtered_prs/part-0001.parquet`
  - …

  contain one row per PR with identifiers plus title/body/author metadata and coarse file-change stats (schema: [`models.PRMetadata`](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/pipeline/models/models.go#L23)).
- Structured PR trajectories (LLM-enhanced):
  - `./ctx-native/llm_enhanced_prs/part-0000.parquet`
  - `./ctx-native/llm_enhanced_prs/part-0001.parquet`
  - `./ctx-native/llm_enhanced_prs/part-0002.parquet`
  - …

  contain one row per PR with repo/PR text fields, related issue content, relevant file snapshots, commit diffs with refined commit messages, and an LLM-written PR summary (schema: [`models.LLMEnhancedPRData`](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/pipeline/models/models.go#L148)).

### Environmentally-native $\mathcal{D}^{\text{env}}_{\text{pass}}$ (executable rollouts)

- Test-passing subset in JSONL ([SWE-Agent](https://github.com/SWE-agent/SWE-agent) + [GLM-4.6](https://huggingface.co/zai-org/GLM-4.6) rollouts on [SWE-rebench](https://huggingface.co/datasets/nebius/SWE-rebench)):
  - `./env-native.jsonl`

## Model Zoo

Trained checkpoints are released on Hugging Face:

| Model | Description | Link |
|------|-------------|------|
| `daVinci-Dev-72B` | Final model (agent-native mid-training + env-native SFT) | https://huggingface.co/GAIR/daVinci-Dev-72B |
| `daVinci-Dev-32B` | Final model (agent-native mid-training + env-native SFT) | https://huggingface.co/GAIR/daVinci-Dev-32B |
| `daVinci-Dev-72B-MT` | **MT checkpoint** (after agent-native mid-training, **before SFT**) | https://huggingface.co/GAIR/daVinci-Dev-72B-MT |
| `daVinci-Dev-32B-MT` | **MT checkpoint** (after agent-native mid-training, **before SFT**) | https://huggingface.co/GAIR/daVinci-Dev-32B-MT |

## Pipeline

The GitHub repository contains a high-performance pipeline that calls the GitHub API and constructs the structured PR representation used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$.

| Pipeline | Description | Link |
|----------|---------|-------------|
| daVinci-Dev Pipeline | A high-performance pipeline used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$ | [`GAIR-NLP/daVinci-Dev`](https://github.com/GAIR-NLP/daVinci-Dev) |

## Converting Datasets into LLM-trainable text

### Converting PR structure $\mathcal{D}^{\text{ctx}}_{\text{py}}$

To convert the structured PR representation into a linearized, LLM-trainable format, follow:

- https://github.com/GAIR-NLP/daVinci-Dev/blob/main/pipeline/text_from_huggingface.md

### Converting executable rollouts $\mathcal{D}^{\text{env}}_{\text{pass}}$

- https://github.com/GAIR-NLP/daVinci-Dev/blob/main/env_traj_utils/README.md

## LLM enhancement details

We used **Qwen/Qwen3-235B-A22B-Instruct-2507** (https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) to:

- summarize PR content (e.g., description and commits), and
- enhance commit messages into more explicit, training-friendly descriptions.

## Intended uses

- Agentic software engineering mid-training (e.g., learning iterative edit patterns from PR histories).
- Research on PR understanding, patch generation, and edit planning.
- Building instruction-style corpora from structured PR data via the provided pipeline.

## License

This project is a **mixed** release:

- **Contextually-native PR-derived subset:** only PRs from repositories detected as having a **permissive license** are included. Each repo’s license is provided in `./ctx-native/filtered_repos/part-0000.parquet`.
- **Environmentally-native subset:** derived from [**SWE-rebench**](https://huggingface.co/datasets/nebius/SWE-rebench), licensed under **CC-BY-4.0**.
- **daVinci-Dev models:** released under the [Qwen](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE) license. Users should verify the licensing status of any generated code before using it in production.
- **daVinci-Dev pipeline:** released under the [Apache-2.0](https://github.com/GAIR-NLP/daVinci-Dev/blob/main/LICENSE) license.

Users are responsible for ensuring their downstream usage complies with the licenses of the underlying sources.

## Citation

If you use this work, please cite the daVinci-Dev paper.

```
@misc{zeng2026davincidevagentnativemidtrainingsoftware,
  title={daVinci-Dev: Agent-native Mid-training for Software Engineering},
  author={Ji Zeng and Dayuan Fu and Tiantian Mi and Yumin Zhuang and Yaxing Huang and Xuefeng Li and Lyumanshan Ye and Muhang Xie and Qishuo Hua and Zhen Huang and Mohan Jiang and Hanning Wang and Jifan Lin and Yang Xiao and Jie Sun and Yunze Wu and Pengfei Liu},
  year={2026},
  eprint={2601.18418},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2601.18418},
}
```
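The contextually-native files listed under "Dataset files" follow a `part-NNNN.parquet` naming pattern inside `./ctx-native/<subset>/`. As a minimal sketch, assuming the dataset has already been downloaded to a local directory with that layout, the shards of one subset could be enumerated in index order as below. The helper name `list_shards` and the `root` argument are hypothetical and not part of the daVinci-Dev repository; the sketch only lists paths and does not read parquet content.

```python
import re
from pathlib import Path

# Shard names in the dataset card look like part-0000.parquet, part-0001.parquet, ...
SHARD_NAME = re.compile(r"part-(\d+)\.parquet")


def list_shards(root, subset):
    """Return the parquet shards of one ctx-native subset, ordered by shard index.

    `root` is a hypothetical local copy of the dataset; `subset` is one of the
    directory names from the card, e.g. "filtered_repos", "filtered_prs",
    or "llm_enhanced_prs".
    """
    directory = Path(root) / "ctx-native" / subset
    indexed = []
    for path in directory.iterdir():
        match = SHARD_NAME.fullmatch(path.name)
        if match:  # ignore anything that is not a part-NNNN.parquet shard
            indexed.append((int(match.group(1)), path))
    # Sort numerically so that part-0010 comes after part-0002.
    return [path for _, path in sorted(indexed)]
```

The returned paths can then be passed to whichever parquet reader is in use; the official route for turning the rows into training text remains the conversion guide linked in the card.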

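The environmentally-native subset ships as a single JSONL file, `./env-native.jsonl`. The dataset card does not list the fields of each record, so the sketch below makes no assumption about them: it streams the file one record at a time and reports which top-level keys actually occur. The function names `iter_jsonl` and `top_level_keys` are hypothetical helpers for illustration, not part of the daVinci-Dev tooling.

```python
import json


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err


def top_level_keys(path, limit=100):
    """Collect the top-level keys seen in the first `limit` records."""
    keys = set()
    for count, record in enumerate(iter_jsonl(path)):
        if count >= limit:
            break
        if isinstance(record, dict):
            keys.update(record)
    return sorted(keys)
```

Calling `top_level_keys("./env-native.jsonl")` is one way to inspect the record layout before relying on any field name; streaming keeps memory use flat regardless of file size.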

Provider:

maas

Created:

2026-01-26
