
NextCoderDataset

ModelScope community · Updated 2026-01-06 · Indexed 2025-07-26
Download link:
https://modelscope.cn/datasets/microsoft/NextCoderDataset
Description:
# NextCoderDataset

<p align="center">
<a href="https://github.com/microsoft/NextCoder">GitHub</a>&nbsp;&nbsp;|&nbsp;&nbsp;<a href="https://www.microsoft.com/en-us/research/publication/nextcoder-robust-adaptation-of-code-lms-to-diverse-code-edits/">Paper</a>
</p>

> NextCoder: Robust Adaptation of Code LMs to Diverse Code Edits (ICML 2025)

## Data Overview

NextCoderDataset is the instruction variant of a synthetic dataset used to train models on code-editing scenarios. It comprises around 381k (127k*3) samples across 8 programming languages: Python, Java, C++, C, Rust, JavaScript, Go, and Kotlin.

It is used to finetune the **[NextCoder family](https://huggingface.co/collections/microsoft/nextcoder-6815ee6bfcf4e42f20d45028)** of models with the novel **Selective Knowledge Transfer** finetuning methodology.

## Data Distribution

- The samples in NextCoderDataset are generated with the GPT-4o and Llama-3.3-70B-Instruct models from a filtered version of [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata), using samples from the corresponding languages.
- We do not include any benchmark or test splits.

| Language | Unique Count |
|------------|--------------|
| JavaScript | 16030 |
| Python | 15279 |
| C | 17153 |
| C++ | 17337 |
| Rust | 16438 |
| Go | 15204 |
| Kotlin | 13272 |
| Java | 16328 |

## Data Fields

| Field | Type | Description |
|------------|--------|-------------|
| prompt | string | Instruction prompt along with the input code. |
| completion | string | Ground-truth edited version of the original code, as per the instruction. |

## Dataset Characterization

- Data Collection Method
  - [Synthetic]
- Labelling Method
  - [Synthetic]

## Use Case

- Training/finetuning of large language models on diverse code-editing scenarios

## Intended Use

The NextCoderDataset is intended to be used by the community to continue to improve open models. The data may be freely used to train models. However, users who elect to use the dataset are responsible for checking whether the dataset license fits their intended purpose.

## Citation

```bibtex
@inproceedings{aggarwal2025nextcoder,
  author    = {Aggarwal, Tushar and Singh, Swayam and Awasthi, Abhijeet and Kanade, Aditya and Natarajan, Nagarajan},
  title     = {NextCoder: Robust Adaptation of Code LMs to Diverse Code Edits},
  booktitle = {International Conference on Machine Learning},
  year      = {2025},
  url       = {https://www.microsoft.com/en-us/research/publication/nextcoder-robust-adaptation-of-code-lms-to-diverse-code-edits/},
}
```
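
## Example: Sanity-Checking Counts and Fields

The per-language unique counts in the Data Distribution table can be checked against the "around 381k (127k*3)" figure in the Data Overview. The snippet below is a minimal sketch, not part of the official tooling. The names `UNIQUE_COUNTS`, `total_unique`, and `is_valid_record` are hypothetical helpers introduced for illustration. The card does not say what the factor 3 denotes, so it is treated here as a plain multiplier. The record check assumes only what the Data Fields table states: each sample has a string `prompt` and a string `completion`.

```python
# Unique sample counts per language, copied from the Data Distribution table.
UNIQUE_COUNTS = {
    "JavaScript": 16030,
    "Python": 15279,
    "C": 17153,
    "C++": 17337,
    "Rust": 16438,
    "Go": 15204,
    "Kotlin": 13272,
    "Java": 16328,
}

# Multiplier from the "127k*3" figure in the Data Overview.
# Assumption: the card does not explain it, so it is used as-is.
VARIANT_FACTOR = 3


def total_unique(counts):
    """Sum the per-language unique counts."""
    return sum(counts.values())


def is_valid_record(record):
    """Return True if the record has string 'prompt' and 'completion' fields."""
    return all(isinstance(record.get(key), str) for key in ("prompt", "completion"))


if __name__ == "__main__":
    unique = total_unique(UNIQUE_COUNTS)
    print(f"unique samples: {unique}")
    print(f"total samples:  {unique * VARIANT_FACTOR}")
    print(is_valid_record({"prompt": "Add type hints.\n<code>", "completion": "<edited code>"}))
```

The unique counts sum to 127,041, and multiplying by 3 gives 381,123, which matches the rounded figures in the overview.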
Provider:
maas
Created:
2025-07-22