NextCoderDataset-Conversational

ModelScope community. Updated 2025-12-05; indexed 2025-08-16.
Download link:
https://modelscope.cn/datasets/microsoft/NextCoderDataset-Conversational
# NextCoderDataset-Conversational

<p align="center">
    <a href="https://github.com/microsoft/NextCoder">GitHub</a>&nbsp;&nbsp; | &nbsp;&nbsp;<a href="https://www.microsoft.com/en-us/research/publication/nextcoder-robust-adaptation-of-code-lms-to-diverse-code-edits/">Paper</a>
</p>

> NextCoder: Robust Adaptation of Code LMs to Diverse Code Edits (ICML 2025)

## Data Overview

NextCoderDataset-Conversational is the multi-turn conversational variant of the synthetic NextCoderDataset. It is used for training models on code-editing scenarios and comprises around 57k samples across 8 programming languages: Python, Java, C++, C, Rust, JavaScript, Go and Kotlin.

It is used to finetune the **[NextCoder family](https://huggingface.co/collections/microsoft/nextcoder-6815ee6bfcf4e42f20d45028)** of models using the novel **Selective Knowledge Transfer** finetuning methodology.

## Data Distribution

- The samples in NextCoderDataset are generated with GPT-4o and Llama-3.3-70B-Instruct from a filtered version of [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata), using samples from the corresponding languages.
- We do not include any benchmark or test splits.

| Language | Unique Count |
|----------|-------|
| JavaScript | 9261 |
| Python | 8800 |
| C | 8213 |
| C++ | 7367 |
| Rust | 6398 |
| Go | 6028 |
| Kotlin | 5785 |
| Java | 5590 |

## Data Fields

| Field | Type | Description |
|----------|----------|----------------------------------------------------------------------|
| messages | array\<string\> | user-assistant conversation for editing the source code according to an instruction |

## Dataset Characterization

- Data Collection Method
  - [Synthetic]
- Labelling Method
  - [Synthetic]

## Use Case

- Training/finetuning of Large Language Models on diverse code-editing scenarios

## Intended Use

The NextCoderDataset is intended to be used by the community to continue to improve open models. The data may be freely used to train models. However, users who elect to use the dataset are responsible for checking whether the dataset license fits their intended purpose.

## Citation

```bibtex
@inproceedings{aggarwal2025nextcoder,
  author    = {Aggarwal, Tushar and Singh, Swayam and Awasthi, Abhijeet and Kanade, Aditya and Natarajan, Nagarajan},
  title     = {NextCoder: Robust Adaptation of Code LMs to Diverse Code Edits},
  booktitle = {International Conference on Machine Learning},
  year      = {2025},
  url       = {https://www.microsoft.com/en-us/research/publication/nextcoder-robust-adaptation-of-code-lms-to-diverse-code-edits/},
}
```
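
As a quick sanity check on the "around 57k samples" figure, the per-language counts from the Data Distribution table can be summed and turned into percentage shares. The sketch below only uses the numbers printed in that table; the names `LANGUAGE_COUNTS` and `language_shares` are illustrative and are not part of the dataset or of any NextCoder tooling.

```python
# Per-language sample counts, copied from the Data Distribution table.
LANGUAGE_COUNTS = {
    "JavaScript": 9261,
    "Python": 8800,
    "C": 8213,
    "C++": 7367,
    "Rust": 6398,
    "Go": 6028,
    "Kotlin": 5785,
    "Java": 5590,
}


def language_shares(counts):
    """Return (total, shares), where shares maps each language to its
    percentage of the total, rounded to one decimal place."""
    total = sum(counts.values())
    shares = {lang: round(100 * n / total, 1) for lang, n in counts.items()}
    return total, shares


if __name__ == "__main__":
    total, shares = language_shares(LANGUAGE_COUNTS)
    print(f"total samples: {total}")  # 57442, i.e. "around 57k"
    for lang, pct in shares.items():
        print(f"{lang:<10} {pct:>5.1f}%")
```

The table rows sum to 57,442, which matches the rounded figure in the overview. No single language dominates: the largest (JavaScript) and smallest (Java) differ by fewer than 4,000 samples.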

Provider: maas
Created: 2025-07-22