microsoft/NextCoderDataset
Source: Hugging Face. Updated: 2025-07-08. Indexed: 2025-08-09.
Download link:
https://hf-mirror.com/datasets/microsoft/NextCoderDataset
Description:
NextCoderDataset is a synthetic dataset for code-editing scenarios, comprising approximately 381k samples across 8 programming languages: Python, Java, C++, C, Rust, JavaScript, Go, and Kotlin. It is used to fine-tune the NextCoder family of models with a novel fine-tuning method, Selective Knowledge Transfer. The data was generated with the GPT-4o and Llama-3.3-70B-Instruct models from filtered samples of the StarCoderData dataset.
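A minimal sketch of how one might inspect a few records, assuming the Hugging Face `datasets` library is installed and that the dataset exposes a split named "train" (an assumption, not stated in this listing). The helper name `peek_samples` is hypothetical. Streaming is used so that the full set of roughly 381k samples is not downloaded up front.

```python
from itertools import islice

DATASET_ID = "microsoft/NextCoderDataset"  # repository id from this listing


def peek_samples(n=3, split="train"):
    """Return the first n records without downloading the full dataset.

    Assumption: the split is named "train"; consult the dataset card if it is not.
    """
    # Imported lazily so this module still loads where `datasets` is not installed.
    from datasets import load_dataset

    stream = load_dataset(DATASET_ID, split=split, streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    for record in peek_samples():
        # Column names are not given in this listing, so print them rather than assume.
        print(sorted(record.keys()))
```

Printing the keys first is a deliberate choice: the listing does not document the schema, so the sketch discovers the field names instead of hard-coding them.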
Provider:
microsoft



