opencoder-sft-stage1

Name: opencoder-sft-stage1
Creator: maas
Published: 2025-11-27 16:17:46
License: 暂无描述

魔搭社区2025-11-27 更新2024-11-30 收录

下载链接：

https://modelscope.cn/datasets/AI-ModelScope/opencoder-sft-stage1

下载链接

链接失效反馈

官方服务：

资源简介：

![image](https://github.com/user-attachments/assets/66e5afec-060d-43c0-937e-dd7b6b1a26ef) # OpenCoder Dataset The OpenCoder dataset is composed of the following datasets: * [opc-sft-stage1](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1): the sft data used for opencoder sft-stage1 **<-- you are here** * [opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2): the sft data used for opencoder sft-stage2 * [opc-annealing-corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus): the synthetic data & algorithmic corpus used for opencoder annealing * [opc-fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus): the code-related page recalled from fineweb * [opc-fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus): the math-related page recalled from fineweb * [refineCode-code-corpus-meta](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta): the meta-data of RefineCode Detailed information about the data can be found in our [paper](https://arxiv.org/abs/2411.04905). ## sft-stage1 summary This dataset is used in OpenCoder's Stage 1 and consists of three parts: * **Filtered_infinity_instruct**: Filtered from [infinity_instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) using LLM to extract code-related content. Since the original outputs were often low-quality (e.g., overly concise responses, inconsistent code formatting), we recommend regenerating them with a stronger LLM based on the given instructions. * **Realuser_instruct**: Extracted bilingual code-related instructions from GPT conversation histories like [ShareGPT](https://github.com/domeccleston/sharegpt) and [WildChat](https://huggingface.co/datasets/allenai/WildChat). Low-quality responses were regenerated.This portion of data, sampled from real users, is of high quality and greatly enhances the practical performance of code LLMs * **Largescale_diverse_instruct**: Generated using a pipeline based on seeds like CommonCrawl and Source Code. This dataset provides diverse code-related instructions. ## How to use it ```python from datasets import load_dataset realuser_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage1", "realuser_instruct") filtered_infinity_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage1", "filtered_infinity_instryuct") largescale_diverse_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage1", "largescale_diverse_instruct") ``` ## Citation Information Please consider citing our [paper](https://arxiv.org/abs/2411.04905) if you find this dataset useful: ``` @inproceedings{Huang2024OpenCoderTO, title = {OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models}, author = {Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu}, year = {2024}, url = {https://arxiv.org/pdf/2411.04905} } ```

# OpenCoder数据集 OpenCoder数据集由以下数据集组成： * [opc-sft-stage1](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1): 用于OpenCoder监督微调（Supervised Fine-Tuning，SFT）第一阶段的SFT数据 **<-- 您当前位于此处** * [opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2): 用于OpenCoder SFT第二阶段的SFT数据 * [opc-annealing-corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus): 用于OpenCoder退火训练的合成数据与算法语料库 * [opc-fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus): 从FineWeb中召回的代码相关网页数据 * [opc-fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus): 从FineWeb中召回的数学相关网页数据 * [refineCode-code-corpus-meta](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta): RefineCode的元数据该数据集的详细信息可参阅我们的[研究论文](https://arxiv.org/abs/2411.04905)。 ## SFT第一阶段数据集概况本数据集应用于OpenCoder的第一阶段训练，包含三个子部分： * **Filtered_infinity_instruct**：通过大语言模型（Large Language Model，LLM）从[infinity_instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)中筛选提取代码相关内容得到。由于原始输出质量普遍偏低（例如回答过于简洁、代码格式不一致），建议基于给定指令使用更强的LLM重新生成回答内容。 * **Realuser_instruct**：从[ShareGPT](https://github.com/domeccleston/sharegpt)与[WildChat](https://huggingface.co/datasets/allenai/WildChat)等GPT对话历史中提取的双语代码相关指令数据，并对低质量回答进行了重新生成。该部分数据源自真实用户采样，质量优异，可显著提升代码大语言模型的实际应用性能。 * **Largescale_diverse_instruct**：基于CommonCrawl与源代码等种子数据通过流水线生成的数据集，提供多样化的代码相关指令。 ## 使用方法 python from datasets import load_dataset realuser_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage1", "realuser_instruct") filtered_infinity_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage1", "filtered_infinity_instryuct") largescale_diverse_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage1", "largescale_diverse_instruct") ## 引用信息若您的工作使用了本数据集，请引用我们的[研究论文](https://arxiv.org/abs/2411.04905)： @inproceedings{Huang2024OpenCoderTO, title = {OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models}, author = {Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu}, year = {2024}, url = {https://arxiv.org/pdf/2411.04905} }

提供机构：

maas

创建时间：

2024-11-11

搜集汇总

数据集介绍

背景与挑战

背景概述

该数据集是OpenCoder SFT阶段1的训练数据，由三个部分组成：Filtered_infinity_instruct通过大模型从infinity_instruct中提取代码相关内容，Realuser_instruct源自真实用户对话历史的高质量双语指令，以及Largescale_diverse_instruct基于种子生成的多样化代码指令。

以上内容由遇见数据集搜集并总结生成