opencoder-sft-stage1
收藏魔搭社区2025-11-27 更新2024-11-30 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/opencoder-sft-stage1
下载链接
链接失效反馈官方服务:
资源简介:

# OpenCoder Dataset
The OpenCoder dataset is composed of the following datasets:
* [opc-sft-stage1](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1): the sft data used for opencoder sft-stage1 **<-- you are here**
* [opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2): the sft data used for opencoder sft-stage2
* [opc-annealing-corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus): the synthetic data & algorithmic corpus used for opencoder annealing
* [opc-fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus): the code-related page recalled from fineweb
* [opc-fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus): the math-related page recalled from fineweb
* [refineCode-code-corpus-meta](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta): the meta-data of RefineCode
Detailed information about the data can be found in our [paper](https://arxiv.org/abs/2411.04905).
## sft-stage1 summary
This dataset is used in OpenCoder's Stage 1 and consists of three parts:
* **Filtered_infinity_instruct**: Filtered from [infinity_instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) using LLM to extract code-related content. Since the original outputs were often low-quality (e.g., overly concise responses, inconsistent code formatting), we recommend regenerating them with a stronger LLM based on the given instructions.
* **Realuser_instruct**: Extracted bilingual code-related instructions from GPT conversation histories like [ShareGPT](https://github.com/domeccleston/sharegpt) and [WildChat](https://huggingface.co/datasets/allenai/WildChat). Low-quality responses were regenerated.This portion of data, sampled from real users, is of high quality and greatly enhances the practical performance of code LLMs
* **Largescale_diverse_instruct**: Generated using a pipeline based on seeds like CommonCrawl and Source Code. This dataset provides diverse code-related instructions.
## How to use it
```python
from datasets import load_dataset
realuser_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage1", "realuser_instruct")
filtered_infinity_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage1", "filtered_infinity_instryuct")
largescale_diverse_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage1", "largescale_diverse_instruct")
```
## Citation Information
Please consider citing our [paper](https://arxiv.org/abs/2411.04905) if you find this dataset useful:
```
@inproceedings{Huang2024OpenCoderTO,
title = {OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models},
author = {Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu},
year = {2024},
url = {https://arxiv.org/pdf/2411.04905}
}
```
# OpenCoder数据集
OpenCoder数据集由以下数据集组成:
* [opc-sft-stage1](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1): 用于OpenCoder监督微调(Supervised Fine-Tuning,SFT)第一阶段的SFT数据 **<-- 您当前位于此处**
* [opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2): 用于OpenCoder SFT第二阶段的SFT数据
* [opc-annealing-corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus): 用于OpenCoder退火训练的合成数据与算法语料库
* [opc-fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus): 从FineWeb中召回的代码相关网页数据
* [opc-fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus): 从FineWeb中召回的数学相关网页数据
* [refineCode-code-corpus-meta](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta): RefineCode的元数据
该数据集的详细信息可参阅我们的[研究论文](https://arxiv.org/abs/2411.04905)。
## SFT第一阶段数据集概况
本数据集应用于OpenCoder的第一阶段训练,包含三个子部分:
* **Filtered_infinity_instruct**:通过大语言模型(Large Language Model,LLM)从[infinity_instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)中筛选提取代码相关内容得到。由于原始输出质量普遍偏低(例如回答过于简洁、代码格式不一致),建议基于给定指令使用更强的LLM重新生成回答内容。
* **Realuser_instruct**:从[ShareGPT](https://github.com/domeccleston/sharegpt)与[WildChat](https://huggingface.co/datasets/allenai/WildChat)等GPT对话历史中提取的双语代码相关指令数据,并对低质量回答进行了重新生成。该部分数据源自真实用户采样,质量优异,可显著提升代码大语言模型的实际应用性能。
* **Largescale_diverse_instruct**:基于CommonCrawl与源代码等种子数据通过流水线生成的数据集,提供多样化的代码相关指令。
## 使用方法
python
from datasets import load_dataset
realuser_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage1", "realuser_instruct")
filtered_infinity_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage1", "filtered_infinity_instryuct")
largescale_diverse_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage1", "largescale_diverse_instruct")
## 引用信息
若您的工作使用了本数据集,请引用我们的[研究论文](https://arxiv.org/abs/2411.04905):
@inproceedings{Huang2024OpenCoderTO,
title = {OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models},
author = {Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu},
year = {2024},
url = {https://arxiv.org/pdf/2411.04905}
}
提供机构:
maas
创建时间:
2024-11-11



