opencoder-sft-stage2
收藏魔搭社区2026-01-02 更新2024-11-30 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/opencoder-sft-stage2
下载链接
链接失效反馈官方服务:
资源简介:

# OpenCoder Dataset
The OpenCoder dataset is composed of the following datasets:
* [opc-sft-stage1](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1): the sft data used for opencoder sft-stage1
* [opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2): the sft data used for opencoder sft-stage2 **<-- you are here**
* [opc-annealing-corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus): the synthetic data & algorithmic corpus used for opencoder annealing
* [opc-fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus): the code-related page recalled from fineweb
* [opc-fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus): the math-related page recalled from fineweb
* [refineCode-code-corpus-meta](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta): the meta-data of RefineCode
Detailed information about the data can be found in our [paper](https://arxiv.org/abs/2411.04905).
## sft-stage2 summary
This dataset is used in OpenCoder's Stage 2 and consists of four parts:
* **educational_instruct**: Using the [algorithmic corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus) as a seed, we generated (instruction, code, test case) triples, validated through a Python compiler. Notably, the inclusion of test cases provides a valuable signal for code RL.
* **evol_instruct**: Directly using the open-source version [MagicCoder-Evol-Instruct-110k](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K).
* **mceval_instruct**: Directly using the open-source version [McEval-Instruct](https://huggingface.co/datasets/Multilingual-Multimodal-NLP/McEval-Instruct).
* **package_instruct**: We extracted common interface documentation from pydoc and used it as a seed to generate Python package-related questions.
## How to use it
```python
from datasets import load_dataset
educational_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage2", "educational_instruct")
evol_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage2", "evol_instruct")
mceval_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage2", "mceval_instruct")
package_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage2", "package_instruct")
```
## Citation Information
Please consider citing our [paper](https://arxiv.org/abs/2411.04905) if you find this dataset useful:
```
@inproceedings{Huang2024OpenCoderTO,
title = {OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models},
author = {Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu},
year = {2024},
url = {https://arxiv.org/pdf/2411.04905}
}
```
# OpenCoder 数据集
OpenCoder 数据集由以下数据集组成:
* [opc-sft-stage1](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1):用于OpenCoder监督微调(Supervised Fine-Tuning, SFT)阶段1的训练数据
* [opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2):用于OpenCoder监督微调阶段2的训练数据 **<-- 您当前所在位置**
* [opc-annealing-corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus):用于OpenCoder退火训练的合成数据与算法语料库
* [opc-fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus):从FineWeb中召回的代码相关页面
* [opc-fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus):从FineWeb中召回的数学相关页面
* [refineCode-code-corpus-meta](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta):RefineCode的元数据
该数据集的详细信息可参阅我们的[论文](https://arxiv.org/abs/2411.04905)。
## SFT阶段2数据集概览
本数据集用于OpenCoder的第二阶段训练,共包含四个组成部分:
* **educational_instruct**:以[算法语料库](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus)为种子数据,我们生成了(指令、代码、测试用例)三元组,并通过Python编译器完成验证。值得注意的是,测试用例的加入为代码强化学习提供了极具价值的训练信号。
* **evol_instruct**:直接采用开源版本[MagicCoder-Evol-Instruct-110k](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K)。
* **mceval_instruct**:直接采用开源版本[McEval-Instruct](https://huggingface.co/datasets/Multilingual-Multimodal-NLP/McEval-Instruct)。
* **package_instruct**:我们从pydoc中提取通用接口文档,并以此为种子生成了与Python包相关的问题。
## 使用方法
python
from datasets import load_dataset
educational_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage2", "educational_instruct")
evol_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage2", "evol_instruct")
mceval_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage2", "mceval_instruct")
package_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage2", "package_instruct")
## 引用信息
若您发现本数据集对研究有所助益,请引用我们的[论文](https://arxiv.org/abs/2411.04905):
@inproceedings{Huang2024OpenCoderTO,
title = {OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models},
author = {Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu},
year = {2024},
url = {https://arxiv.org/pdf/2411.04905}
}
提供机构:
maas
创建时间:
2024-11-11



