
opc-sft-stage2

ModelScope Community | Updated: 2026-04-23 | Indexed: 2024-11-30
Download link:
https://modelscope.cn/datasets/AI-ModelScope/opc-sft-stage2
Resource description:
![image](https://github.com/user-attachments/assets/66e5afec-060d-43c0-937e-dd7b6b1a26ef)

# OpenCoder Dataset

The OpenCoder dataset is composed of the following datasets:

* [opc-sft-stage1](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1): the SFT data used for OpenCoder sft-stage1
* [opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2): the SFT data used for OpenCoder sft-stage2 **<-- you are here**
* [opc-annealing-corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus): the synthetic data and algorithmic corpus used for OpenCoder annealing
* [opc-fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus): the code-related pages recalled from FineWeb
* [opc-fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus): the math-related pages recalled from FineWeb
* [refineCode-code-corpus-meta](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta): the metadata of RefineCode

Detailed information about the data can be found in our [paper](https://arxiv.org/abs/2411.04905).

## sft-stage2 summary

This dataset is used in OpenCoder's Stage 2 and consists of four parts:

* **educational_instruct**: Using the [algorithmic corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus) as a seed, we generated (instruction, code, test case) triples, validated through a Python compiler. Notably, the inclusion of test cases provides a valuable signal for code RL.
* **evol_instruct**: Directly using the open-source version [Magicoder-Evol-Instruct-110K](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K).
* **mceval_instruct**: Directly using the open-source version [McEval-Instruct](https://huggingface.co/datasets/Multilingual-Multimodal-NLP/McEval-Instruct).
* **package_instruct**: We extracted common interface documentation from pydoc and used it as a seed to generate Python package-related questions.
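
The execution-based validation described for **educational_instruct** can be illustrated with a minimal sketch. This is not the authors' pipeline: the helper `passes_tests` and the field names `instruction`, `code`, and `test_case` are hypothetical and may not match the dataset's actual columns. The sketch assumes a triple counts as valid when the code followed by its test case runs to completion in a fresh Python process, and that the same pass/fail outcome could serve as a binary reward for code RL.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Return True if `code` followed by `test_code` exits with status 0.

    Hypothetical helper for illustration only; a real pipeline would run
    untrusted code inside a proper sandbox.
    """
    program = code + "\n\n" + test_code + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat non-terminating programs as failures.
            return False
    return result.returncode == 0


# A made-up triple; the keys are assumptions, not the dataset schema.
triple = {
    "instruction": "Write a function add(a, b) that returns the sum of a and b.",
    "code": "def add(a, b):\n    return a + b",
    "test_case": "assert add(1, 2) == 3\nassert add(-1, 1) == 0",
}

# Binary signal: 1.0 if the tests pass, 0.0 otherwise.
reward = 1.0 if passes_tests(triple["code"], triple["test_case"]) else 0.0
```

Running each candidate in a separate process keeps a crash, syntax error, or infinite loop in generated code from affecting the caller, at the cost of some process start-up overhead per sample.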
## How to use it

```python
from datasets import load_dataset

educational_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage2", "educational_instruct")

evol_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage2", "evol_instruct")

mceval_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage2", "mceval_instruct")

package_instruct = load_dataset("OpenCoder-LLM/opc-sft-stage2", "package_instruct")
```

## Citation Information

Please consider citing our [paper](https://arxiv.org/abs/2411.04905) if you find this dataset useful:

```
@inproceedings{Huang2024OpenCoderTO,
  title = {OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models},
  author = {Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu},
  year = {2024},
  url = {https://arxiv.org/pdf/2411.04905}
}
```

Provider:
maas
Created:
2024-11-13