OpenCoder-LLM/opc-sft-stage1
Source: Hugging Face. Updated: 2024-11-24. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/OpenCoder-LLM/opc-sft-stage1
Resource overview:
The OpenCoder dataset consists of several components: opc-sft-stage1, opc-sft-stage2, opc-annealing-corpus, opc-fineweb-code-corpus, opc-fineweb-math-corpus, and refineCode-code-corpus-meta. The sft-stage1 dataset is used in the first stage of OpenCoder and comprises three parts. Filtered_infinity_instruct contains the code-related content filtered out of infinity_instruct. Realuser_instruct contains bilingual code-related instructions extracted from GPT conversation histories. Largescale_diverse_instruct contains diverse code-related instructions generated by a pipeline seeded with CommonCrawl data and source code. These datasets aim to improve the practical performance of code large language models.
Provider:
OpenCoder-LLM
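
Loading example:
The three parts described in the overview can probably be loaded one at a time with the Hugging Face `datasets` library. The following is a minimal sketch, not an official recipe. It assumes that each part is exposed as a dataset configuration named with the lowercase form of the part name, and that a `train` split exists; neither is stated in this listing. `load_subset` and `ASSUMED_SUBSETS` are hypothetical names introduced for illustration.

```python
REPO_ID = "OpenCoder-LLM/opc-sft-stage1"

# Assumption: configuration names are the lowercase forms of the three
# part names given in the overview. Check the dataset card to confirm.
ASSUMED_SUBSETS = (
    "filtered_infinity_instruct",
    "realuser_instruct",
    "largescale_diverse_instruct",
)


def load_subset(name, split="train"):
    """Load one part of opc-sft-stage1 (hypothetical helper).

    The default split "train" is an assumption, not taken from the listing.
    """
    if name not in ASSUMED_SUBSETS:
        raise ValueError(
            f"unknown subset {name!r}; expected one of {ASSUMED_SUBSETS}"
        )
    # Imported lazily so this module imports cleanly even when the
    # `datasets` package is not installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split=split)
```

A call such as `load_subset("realuser_instruct")` would download that part on first use. To fetch through the mirror given in the download link rather than the default hub, setting the `HF_ENDPOINT` environment variable to `https://hf-mirror.com` before importing `datasets` is the usual approach, though this sketch does not verify that the mirror is reachable.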



