opc-annealing-corpus
Source: ModelScope community. Updated 2026-01-02; indexed 2024-11-16.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/opc-annealing-corpus
Resource description:

# OpenCoder Dataset
The OpenCoder dataset is composed of the following datasets:
* [opc-sft-stage1](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1): the SFT data used for OpenCoder SFT stage 1
* [opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2): the SFT data used for OpenCoder SFT stage 2
* [opc-annealing-corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus): the synthetic data and algorithmic corpus used for OpenCoder annealing **<-- you are here**
* [fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus): code-related pages recalled from FineWeb
* [fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus): math-related pages recalled from FineWeb
* [refineCode-code-corpus-meta](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta): the metadata of RefineCode
Detailed information about the data can be found in our [paper](https://arxiv.org/abs/2411.04905).
## opc-annealing-corpus summary
This corpus is additional data, beyond the original data distribution, that was incorporated into OpenCoder during the annealing phase. It has three parts:
- **algorithmic_corpus**: Algorithm-related code sampled from *[The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2)*.
- **synthetic_code_snippet**: High-quality code snippets generated by rewriting seed samples from *algorithmic_corpus*.
- **synthetic_qa**: High-quality Q&A pairs generated by adapting seed samples from *algorithmic_corpus*.
Our ablation experiments validated the effectiveness of this synthetic data.
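
The three parts listed above can be loaded one at a time. The following is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` library is installed, that each part is exposed as a dataset configuration with the same name as in the list above, and that the data lives in a `train` split. `load_subset` is a hypothetical helper introduced only for this example.

```python
# Sketch: load one part of opc-annealing-corpus from the Hugging Face Hub.
# Assumptions (not confirmed by this card): the configuration names match
# the part names listed above, and each configuration has a "train" split.

REPO_ID = "OpenCoder-LLM/opc-annealing-corpus"
SUBSETS = ("algorithmic_corpus", "synthetic_code_snippet", "synthetic_qa")


def load_subset(name, streaming=True):
    """Return one part of the corpus (hypothetical helper).

    With streaming=True the records are fetched lazily instead of
    downloading the whole part up front.
    """
    if name not in SUBSETS:
        raise ValueError(f"unknown subset {name!r}; expected one of {SUBSETS}")
    # Imported lazily so that name validation works without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split="train", streaming=streaming)
```

For example, `next(iter(load_subset("synthetic_qa")))` would return the first record of that part, provided the assumptions above hold. Streaming is the default here so that a quick inspection does not require downloading the full part.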

## Citation Information
Please consider citing our [paper](https://arxiv.org/abs/2411.04905) if you find this dataset useful:
```
@inproceedings{Huang2024OpenCoderTO,
title = {OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models},
author = {Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu},
year = {2024},
url = {https://arxiv.org/pdf/2411.04905}
}
```

Provider: maas
Created: 2024-11-15



