five

fineweb-code-corpus

收藏
魔搭社区2025-12-04 更新2024-11-16 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/fineweb-code-corpus
下载链接
链接失效反馈
官方服务:
资源简介:
![image](https://github.com/user-attachments/assets/66e5afec-060d-43c0-937e-dd7b6b1a26ef) # OpenCoder Dataset The OpenCoder dataset is composed of the following datasets: * [opc-sft-stage1](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1): the sft data used for opencoder sft-stage1 * [opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2): the sft data used for opencoder sft-stage2 * [opc-annealing-corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus): the synthetic data & algorithmic corpus used for opencoder annealing * [opc-fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus): the code-related page recalled from fineweb **<-- you are here** * [opc-fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus): the math-related page recalled from fineweb * [refineCode-code-corpus-meta](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta): the meta-data of RefineCode Detailed information about the data can be found in our [paper](https://arxiv.org/abs/2411.04905). ## opc-fineweb-code-corpus This code-related data from [Fineweb](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) was specifically used in [OpenCoder](https://huggingface.co/papers/2411.04905) pre-training. We employ fastText in three iterative rounds to recall a final dataset of 55B code and math-related data. You can find math-related data at [OpenCoder-LLM/fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus). *This work belongs to [INF](https://www.infly.cn/).* ## Citation Information Please consider citing our [paper](https://arxiv.org/abs/2411.04905) if you find this dataset useful: ``` @inproceedings{Huang2024OpenCoderTO, title = {OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models}, author = {Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu}, year = {2024}, url = {https://arxiv.org/pdf/2411.04905} } ```

![image](https://github.com/user-attachments/assets/66e5afec-060d-43c0-937e-dd7b6b1a26ef) # OpenCoder 数据集 OpenCoder数据集由以下数据集组成: * [opc-sft-stage1](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1):用于opencoder SFT阶段1的监督微调(Supervised Fine-Tuning,SFT)数据 * [opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2):用于opencoder SFT阶段2的监督微调数据 * [opc-annealing-corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus):用于opencoder退火预训练的合成数据与算法语料库 * [opc-fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus):从FineWeb中召回的代码相关页面 **<-- 你当前位于此处** * [opc-fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus):从FineWeb中召回的数学相关页面 * [refineCode-code-corpus-meta](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta):RefineCode的元数据 有关该数据集的详细信息可参阅我们的[研究论文](https://arxiv.org/abs/2411.04905)。 ## opc-fineweb-code-corpus 该代码相关数据源自[FineWeb](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1),专门用于OpenCoder的预训练流程。我们通过三轮迭代流程使用fastText模型,最终召回得到规模达550亿Token的代码与数学相关数据集。数学相关数据集可于[OpenCoder-LLM/fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus)获取。 *本项目隶属于[INF](https://www.infly.cn/).* ## 引用信息 若您认为本数据集对您的研究有所帮助,请引用我们的[研究论文](https://arxiv.org/abs/2411.04905): @inproceedings{Huang2024OpenCoderTO, title = {OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models}, author = {Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu}, year = {2024}, url = {https://arxiv.org/pdf/2411.04905} }
提供机构:
maas
创建时间:
2024-11-15
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作