fineweb-math-corpus
ModelScope community · updated 2025-07-09 · indexed 2024-11-23
Download link:
https://modelscope.cn/datasets/infly/fineweb-math-corpus
Description:

# OpenCoder Dataset
The OpenCoder dataset is composed of the following datasets:
* [opc-sft-stage1](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1): the SFT data used for OpenCoder SFT stage 1
* [opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2): the SFT data used for OpenCoder SFT stage 2
* [opc-annealing-corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus): the synthetic data and algorithmic corpus used for OpenCoder annealing
* [opc-fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus): code-related pages recalled from FineWeb
* [opc-fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus): math-related pages recalled from FineWeb **<-- you are here**
* [refineCode-code-corpus-meta](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta): the metadata of RefineCode
Detailed information about the data can be found in our [paper](https://arxiv.org/abs/2411.04905).
## opc-fineweb-math-corpus summary
This math-related data, recalled from [FineWeb](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1), was used in [OpenCoder](https://huggingface.co/papers/2411.04905) pre-training.
We applied fastText over three iterative rounds to recall a final dataset of 55B code- and math-related data.
You can find code-related data at [OpenCoder-LLM/fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus).
*This work belongs to [INF](https://www.infly.cn/).*
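
The summary above only states that fastText was applied in three iterative rounds; it does not give the seed data, thresholds, or training settings. The following is a minimal sketch of what such an iterative recall loop could look like, under stated assumptions: the seed pages, the `threshold` value, and all function names are hypothetical, and a tiny smoothed log-odds scorer stands in for the fastText classifier so that the example runs with the standard library only. In a real pipeline, a model trained with something like `fasttext.train_supervised` would presumably take the place of `train` and `score`.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization; a real pipeline would use a proper tokenizer.
    return text.lower().split()


def train(positives, negatives):
    """Fit per-token smoothed log-odds (stand-in for a fastText classifier)."""
    pos = Counter(t for doc in positives for t in tokenize(doc))
    neg = Counter(t for doc in negatives for t in tokenize(doc))
    vocab = set(pos) | set(neg)
    n_pos, n_neg, v = sum(pos.values()), sum(neg.values()), len(vocab)
    return {
        t: math.log((pos[t] + 1) / (n_pos + v))
        - math.log((neg[t] + 1) / (n_neg + v))
        for t in vocab
    }


def score(model, text):
    """Mean log-odds over tokens; unknown tokens contribute 0."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(model.get(t, 0.0) for t in tokens) / len(tokens)


def iterative_recall(corpus, seed_pos, seed_neg, rounds=3, threshold=0.1):
    """Retrain on seeds plus recalled pages each round, then recall again."""
    positives = list(seed_pos)
    recalled, seen = [], set()
    for _ in range(rounds):
        model = train(positives, seed_neg)
        newly = [
            doc for doc in corpus
            if doc not in seen and score(model, doc) > threshold
        ]
        if not newly:  # nothing new crossed the threshold; stop early
            break
        recalled.extend(newly)
        seen.update(newly)
        positives.extend(newly)  # recalled pages become training positives
    return recalled


# Hypothetical toy data, for illustration only.
SEED_POS = ["solve the equation for x", "prove the theorem by induction"]
SEED_NEG = ["best pasta recipe with tomato sauce",
            "football match results and scores"]
CORPUS = [
    "the equation has an integer solution",
    "integer solution of a polynomial",
    "tomato sauce recipe for pasta",
]

if __name__ == "__main__":
    print(iterative_recall(CORPUS, SEED_POS, SEED_NEG, rounds=1))
    print(iterative_recall(CORPUS, SEED_POS, SEED_NEG, rounds=3))
```

On this toy corpus a single round recalls only the first page, because the second page shares no tokens with the seeds. Once the first page is added to the positives, the second round picks up the second page as well, which is the motivation for iterating rather than classifying once.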
## Citation Information
Please consider citing our [paper](https://arxiv.org/abs/2411.04905) if you find this dataset useful:
```
@inproceedings{Huang2024OpenCoderTO,
title = {OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models},
author = {Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu},
year = {2024},
url = {https://arxiv.org/pdf/2411.04905}
}
```

Provider: maas
Created: 2024-11-22



