
fineweb-math-corpus

ModelScope Community. Updated 2025-07-09, indexed 2024-11-23.
Download link:
https://modelscope.cn/datasets/infly/fineweb-math-corpus
Resource description:
![image](https://github.com/user-attachments/assets/66e5afec-060d-43c0-937e-dd7b6b1a26ef)

# OpenCoder Dataset

The OpenCoder dataset is composed of the following datasets:

* [opc-sft-stage1](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1): the SFT data used for OpenCoder SFT stage 1
* [opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2): the SFT data used for OpenCoder SFT stage 2
* [opc-annealing-corpus](https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus): the synthetic data and algorithmic corpus used for OpenCoder annealing
* [opc-fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus): the code-related pages recalled from FineWeb
* [opc-fineweb-math-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-math-corpus): the math-related pages recalled from FineWeb **<-- you are here**
* [refineCode-code-corpus-meta](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta): the metadata of RefineCode

Detailed information about the data can be found in our [paper](https://arxiv.org/abs/2411.04905).

## opc-fineweb-math-corpus summary

This math-related data from [FineWeb](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) was used specifically in [OpenCoder](https://huggingface.co/papers/2411.04905) pre-training. We employ fastText in three iterative rounds to recall a final dataset of 55B code- and math-related data. You can find the code-related data at [OpenCoder-LLM/fineweb-code-corpus](https://huggingface.co/datasets/OpenCoder-LLM/fineweb-code-corpus).

*This work belongs to [INF](https://www.infly.cn/).*

## Citation Information

Please consider citing our [paper](https://arxiv.org/abs/2411.04905) if you find this dataset useful:

```
@inproceedings{Huang2024OpenCoderTO,
  title = {OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models},
  author = {Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu},
  year = {2024},
  url = {https://arxiv.org/pdf/2411.04905}
}
```
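## Illustrative sketch: iterative recall

The summary above mentions that the corpus was recalled with fastText over three iterative rounds, but this page does not describe the pipeline itself. The sketch below is only a rough illustration of what an iterative recall loop can look like. It is not the authors' implementation: the word-count scorer is a toy stand-in for a fastText classifier, and the names `train_scorer` and `iterative_recall`, the seed sets, and the `threshold` value are all hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def train_scorer(
    positives: Iterable[str], negatives: Iterable[str]
) -> Callable[[str], float]:
    """Toy stand-in for a fastText classifier (assumption, not the real model).

    A token counts as "math-like" if it appears more often in the positive
    examples than in the negative ones. A page's score is the fraction of
    its tokens that are math-like.
    """
    pos = Counter(tok for doc in positives for tok in doc.lower().split())
    neg = Counter(tok for doc in negatives for tok in doc.lower().split())

    def score(text: str) -> float:
        tokens = text.lower().split()
        if not tokens:
            return 0.0
        hits = sum(1 for tok in tokens if pos[tok] > neg[tok])
        return hits / len(tokens)

    return score


def iterative_recall(
    seed_positives: List[str],
    seed_negatives: List[str],
    pool: List[str],
    rounds: int = 3,          # the summary mentions three rounds
    threshold: float = 0.5,   # hypothetical cut-off for this toy scorer
) -> List[str]:
    """Recall pages from `pool` over several rounds.

    Each round retrains the scorer on the seeds plus everything recalled
    so far, then adds the newly accepted pages to the positive set.
    """
    positives = list(seed_positives)
    recalled: Set[str] = set()
    for _ in range(rounds):
        score = train_scorer(positives, seed_negatives)
        new_pages = [
            doc for doc in pool
            if doc not in recalled and score(doc) >= threshold
        ]
        if not new_pages:
            break  # nothing new was recalled, so further rounds change nothing
        recalled.update(new_pages)
        positives.extend(new_pages)
    # Return the recalled pages in their original pool order.
    return [doc for doc in pool if doc in recalled]
```

In this toy setting, a page that shares no vocabulary with the seed examples can still be picked up in a later round once earlier recalls have widened the positive vocabulary, which is the general motivation for running more than one round.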

Provider: maas
Created: 2024-11-22