proof-pile-2

Name: proof-pile-2
Creator: maas
Published: 2025-12-05 11:51:49
License: No description available

ModelScope community. Updated 2025-12-05; indexed 2025-09-13.

Download link:

https://modelscope.cn/datasets/EleutherAI/proof-pile-2

Resource description:

<img src="proofpile_logo.jpg" width="500">

[ArXiv](http://arxiv.org/abs/2310.10631) | [Models](https://huggingface.co/EleutherAI/llemma_34b) | [Data](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | [Code](https://github.com/EleutherAI/math-lm) | [Blog](https://blog.eleuther.ai/llemma/) | [Sample Explorer](https://llemma-demo.github.io/)

[Zhangir Azerbayev](https://zhangir-azerbayev.github.io/), [Hailey Schoelkopf](https://github.com/haileyschoelkopf), [Keiran Paster](https://keirp.com), [Marco Dos Santos](https://github.com/dsantosmarco), [Stephen McAleer](https://www.andrew.cmu.edu/user/smcaleer/), [Albert Q. Jiang](https://albertqjiang.github.io/), [Jia Deng](https://www.cs.princeton.edu/~jiadeng/), [Stella Biderman](https://www.stellabiderman.com/), [Sean Welleck](https://wellecks.com/)

The **Proof-Pile-2** is a 55 billion token dataset of mathematical and scientific documents. This dataset was created in order to train the [Llemma 7B](https://huggingface.co/EleutherAI/llemma_7b) and [Llemma 34B](https://huggingface.co/EleutherAI/llemma_34b) models. It consists of three subsets:

- `arxiv` (29B tokens): the ArXiv subset of [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)
- `open-web-math` (15B tokens): The [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math) dataset, which contains much of the high-quality mathematical text from the internet.
- `algebraic-stack` (11B tokens): A new dataset of mathematical code, including numerical computing, computer algebra, and formal mathematics.

You can download the dataset as follows:

```python
from datasets import load_dataset

ds = load_dataset("EleutherAI/proof-pile-2")

# To load only a specific subset, pass it as an argument, e.g.
ds_arxiv = load_dataset("EleutherAI/proof-pile-2", "arxiv")
```

### Schema

Each dataset row has the following structure:

```python
{
    "text": ...,  # document text
    "meta": ...,  # JSON string of metadata, schema specific to data source
}
```

### Dataset Contents

For detailed documentation of the ArXiv and web subsets, refer to [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) and [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math). The following table enumerates the contents of the AlgebraicStack by programming language. The AlgebraicStack is filtered to only include documents that contain mathematics, as judged by hand-crafted, language-specific heuristics.

| Language  | AlgebraicStack tokens |
|-----------|-----------------------|
| Agda      | 35.2 M                |
| C         | 25.1 M                |
| C++       | 954.1 M               |
| Coq       | 281.9 M               |
| Fortran   | 724.9 M               |
| GAP       | 3.6 M                 |
| Haskell   | 9.1 M                 |
| Idris     | 10.9 M                |
| Isabelle  | 1,089.7 M             |
| Julia     | 531.0 M               |
| Jupyter   | 199.1 M               |
| Lean      | 285.6 M               |
| Maple     | 2.0 M                 |
| Matlab    | 65.8 M                |
| Python    | 6,098.8 M             |
| R         | 71.3 M                |
| Tex       | 567.7 M               |
| **Total** | **10,955.7 M**        |

### License

We do not alter the license of any of the underlying data.

### Version History

**v1.1.0**: Contains an updated version of OpenWebMath, precisely the one available at [open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math). This version of OpenWebMath has slightly improved filtering, for example, removal of very short documents.

**v1.0.0**: The data used to train the [Llemma 7B](https://huggingface.co/EleutherAI/llemma_7b) and [Llemma 34B](https://huggingface.co/EleutherAI/llemma_34b). Uses a development version of OpenWebMath.

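
Returning to the row schema described above: because `meta` is stored as a JSON string rather than a nested object, it has to be decoded before its fields can be read. The following is a minimal sketch of that step, not part of the official tooling. The helper name `decode_row`, the sample row, and its metadata keys (`source`, `lang`) are hypothetical and chosen only for illustration; the real keys depend on the data source.

```python
import json
from typing import Any, Dict


def decode_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a row with its `meta` JSON string parsed into a dict."""
    meta = row.get("meta")
    if isinstance(meta, str) and meta:
        # Documented case: `meta` is a JSON string.
        parsed = json.loads(meta)
    elif isinstance(meta, dict):
        # Defensive assumption: accept metadata that is already decoded.
        parsed = meta
    else:
        # Missing or empty metadata becomes an empty dict.
        parsed = {}
    return {"text": row["text"], "meta": parsed}


# Hypothetical row for illustration; real metadata keys differ per subset.
sample_row = {
    "text": "theorem add_comm (a b : Nat) : a + b = b + a := ...",
    "meta": json.dumps({"source": "example", "lang": "Lean"}),
}

decoded = decode_row(sample_row)
print(decoded["meta"]["lang"])  # prints: Lean
```

The same helper could be applied to each row obtained from `load_dataset`, for instance to group documents by whatever metadata field a given subset provides.
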
### Citation

For the entire Proof-Pile-2, cite

```
@misc{azerbayev2023llemma,
      title={Llemma: An Open Language Model For Mathematics},
      author={Zhangir Azerbayev and Hailey Schoelkopf and Keiran Paster and Marco Dos Santos and Stephen McAleer and Albert Q. Jiang and Jia Deng and Stella Biderman and Sean Welleck},
      year={2023},
      eprint={2310.10631},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

For the ArXiv subset, cite

```
@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = April,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}
```

For OpenWebMath, cite

```
@misc{paster2023openwebmath,
      title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
      author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
      year={2023},
      eprint={2310.06786},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```


Provider:

maas

Created:

2025-08-16
