proof-pile-2
Source: ModelScope community (updated 2025-12-05, indexed 2025-09-13)
Download link: https://modelscope.cn/datasets/EleutherAI/proof-pile-2
<img src="proofpile_logo.jpg" width="500">
[ArXiv](http://arxiv.org/abs/2310.10631) | [Models](https://huggingface.co/EleutherAI/llemma_34b) | [Data](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | [Code](https://github.com/EleutherAI/math-lm) | [Blog](https://blog.eleuther.ai/llemma/) | [Sample Explorer](https://llemma-demo.github.io/)
[Zhangir Azerbayev](https://zhangir-azerbayev.github.io/), [Hailey Schoelkopf](https://github.com/haileyschoelkopf), [Keiran Paster](https://keirp.com), [Marco Dos Santos](https://github.com/dsantosmarco), [Stephen McAleer](https://www.andrew.cmu.edu/user/smcaleer/), [Albert Q. Jiang](https://albertqjiang.github.io/), [Jia Deng](https://www.cs.princeton.edu/~jiadeng/), [Stella Biderman](https://www.stellabiderman.com/), [Sean Welleck](https://wellecks.com/)
**Proof-Pile-2** is a 55-billion-token dataset of mathematical and scientific documents. It was created to train the [Llemma 7B](https://huggingface.co/EleutherAI/llemma_7b) and [Llemma 34B](https://huggingface.co/EleutherAI/llemma_34b) models. It consists of three subsets:
- `arxiv` (29B tokens): the ArXiv subset of [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)
- `open-web-math` (15B tokens): The [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math) dataset, which contains much of the high-quality mathematical text from the internet.
- `algebraic-stack` (11B tokens): A new dataset of mathematical code, including numerical computing, computer algebra, and formal mathematics.
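
As a quick sanity check on the figures above, the three subset sizes add up to the stated 55 billion tokens. The sketch below uses only the rounded counts from the list; the variable names are illustrative and are not part of the dataset's tooling.

```python
# Rounded token counts (in billions) as listed above.
subset_tokens_b = {
    "arxiv": 29,
    "open-web-math": 15,
    "algebraic-stack": 11,
}

total_b = sum(subset_tokens_b.values())

# Fraction of the whole dataset contributed by each subset.
shares = {name: count / total_b for name, count in subset_tokens_b.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total: {total_b}B tokens")
```
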
You can download the dataset as follows:
```python
from datasets import load_dataset
ds = load_dataset("EleutherAI/proof-pile-2")
# To load only a specific subset, pass its name as the second argument, e.g.
ds_arxiv = load_dataset("EleutherAI/proof-pile-2", "arxiv")
```
### Schema
Each dataset row has the following structure:
```python
{
"text": ..., # document text
"meta": ..., # JSON string of metadata, schema specific to data source
}
```
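
Because `meta` is stored as a JSON string rather than a nested object, it has to be decoded before its fields can be read. The following is a minimal sketch, assuming a made-up row: the `parse_row` helper and the contents of `example_row` are hypothetical, and the real metadata keys differ by data source.

```python
import json


def parse_row(row):
    """Return (text, metadata dict) for a row whose `meta` field is a JSON string."""
    meta = row["meta"]
    # Decode only if `meta` is still a string; treat an empty string as no metadata.
    if isinstance(meta, str):
        meta = json.loads(meta) if meta else {}
    return row["text"], meta


# Hypothetical row for illustration; actual metadata keys depend on the subset.
example_row = {
    "text": "theorem add_zero (n : Nat) : n + 0 = n := rfl",
    "meta": json.dumps({"source": "hypothetical-example"}),
}

text, meta = parse_row(example_row)
print(meta["source"])
```

Decoding lazily, per row, avoids paying the JSON parsing cost for rows whose metadata is never inspected.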
### Dataset Contents
For detailed documentation of the ArXiv and web subsets, refer to [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) and [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math). The following table enumerates the contents of the AlgebraicStack by programming language. The AlgebraicStack is filtered to only include documents that contain mathematics, as judged by hand-crafted, language-specific heuristics.
| Language | AlgebraicStack tokens |
|-----------|-----------------------|
| Agda | 35.2 M |
| C | 25.1 M |
| C++ | 954.1 M |
| Coq | 281.9 M |
| Fortran | 724.9 M |
| GAP | 3.6 M |
| Haskell | 9.1 M |
| Idris | 10.9 M |
| Isabelle | 1,089.7 M |
| Julia | 531.0 M |
| Jupyter | 199.1 M |
| Lean | 285.6 M |
| Maple | 2.0 M |
| Matlab | 65.8 M |
| Python | 6,098.8 M |
| R | 71.3 M |
| TeX       | 567.7 M               |
| **Total** | **10,955.7 M** |
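
The per-language figures can be cross-checked against the reported total. The sketch below copies the values from the table above; the dictionary and variable names are illustrative. The rows sum to within about 0.1 M of the stated total, which is consistent with each row being rounded to one decimal place.

```python
# Token counts in millions, copied from the table above.
algebraic_stack_m = {
    "Agda": 35.2, "C": 25.1, "C++": 954.1, "Coq": 281.9,
    "Fortran": 724.9, "GAP": 3.6, "Haskell": 9.1, "Idris": 10.9,
    "Isabelle": 1089.7, "Julia": 531.0, "Jupyter": 199.1, "Lean": 285.6,
    "Maple": 2.0, "Matlab": 65.8, "Python": 6098.8, "R": 71.3,
    "Tex": 567.7,
}
reported_total_m = 10955.7

computed_total_m = sum(algebraic_stack_m.values())
# Rows are rounded to one decimal place, so a small discrepancy is expected.
difference_m = abs(computed_total_m - reported_total_m)

# Language contributing the most tokens, and its share of the computed total.
largest = max(algebraic_stack_m, key=algebraic_stack_m.get)
largest_share = algebraic_stack_m[largest] / computed_total_m

print(f"computed total: {computed_total_m:.1f} M (reported: {reported_total_m} M)")
print(f"largest language: {largest} ({largest_share:.1%} of tokens)")
```
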
### License
We do not alter the license of any of the underlying data.
### Version History
**v1.1.0**: Contains an updated version of OpenWebMath, identical to the one available at [open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math). This version has slightly improved filtering, for example the removal of very short documents.
**v1.0.0**: The data used to train the [Llemma 7B](https://huggingface.co/EleutherAI/llemma_7b) and [Llemma 34B](https://huggingface.co/EleutherAI/llemma_34b). Uses a development version of OpenWebMath.
### Citation
For the entire Proof-Pile-2, cite:
```
@misc{azerbayev2023llemma,
title={Llemma: An Open Language Model For Mathematics},
author={Zhangir Azerbayev and Hailey Schoelkopf and Keiran Paster and Marco Dos Santos and Stephen McAleer and Albert Q. Jiang and Jia Deng and Stella Biderman and Sean Welleck},
year={2023},
eprint={2310.10631},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
For the ArXiv subset, cite:
```
@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = apr,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}
```
For OpenWebMath, cite:
```
@misc{paster2023openwebmath,
title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
year={2023},
eprint={2310.06786},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
```
Provider: maas
Created: 2025-08-16



