EleutherAI-proof-pile-2
ModelScope community · Updated 2025-10-20 · Indexed 2024-10-12
Download link:
https://modelscope.cn/datasets/xpengx/EleutherAI-proof-pile-2
Description:
<img src="proofpile_logo.jpg" width="500">
[ArXiv](http://arxiv.org/abs/2310.10631) | [Models](https://huggingface.co/EleutherAI/llemma_34b) | [Data](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | [Code](https://github.com/EleutherAI/math-lm) | [Blog](https://blog.eleuther.ai/llemma/) | [Sample Explorer](https://llemma-demo.github.io/)
[Zhangir Azerbayev](https://zhangir-azerbayev.github.io/), [Hailey Schoelkopf](https://github.com/haileyschoelkopf), [Keiran Paster](https://keirp.com), [Marco Dos Santos](https://github.com/dsantosmarco), [Stephen McAleer](https://www.andrew.cmu.edu/user/smcaleer/), [Albert Q. Jiang](https://albertqjiang.github.io/), [Jia Deng](https://www.cs.princeton.edu/~jiadeng/), [Stella Biderman](https://www.stellabiderman.com/), [Sean Welleck](https://wellecks.com/)
**Proof-Pile-2** is a 55-billion-token dataset of mathematical and scientific documents. It was created to train the [Llemma 7B](https://huggingface.co/EleutherAI/llemma_7b) and [Llemma 34B](https://huggingface.co/EleutherAI/llemma_34b) models and consists of three subsets:
- `arxiv` (29B tokens): The ArXiv subset of [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).
- `open-web-math` (15B tokens): The [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math) dataset, which contains much of the high-quality mathematical text from the internet.
- `algebraic-stack` (11B tokens): A new dataset of mathematical code, including numerical computing, computer algebra, and formal mathematics.
You can download the dataset as follows:
```python
from datasets import load_dataset
ds = load_dataset("EleutherAI/proof-pile-2")
# To load only a specific subset, pass its name as the second argument, e.g.
ds_arxiv = load_dataset("EleutherAI/proof-pile-2", "arxiv")
```
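Because the full corpus is about 55 billion tokens, it can be convenient to inspect a few rows before committing to a full download. The sketch below assumes the `streaming=True` option of `datasets.load_dataset` and a split named `train`. The helper names `stream_subset` and `peek` are illustrative and not part of the dataset's tooling. Depending on the installed `datasets` version, additional arguments may be required.

```python
from itertools import islice


def stream_subset(name="arxiv", split="train"):
    """Open one Proof-Pile-2 subset lazily instead of downloading it in full.

    Assumptions: the split is called "train" and streaming works for this
    dataset with your installed `datasets` version.
    """
    # Imported here so that `peek` below stays usable without `datasets`.
    from datasets import load_dataset

    return load_dataset(
        "EleutherAI/proof-pile-2", name, split=split, streaming=True
    )


def peek(rows, n=3):
    """Return the first `n` rows of any iterable of dataset rows as a list."""
    return list(islice(rows, n))
```

For example, `peek(stream_subset("algebraic-stack"), 2)` would fetch only the first two rows of the `algebraic-stack` subset over the network.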
### Schema
Each dataset row has the following structure:
```python
{
"text": ..., # document text
"meta": ..., # JSON string of metadata, schema specific to data source
}
```
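Since `meta` is stored as a JSON string rather than a nested object, it has to be decoded before its fields can be read. The following is a minimal sketch: `parse_row` is a hypothetical helper, and the metadata keys in the sample row are invented for illustration, because the real keys depend on the data source.

```python
import json


def parse_row(row):
    """Return (text, metadata dict) for one row, decoding `meta` if needed."""
    meta = row["meta"]
    if isinstance(meta, str):
        # An empty string is treated as "no metadata" (an assumption).
        meta = json.loads(meta) if meta else {}
    return row["text"], meta


# Hypothetical row: the keys inside `meta` are made up for this example.
example_row = {
    "text": "def gcd(a, b):\n    return a if b == 0 else gcd(b, a % b)\n",
    "meta": json.dumps({"source": "example", "lang": "Python"}),
}

text, meta = parse_row(example_row)
print(meta["lang"])
```

Decoding per row keeps the raw rows unchanged, which matters because the metadata schema differs between the three subsets.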
### Dataset Contents
For detailed documentation of the ArXiv and web subsets, refer to [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) and [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math). The following table enumerates the contents of the AlgebraicStack by programming language. The AlgebraicStack is filtered to only include documents that contain mathematics, as judged by hand-crafted, language-specific heuristics.
| Language | AlgebraicStack tokens |
|-----------|-----------------------|
| Agda | 35.2 M |
| C | 25.1 M |
| C++ | 954.1 M |
| Coq | 281.9 M |
| Fortran | 724.9 M |
| GAP | 3.6 M |
| Haskell | 9.1 M |
| Idris | 10.9 M |
| Isabelle | 1,089.7 M |
| Julia | 531.0 M |
| Jupyter | 199.1 M |
| Lean | 285.6 M |
| Maple | 2.0 M |
| Matlab | 65.8 M |
| Python | 6,098.8 M |
| R | 71.3 M |
| TeX | 567.7 M |
| **Total** | **10,955.7 M** |
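
To put the table in proportion, the per-language counts can be turned into shares of the AlgebraicStack total. The sketch below copies the numbers from the table above; the variable names are illustrative. Summing the rounded rows gives a value that may differ from the stated total in the last digit, which is consistent with rounding.

```python
# Token counts in millions, copied from the AlgebraicStack table.
ALGEBRAIC_STACK_TOKENS_M = {
    "Agda": 35.2, "C": 25.1, "C++": 954.1, "Coq": 281.9,
    "Fortran": 724.9, "GAP": 3.6, "Haskell": 9.1, "Idris": 10.9,
    "Isabelle": 1089.7, "Julia": 531.0, "Jupyter": 199.1, "Lean": 285.6,
    "Maple": 2.0, "Matlab": 65.8, "Python": 6098.8, "R": 71.3,
    "Tex": 567.7,
}

# Total over all languages, in millions of tokens.
total_m = sum(ALGEBRAIC_STACK_TOKENS_M.values())

# Fraction of the AlgebraicStack contributed by each language.
shares = {lang: n / total_m for lang, n in ALGEBRAIC_STACK_TOKENS_M.items()}

# The three largest languages by token count.
top3 = sorted(ALGEBRAIC_STACK_TOKENS_M.items(), key=lambda kv: kv[1], reverse=True)[:3]

for lang, n in top3:
    print(f"{lang}: {n:.1f} M tokens ({shares[lang]:.1%} of AlgebraicStack)")
```

Python accounts for more than half of the subset, followed by Isabelle and C++.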
### License
We do not alter the license of any of the underlying data.
### Version History
**v1.1.0**: Contains an updated version of OpenWebMath, namely the one available at [open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math). This version of OpenWebMath has slightly improved filtering, such as the removal of very short documents.
**v1.0.0**: The data used to train the [Llemma 7B](https://huggingface.co/EleutherAI/llemma_7b) and [Llemma 34B](https://huggingface.co/EleutherAI/llemma_34b). Uses a development version of OpenWebMath.
### Citation
For the entire Proof-Pile-2, cite:
```
@misc{azerbayev2023llemma,
title={Llemma: An Open Language Model For Mathematics},
author={Zhangir Azerbayev and Hailey Schoelkopf and Keiran Paster and Marco Dos Santos and Stephen McAleer and Albert Q. Jiang and Jia Deng and Stella Biderman and Sean Welleck},
year={2023},
eprint={2310.10631},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
For the ArXiv subset, cite:
```
@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = April,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}
```
For OpenWebMath, cite:
```
@misc{paster2023openwebmath,
title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
year={2023},
eprint={2310.06786},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
```
Provider: maas
Created: 2024-09-30



