
proof-pile-2

ModelScope Community, updated 2025-12-05, indexed 2025-09-13
Download link:
https://modelscope.cn/datasets/EleutherAI/proof-pile-2
Description:
<img src="proofpile_logo.jpg" width="500">

[ArXiv](http://arxiv.org/abs/2310.10631) | [Models](https://huggingface.co/EleutherAI/llemma_34b) | [Data](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | [Code](https://github.com/EleutherAI/math-lm) | [Blog](https://blog.eleuther.ai/llemma/) | [Sample Explorer](https://llemma-demo.github.io/)

[Zhangir Azerbayev](https://zhangir-azerbayev.github.io/), [Hailey Schoelkopf](https://github.com/haileyschoelkopf), [Keiran Paster](https://keirp.com), [Marco Dos Santos](https://github.com/dsantosmarco), [Stephen McAleer](https://www.andrew.cmu.edu/user/smcaleer/), [Albert Q. Jiang](https://albertqjiang.github.io/), [Jia Deng](https://www.cs.princeton.edu/~jiadeng/), [Stella Biderman](https://www.stellabiderman.com/), [Sean Welleck](https://wellecks.com/)

The **Proof-Pile-2** is a 55 billion token dataset of mathematical and scientific documents. This dataset was created in order to train the [Llemma 7B](https://huggingface.co/EleutherAI/llemma_7b) and [Llemma 34B](https://huggingface.co/EleutherAI/llemma_34b) models. It consists of three subsets:

- `arxiv` (29B tokens): the ArXiv subset of [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)
- `open-web-math` (15B tokens): The [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math) dataset, which contains much of the high-quality mathematical text from the internet.
- `algebraic-stack` (11B tokens): A new dataset of mathematical code, including numerical computing, computer algebra, and formal mathematics.

You can download the dataset as follows:

```python
from datasets import load_dataset

ds = load_dataset("EleutherAI/proof-pile-2")

# To load only a specific subset, pass it as an argument, e.g.
ds_arxiv = load_dataset("EleutherAI/proof-pile-2", "arxiv")
```

### Schema

Each dataset row has the following structure:

```python
{
    "text": ...,  # document text
    "meta": ...,  # JSON string of metadata, schema specific to data source
}
```

### Dataset Contents

For detailed documentation of the ArXiv and web subsets, refer to [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) and [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math). The following table enumerates the contents of the AlgebraicStack by programming language. The AlgebraicStack is filtered to only include documents that contain mathematics, as judged by hand-crafted, language-specific heuristics.

| Language  | AlgebraicStack tokens |
|-----------|-----------------------|
| Agda      | 35.2 M                |
| C         | 25.1 M                |
| C++       | 954.1 M               |
| Coq       | 281.9 M               |
| Fortran   | 724.9 M               |
| GAP       | 3.6 M                 |
| Haskell   | 9.1 M                 |
| Idris     | 10.9 M                |
| Isabelle  | 1,089.7 M             |
| Julia     | 531.0 M               |
| Jupyter   | 199.1 M               |
| Lean      | 285.6 M               |
| Maple     | 2.0 M                 |
| Matlab    | 65.8 M                |
| Python    | 6,098.8 M             |
| R         | 71.3 M                |
| Tex       | 567.7 M               |
| **Total** | **10,955.7 M**        |

### License

We do not alter the license of any of the underlying data.

### Version History

**v1.1.0**: Contains an updated version of OpenWebMath, precisely the one available at [open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math). This version of OpenWebMath has slightly improved filtering, for example, removal of very short documents.

**v1.0.0**: The data used to train the [Llemma 7B](https://huggingface.co/EleutherAI/llemma_7b) and [Llemma 34B](https://huggingface.co/EleutherAI/llemma_34b). Uses a development version of OpenWebMath.
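
Because the Schema section describes `meta` as a JSON string whose layout depends on the data source, a small decoding helper can make rows easier to work with. The following is a minimal sketch, not part of the official tooling: the function name `parse_row`, the fallback behaviour for undecodable metadata, and the keys in the example row are all assumptions made for illustration.

```python
import json
from typing import Any, Dict


def parse_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a row with `meta` decoded from its JSON string."""
    meta = row.get("meta")
    # The dataset card says `meta` is a JSON string. Tolerating rows where it
    # is already decoded or is not valid JSON is an assumption of this sketch.
    if isinstance(meta, str):
        try:
            meta = json.loads(meta)
        except json.JSONDecodeError:
            meta = {"raw": meta}
    return {"text": row["text"], "meta": meta}


# Hypothetical row for illustration only; real metadata keys differ by subset.
example_row = {
    "text": "theorem add_comm (a b : Nat) : a + b = b + a := by omega",
    "meta": json.dumps({"source": "example", "lang": "lean"}),
}

parsed = parse_row(example_row)
print(parsed["meta"]["lang"])
```

When working with the real data, the same helper could be called on each row while iterating over a loaded split; the metadata keys you find there will depend on the subset.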

### Citation

For the entire Proof-Pile-2, cite

```
@misc{azerbayev2023llemma,
      title={Llemma: An Open Language Model For Mathematics},
      author={Zhangir Azerbayev and Hailey Schoelkopf and Keiran Paster and Marco Dos Santos and Stephen McAleer and Albert Q. Jiang and Jia Deng and Stella Biderman and Sean Welleck},
      year={2023},
      eprint={2310.10631},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

For the ArXiv subset, cite

```
@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = April,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}
```

For OpenWebMath, cite

```
@misc{paster2023openwebmath,
      title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
      author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
      year={2023},
      eprint={2310.06786},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```

Provider:
maas
Created:
2025-08-16