MathPile

Name: MathPile
Creator: maas
Published: 2026-01-09 14:29:59
License: No description available

ModelScope community. Updated 2026-01-09; indexed 2025-02-15.

Download link:

https://modelscope.cn/datasets/GAIR/MathPile

Resource description:

<br>

**🔥Update**:

- [2023/01/06] We released the commercial-use version of MathPile, namely [MathPile_Commercial](https://huggingface.co/datasets/GAIR/MathPile_Commercial).
- [2023/01/06] We released a new, cleaner version (v0.2) of MathPile. It has been pushed to the `main` branch (and also to the `v0.2` branch). The main updates are as follows:
  - fixed a problem with the display of mathematical formulas in the Wikipedia subset, which was caused by the HTML-to-Markdown conversion;
  - fixed unclosed caption parentheses in the image environment in arXiv and macro command substitutions (as suggested in [issue 1](https://huggingface.co/datasets/GAIR/MathPile/discussions/1)), as well as improper line wrapping in paragraphs.
  - If you would like to download the original MathPile, set the `revision` parameter to `v0.1`.
- [2023/12/29] Thanks for your interest in our dataset. We strongly recommend that you complete all the information on the form when applying, to facilitate our review process.

<br>

# Dataset Card for MathPile

We introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Our work differs from previous work in the following characteristics:

<div align="center">
<img src="./imgs/mathpile-features.png" width=45%/>
</div>

- **Math-centric**: MathPile caters specifically to the math domain, unlike general-domain corpora such as Pile and RedPajama, or multilingual corpora such as ROOTS and The Stack. While other math-centric corpora exist, they are often either closed-source, like Google's Minerva and OpenAI's MathMix, or lack diversity, such as ProofPile and OpenWebMath.
- **Diversity**: MathPile draws from a wide range of sources: **Textbooks** (including lecture notes), **arXiv**, **Wikipedia**, **ProofWiki**, **StackExchange**, and **Web Pages**. It encompasses mathematical content suitable for K-12, college, postgraduate levels, and math competitions.
  **This diversity is a first, especially with our release of a significant collection of high-quality textbooks (~0.19B tokens).**
- **High-Quality**: We adhered to the principle of *less is more*, firmly believing that data quality matters more than quantity, even in the pre-training phase. Our data collection and processing included a suite of preprocessing, prefiltering, cleaning, filtering, and deduplication steps to ensure the high quality of the corpus.
- **Data Documentation**: To enhance transparency, we have extensively documented MathPile. This includes a **dataset sheet** (see Table 5 in our paper) and **quality annotations** for web-sourced documents, such as language identification scores and symbol-to-word ratios. This gives users the flexibility to tailor the data to their needs. We have also performed **data contamination detection** to remove documents that duplicate items from benchmark test sets such as MATH and MMLU-STEM.

<div align="center">
<img src="./imgs/mathpile-overview.png" width=70%/>
</div>

## Dataset Details

Refer to Appendix A in [our paper](https://huggingface.co/papers/2312.17120) for the MathPile Dataset Sheet.

### How to download MathPile?

Currently, we recommend downloading the dataset locally from the command line (for example with `huggingface-cli`) rather than through the Python function `load_dataset("GAIR/MathPile")` (which may run into network issues), then unpacking the gz files and loading the jsonl files. Some commands that might be helpful:

```
$ huggingface-cli download --resume-download --repo-type dataset GAIR/MathPile --local-dir /your/path/ --local-dir-use-symlinks False
$ cd /your/path/
$ find . -type f -name "*.gz" -exec gzip -d {} \;
```

Later we will also support loading the dataset via `load_dataset("GAIR/MathPile")`. Stay tuned.
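Once the files are on disk, the "load the jsonl file" step can be sketched in Python roughly as follows. This is a minimal sketch and not the authors' tooling: the function names `iter_records` and `download_snapshot` are hypothetical, the file extensions `.jsonl` and `.jsonl.gz` are assumed from the description above, and `download_snapshot` additionally assumes that the `huggingface_hub` package is installed and that access to the dataset has been granted.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_records(root: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty JSON line under `root`.

    Reads both unpacked `.jsonl` files and still-compressed `.jsonl.gz`
    files (assumed naming), so the `gzip -d` step becomes optional.
    """
    for path in sorted(Path(root).rglob("*.jsonl*")):
        if path.suffix == ".gz":
            opener = gzip.open
        elif path.suffix == ".jsonl":
            opener = open
        else:
            continue  # skip anything that merely resembles a jsonl file
        with opener(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


def download_snapshot(local_dir: str, revision: str = "main") -> str:
    """Optional alternative to the CLI command (hypothetical helper).

    Pass revision="v0.1" to fetch the original release mentioned in the
    update notes. Imported lazily so that `iter_records` works without
    `huggingface_hub` installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="GAIR/MathPile",
        repo_type="dataset",
        revision=revision,
        local_dir=local_dir,
    )


# Example usage (path is a placeholder):
# for record in iter_records("/your/path/"):
#     print(record["SubSet"], len(record["text"]))
```

Streaming the records one line at a time keeps memory use flat, which matters for a corpus of this size. If a directory holds both a `.jsonl` file and its `.gz` counterpart, this sketch would read the same records twice.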
### Dataset Description

- **Curated by:** GAIR Lab, SJTU
- **Funded by:** GAIR Lab, SJTU
- **Language(s) (NLP):** English
- **License:** CC BY-NC-SA 4.0

### Dataset Sources

- **Repository:** https://github.com/GAIR-NLP/MathPile
- **Paper:** https://huggingface.co/papers/2312.17120
- **Demo:** https://gair-nlp.github.io/MathPile/

## Uses

### Direct Use

To develop mathematical language models.

### Out-of-Scope Use

This dataset may not be suitable for scenarios unrelated to mathematics or reasoning.

## Dataset Structure

```
{
    "text": ...,
    "SubSet": "CommomCrawl" | "StackExchange" | "Textbooks" | "Wikipedia" | "ProofWiki" | "arXiv",
    "meta": {"language_detection_score": ..., "idx": ..., "contain_at_least_two_stop_words": ..., ...}
}
```

## Dataset Creation

### Curation Rationale

To create a diverse and high-quality math-centric corpus, thereby enhancing the mathematical reasoning abilities of language models.

### Source Data

#### Data Collection and Processing

We sourced data from textbooks, lecture notes, arXiv, Wikipedia, ProofWiki, StackExchange, and Common Crawl. Throughout the development of MathPile, we sourced and gathered data with a rigorous, math-specific pipeline. This pipeline comprises preprocessing, prefiltering, language identification, cleaning and filtering, and deduplication, all aimed at maintaining the high quality of the corpus. Please see [our paper](https://arxiv.org/abs/2312.17120) for more details.

### Annotations

We provide *quality annotations* (such as language identification scores and the ratio of symbols to words) for documents from web pages (i.e., Common Crawl and Wikipedia). These annotations give future researchers and developers the flexibility to filter the data according to their own criteria.
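As an illustration of how the annotations above might be used, the following Python sketch filters web-sourced records by their `meta` fields. It is a sketch under stated assumptions rather than the authors' filtering code: the function name `filter_web_documents` and the threshold `0.8` are hypothetical, `language_detection_score` is assumed to be a number, and `contain_at_least_two_stop_words` is assumed to be a boolean. The subset names are spelled exactly as in the Dataset Structure schema.

```python
from typing import Any, Dict, Iterable, Iterator

# Subsets that carry quality annotations, spelled as in the card's schema.
WEB_SUBSETS = {"CommomCrawl", "Wikipedia"}


def filter_web_documents(
    records: Iterable[Dict[str, Any]],
    min_lang_score: float = 0.8,  # hypothetical threshold, tune as needed
) -> Iterator[Dict[str, Any]]:
    """Keep non-web records unchanged; keep web records only if they pass.

    A web record passes when its language identification score is at
    least `min_lang_score` and it is flagged as containing at least two
    stop words. Web records with a missing score are dropped.
    """
    for record in records:
        if record.get("SubSet") not in WEB_SUBSETS:
            yield record
            continue
        meta = record.get("meta") or {}
        score = meta.get("language_detection_score")
        if score is None or score < min_lang_score:
            continue
        if not meta.get("contain_at_least_two_stop_words", False):
            continue
        yield record


sample = [
    {"text": "a", "SubSet": "arXiv", "meta": {"idx": 0}},
    {"text": "b", "SubSet": "CommomCrawl", "meta": {
        "idx": 1, "language_detection_score": 0.95,
        "contain_at_least_two_stop_words": True}},
    {"text": "c", "SubSet": "CommomCrawl", "meta": {
        "idx": 2, "language_detection_score": 0.40,
        "contain_at_least_two_stop_words": True}},
    {"text": "d", "SubSet": "Wikipedia", "meta": {
        "idx": 3, "language_detection_score": 0.99,
        "contain_at_least_two_stop_words": False}},
]
kept_ids = [r["meta"]["idx"] for r in filter_web_documents(sample)]
print(kept_ids)  # [0, 1]
```

Dropping web records that lack a score is a conservative choice; passing them through instead would be equally defensible, depending on how much noise the downstream training can tolerate.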
#### Personal and Sensitive Information

The corpus may contain academic email addresses and author names, as found in papers from sources such as arXiv. However, we view this as justifiable and within acceptable bounds.

## Bias, Risks, and Limitations

- The decisions made during the data collection and processing phases might not always be optimal.
- Some documents in MathPile may not be of the highest quality. We are committed to continually refining and optimizing this corpus.

### Recommendations

Users should be made aware of the risks, biases and limitations of the dataset.

## Citation

If you find our work useful or use MathPile, please cite our paper:

```
@inproceedings{
wang2024mathpile,
title={MathPile: A Billion-Token-Scale Pretraining Corpus for Math},
author={Zengzhi Wang and Xuefeng Li and Rui Xia and Pengfei Liu},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=RSvhU69sbG}
}
```

## Dataset Card Authors

[Zengzhi Wang](https://scholar.google.com/citations?user=qLS4f-8AAAAJ&hl=en)

## Dataset Card Contact

stefanpengfei@gmail.com, zzwang.nlp@gmail.com


Provider:

maas

Created:

2025-02-08
