
MathPile

Source: ModelScope community (updated 2026-01-09, indexed 2025-02-15)
Download link: https://modelscope.cn/datasets/GAIR/MathPile
Resource description:
**🔥 Update**:

- [2024/01/06] We released the commercial-use version of MathPile, [MathPile_Commercial](https://huggingface.co/datasets/GAIR/MathPile_Commercial).
- [2024/01/06] We released a new, cleaner version (v0.2) of MathPile. It is available on the `main` branch (and on the `v0.2` branch). The main changes are:
  - fixed the display of mathematical formulas in the Wikipedia subset, which had been broken by the HTML-to-Markdown conversion;
  - fixed unclosed caption parentheses in image environments in the arXiv subset and macro command substitutions (as suggested in [issue 1](https://huggingface.co/datasets/GAIR/MathPile/discussions/1)), as well as improper line wrapping in paragraphs.
  - To download the original MathPile, set the `revision` parameter to `v0.1`.
- [2023/12/29] Thanks for your interest in our dataset. We strongly recommend that you complete all the information on the form when applying, to facilitate our review process.

# Dataset Card for MathPile

We introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Our work differs from previous work in the following respects:

<div align="center">
<img src="./imgs/mathpile-features.png" width=45%/>
</div>

- **Math-centric**: MathPile caters specifically to the math domain, unlike general-domain corpora such as Pile and RedPajama, or multilingual corpora such as ROOTS and The Stack. Existing math-centric corpora are often either closed-source, like Google's Minerva and OpenAI's MathMix, or lack diversity, such as ProofPile and OpenWebMath.
- **Diversity**: MathPile draws from a wide range of sources: **Textbooks** (including lecture notes), **arXiv**, **Wikipedia**, **ProofWiki**, **StackExchange**, and **Web Pages**. It encompasses mathematical content suitable for K-12, college, and postgraduate levels, as well as math competitions. **This diversity is a first, especially with our release of a significant collection of high-quality textbooks (~0.19B tokens).**
- **High-Quality**: We adhered to the principle of *less is more*, firmly believing that data quality matters more than quantity, even in the pre-training phase. Our data collection and processing included a complex suite of preprocessing, prefiltering, cleaning, filtering, and deduplication steps to ensure the high quality of the corpus.
- **Data Documentation**: To enhance transparency, we have extensively documented MathPile. This includes a **dataset sheet** (see Table 5 in our paper) and **quality annotations** for web-sourced documents, such as language identification scores and symbol-to-word ratios. These give users the flexibility to tailor the data to their needs. We have also performed **data contamination detection** to remove documents that duplicate samples from benchmark test sets such as MATH and MMLU-STEM.

<div align="center">
<img src="./imgs/mathpile-overview.png" width=70%/>
</div>

## Dataset Details

Refer to Appendix A in [our paper](https://huggingface.co/papers/2312.17120) for the MathPile Dataset Sheet.

### How to download MathPile?

Currently, we recommend downloading the dataset locally from the command line (for example with `huggingface-cli`) rather than with the Python function `load_dataset("GAIR/MathPile")`, which may run into network issues. After downloading, unpack the gz files and then load the jsonl files. The following commands may be helpful:

```
$ huggingface-cli download --resume-download --repo-type dataset GAIR/MathPile --local-dir /your/path/ --local-dir-use-symlinks False
$ cd /your/path/
$ find . -type f -name "*.gz" -exec gzip -d {} \;
```

We plan to support loading the dataset via `load_dataset("GAIR/MathPile")` later. Stay tuned.

### Dataset Description

- **Curated by:** GAIR Lab, SJTU
- **Funded by:** GAIR Lab, SJTU
- **Language(s) (NLP):** English
- **License:** CC BY-NC-SA 4.0

### Dataset Sources

- **Repository:** https://github.com/GAIR-NLP/MathPile
- **Paper:** https://huggingface.co/papers/2312.17120
- **Demo:** https://gair-nlp.github.io/MathPile/

## Uses

### Direct Use

To develop mathematical language models.

### Out-of-Scope Use

This dataset may not be suitable for scenarios unrelated to mathematics or reasoning.

## Dataset Structure

```
{
    "text": ...,
    "SubSet": "CommomCrawl" | "StackExchange" | "Textbooks" | "Wikipedia" | "ProofWiki" | "arXiv",
    "meta": {"language_detection_score": , "idx": , "contain_at_least_two_stop_words": , }
}
```

## Dataset Creation

### Curation Rationale

To create a diverse and high-quality math-centric corpus, thereby enhancing the mathematical reasoning abilities of language models.

### Source Data

#### Data Collection and Processing

We sourced data from textbooks, lecture notes, arXiv, Wikipedia, ProofWiki, StackExchange, and Common Crawl.
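
To see how these sources are represented in a locally unpacked copy, one could tally the records by their `SubSet` field, optionally skipping web documents whose annotations fall below a chosen cutoff. The following is a minimal sketch rather than the authors' tooling: the helper names and the `0.9` cutoff are hypothetical, and it assumes that `language_detection_score` is numeric and `contain_at_least_two_stop_words` is boolean where present. Records without these annotations are kept unchanged.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_records(root: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of every *.jsonl file under root."""
    for path in sorted(root.rglob("*.jsonl")):
        with path.open("r", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def passes_quality(record: dict, min_language_score: float = 0.9) -> bool:
    """Return False only when an annotation is present and fails the check.

    The 0.9 default is a hypothetical cutoff, not a value from the dataset card.
    """
    meta = record.get("meta") or {}
    score = meta.get("language_detection_score")
    if score is not None and score < min_language_score:
        return False
    if meta.get("contain_at_least_two_stop_words") is False:
        return False
    return True


def tally_subsets(root: Path, min_language_score: float = 0.9) -> Counter:
    """Count the records that pass the quality check, grouped by `SubSet`."""
    counts: Counter = Counter()
    for record in iter_records(root):
        if passes_quality(record, min_language_score):
            counts[record.get("SubSet", "<missing>")] += 1
    return counts
```

Calling `tally_subsets(Path("/your/path/"))` on the directory used in the download commands would return a `Counter` keyed by subset name. Note that the Common Crawl subset is keyed by the literal string given in the structure listing, `"CommomCrawl"`.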

Throughout the development of MathPile, we meticulously sourced and gathered data, applying a rigorous, math-specific pipeline. This pipeline comprises preprocessing, prefiltering, language identification, cleaning and filtering, and deduplication, all aimed at maintaining the high quality of the corpus. Please see [our paper](https://arxiv.org/abs/2312.17120) for more details.

### Annotations

We provide *quality annotations* (such as language identification scores and the ratio of symbols to words) for documents from web pages (i.e., Common Crawl and Wikipedia). These annotations give researchers and developers the flexibility to filter the data according to their own criteria and tailor it to their specific needs.

#### Personal and Sensitive Information

The corpus may contain academic email addresses and author names, as found in papers from sources like arXiv. However, we view this as justifiable and within acceptable bounds.

## Bias, Risks, and Limitations

- The decisions made during the data collection and processing phases might not always be optimal.
- Some documents in MathPile may not be of the highest quality. We are committed to continually refining and optimizing this corpus.

### Recommendations

Users should be made aware of the risks, biases, and limitations of the dataset.

## Citation

If you find our work useful or use MathPile, please cite our paper:

```
@inproceedings{
wang2024mathpile,
title={MathPile: A Billion-Token-Scale Pretraining Corpus for Math},
author={Zengzhi Wang and Xuefeng Li and Rui Xia and Pengfei Liu},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=RSvhU69sbG}
}
```

## Dataset Card Authors

[Zengzhi Wang](https://scholar.google.com/citations?user=qLS4f-8AAAAJ&hl=en)

## Dataset Card Contact

stefanpengfei@gmail.com, zzwang.nlp@gmail.com

Provider: maas
Created: 2025-02-08