MathPile_Commercial
Source: ModelScope community. Updated 2025-12-26; indexed 2025-02-15.
Download link:
https://modelscope.cn/datasets/GAIR/MathPile_Commercial
Description:
<br>
**🔥Update**:
- [2024/01/06] We released the commercial-use version of MathPile, namely `MathPile_Commercial`.
<br>
# Dataset Card for MathPile_Commercial
<!-- Provide a quick summary of the dataset. -->
`MathPile_Commercial` is a commercial-use version of [MathPile](https://huggingface.co/datasets/GAIR/MathPile), obtained by removing documents that are prohibited from commercial use from MathPile (latest version, i.e., `v0.2`). Specifically, we ran non-commercial-use detection on the source data, using the license information in the metadata for arXiv sources and keyword matching for other sources. As a result, we excluded approximately 8,000 documents from the latest version of MathPile: 7,350 from arXiv, 518 from Creative Commons sources, 68 from textbooks, and 8 from Wikipedia. This version of the dataset contains around 9.2 billion tokens.
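As a rough illustration of the keyword-matching idea described above, the sketch below flags text that contains a non-commercial license marker. The pattern list and the function name `looks_non_commercial` are assumptions made for this example; the actual keywords and rules used to build `MathPile_Commercial` are not given in this card.

```python
import re

# Hypothetical markers of non-commercial licensing (assumed for illustration;
# not the authors' actual keyword list).
NON_COMMERCIAL_PATTERNS = [
    r"\bnon-?commercial\b",
    r"\bCC[ -]BY-NC\b",
    r"creativecommons\.org/licenses/by-nc",
]
_PATTERN = re.compile("|".join(NON_COMMERCIAL_PATTERNS), re.IGNORECASE)


def looks_non_commercial(text: str) -> bool:
    """Return True if the text contains any of the assumed non-commercial markers."""
    return _PATTERN.search(text) is not None
```

The same check could be applied to a license string taken from document metadata, since a license URL such as one containing `licenses/by-nc` would match the third pattern.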
MathPile is a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. It differs from previous work in the following ways:
<div align="center">
<img src="./imgs/mathpile-key-features.png" width=45%/>
</div>
- **Math-centric**: MathPile caters specifically to the math domain, unlike general-domain corpora such as Pile and RedPajama, or multilingual corpora such as ROOTS and The Stack. Existing math-centric corpora are often either closed-source, like Google's Minerva and OpenAI's MathMix, or lack diversity, such as ProofPile and OpenWebMath.
- **Diversity**: MathPile draws from a wide range of sources: **Textbooks** (including lecture notes), **arXiv**, **Wikipedia**, **ProofWiki**, **StackExchange**, and **Web Pages**. It encompasses mathematical content suitable for K-12, college, postgraduate levels, and math competitions. **This diversity is a first, especially with our release of a significant collection of high-quality textbooks (~0.19B tokens).**
- **High-Quality**: We adhered to the principle of *less is more*, firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, cleaning, filtering, and deduplication, ensuring the high quality of our corpus.
- **Data Documentation**: To enhance transparency, we've extensively documented MathPile. This includes a **dataset sheet** (see Table 5 in our paper) and **quality annotations** for web-sourced documents, like language identification scores and symbol-to-word ratios. This gives users flexibility to tailor the data to their needs. We've also performed **data contamination detection** to eliminate duplicates from benchmark test sets like MATH and MMLU-STEM.
<div align="center">
<img src="./imgs/mathpile-overview.png" width=70%/>
</div>
## Dataset Details
Refer to Appendix A in [our paper](https://huggingface.co/papers/2312.17120) for the MathPile Dataset Sheet.
### How to download MathPile?
Currently, we recommend downloading the dataset locally from the command line (for example, with `huggingface-cli`) rather than through the Python function `load_dataset("GAIR/MathPile")`, which may fail due to network issues. After downloading, unpack the gz files and then load the jsonl files. The following commands may be helpful:
```
$ huggingface-cli download --resume-download --repo-type dataset GAIR/MathPile --local-dir /your/path/ --local-dir-use-symlinks False
$ cd /your/path/
$ find . -type f -name "*.gz" -exec gzip -d {} \;
```
We will later also support loading the dataset via `load_dataset("GAIR/MathPile")`. Stay tuned.
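Once the gz files are unpacked, each jsonl file can be read line by line. The following is a minimal sketch using only the Python standard library; the helper name `iter_jsonl` and the file path in the usage note are placeholders for this example, not part of the dataset tooling.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one document (a dict) per non-empty line of a decompressed jsonl file."""
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `for doc in iter_jsonl("/your/path/some_file.jsonl"): ...` streams documents one at a time without loading the whole file into memory (`some_file.jsonl` is a hypothetical file name).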
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
- **Curated by:** GAIR Lab, SJTU
- **Funded by [optional]:** GAIR Lab, SJTU
- **Language(s) (NLP):** English
- **License:** CC BY-SA 4.0
### Dataset Sources
<!-- Provide the basic links for the dataset. -->
- **Repository:** https://github.com/GAIR-NLP/MathPile
- **Paper [optional]:** https://huggingface.co/papers/2312.17120
- **Demo [optional]:** https://gair-nlp.github.io/MathPile/
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
### Direct Use
To develop mathematical language models.
<!-- This section describes suitable use cases for the dataset. -->
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
This dataset may not be suitable for scenarios unrelated to mathematics or reasoning.
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
```
{
    "text": ...,
    "SubSet": "CommomCrawl" | "StackExchange" | "Textbooks" | "Wikipedia" | "ProofWiki" | "arXiv",
    "meta": {"language_detection_score": ..., "idx": ..., "contain_at_least_two_stop_words": ..., ...}
}
```
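Given records shaped like the schema above, a quick way to inspect a file is to count documents per `SubSet` value. This is a minimal sketch; the helper name `count_by_subset` and the toy records are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def count_by_subset(docs: Iterable[dict]) -> Counter:
    """Count documents per value of the "SubSet" field."""
    return Counter(doc.get("SubSet", "<missing>") for doc in docs)


# Toy records for illustration only; real records come from the jsonl files.
sample_docs = [
    {"text": "Let G be a group.", "SubSet": "ProofWiki", "meta": {}},
    {"text": "We prove a bound.", "SubSet": "arXiv", "meta": {}},
    {"text": "Another abstract.", "SubSet": "arXiv", "meta": {}},
]
subset_counts = count_by_subset(sample_docs)
```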
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
To create a diverse and high-quality math-centric corpus, thereby enhancing the mathematical reasoning abilities of language models.
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
We sourced data from textbooks, lecture notes, arXiv, Wikipedia, ProofWiki, StackExchange, and Common Crawl. Throughout the development of MathPile, we meticulously sourced and gathered data, applying a rigorous, math-specific pipeline. This pipeline encompasses stages such as preprocessing, prefiltering, language identification, cleaning and filtering, and deduplication, all aimed at maintaining the high quality of the corpus. Please see [our paper](https://arxiv.org/abs/2312.17120) for more details.
### Annotations
<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->
We provide *quality annotations* (such as language identification scores and the ratio of symbols to words) for documents from web pages (i.e., Common Crawl and Wikipedia). These annotations give future researchers and developers the flexibility to filter the data according to their own criteria, tailoring it to their specific needs.
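As one example of such filtering, the sketch below keeps documents whose `language_detection_score` (a field shown in the dataset structure) meets a minimum value. The threshold of 0.5, the assumption that a higher score means more confident identification, and the choice to keep documents that carry no score are all illustrative assumptions, not recommendations from the authors.

```python
from typing import Iterable, Iterator


def filter_by_language_score(
    docs: Iterable[dict],
    min_score: float = 0.5,          # arbitrary threshold chosen for illustration
    keep_unannotated: bool = True,   # documents without a score are kept by default
) -> Iterator[dict]:
    """Yield documents whose meta.language_detection_score is at least min_score."""
    for doc in docs:
        score = (doc.get("meta") or {}).get("language_detection_score")
        if score is None:
            if keep_unannotated:
                yield doc
        elif score >= min_score:
            yield doc
```

Keeping unannotated documents by default reflects that only web-sourced documents are described as carrying these annotations; pass `keep_unannotated=False` to drop them instead.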
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
The corpus may contain academic email addresses and author names, as found in papers from sources like arXiv. However, we view this as justifiable and within acceptable bounds.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
- The decisions made during the data collection and processing phases might not always be optimal.
- Some documents in MathPile may not always be of the highest quality. We are committed to continually refining and optimizing this corpus.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users should be made aware of the risks, biases and limitations of the dataset.
## Citation
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
If you find our work useful or use MathPile, please cite our paper:
```
@inproceedings{
wang2024mathpile,
title={MathPile: A Billion-Token-Scale Pretraining Corpus for Math},
author={Zengzhi Wang and Xuefeng Li and Rui Xia and Pengfei Liu},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=RSvhU69sbG}
}
```
## Dataset Card Authors
[Zengzhi Wang](https://scholar.google.com/citations?user=qLS4f-8AAAAJ&hl=en)
## Dataset Card Contact
stefanpengfei@gmail.com, zzwang.nlp@gmail.com
Provider: maas
Created: 2025-02-08



