MathPile_Commercial

Name: MathPile_Commercial
Creator: maas
Published: 2025-12-26 14:44:50
License: No description available

ModelScope community · Updated 2025-12-26 · Indexed 2025-02-15

Download link:

https://modelscope.cn/datasets/GAIR/MathPile_Commercial

Resource description:

**🔥Update**:

- [2024/01/06] We released the commercial-use version of MathPile, namely `MathPile_Commercial`.

# Dataset Card for MathPile_Commercial

`MathPile_Commercial` is a commercial-use version of [MathPile](https://huggingface.co/datasets/GAIR/MathPile), obtained by removing documents that are prohibited from commercial use from MathPile (latest version, i.e., `v0.2`). Specifically, we ran non-commercial-use detection on the source data, using the license information in the metadata for arXiv sources and keyword matching for other sources. As a result, we excluded approximately 8,000 documents from the latest version of MathPile: 7,350 from arXiv, 518 from Creative Commons sources, 68 from textbooks, and 8 from Wikipedia. This version of the dataset contains around 9.2 billion tokens.

MathPile is a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. It differs from previous work in the following characteristics:

<div align="center">
<img src="./imgs/mathpile-key-features.png" width=45%/>
</div>

- **Math-centric**: MathPile caters specifically to the math domain, unlike general-domain corpora such as Pile and RedPajama, or multilingual corpora such as ROOTS and The Stack. While other math-centric corpora exist, they are often either closed-source, like Google's Minerva and OpenAI's MathMix, or lack diversity, such as ProofPile and OpenWebMath.
- **Diversity**: MathPile draws from a wide range of sources: **Textbooks** (including lecture notes), **arXiv**, **Wikipedia**, **ProofWiki**, **StackExchange**, and **Web Pages**. It encompasses mathematical content suitable for K-12, college, postgraduate levels, and math competitions.
  **This diversity is a first, especially with our release of a significant collection of high-quality textbooks (~0.19B tokens).**
- **High-Quality**: We adhered to the principle of *less is more*, firmly believing that data quality matters more than quantity, even in the pre-training phase. Our data collection and processing included a complex suite of preprocessing, prefiltering, cleaning, filtering, and deduplication steps to ensure the high quality of the corpus.
- **Data Documentation**: To enhance transparency, we have extensively documented MathPile. This includes a **dataset sheet** (see Table 5 in our paper) and **quality annotations** for web-sourced documents, such as language identification scores and symbol-to-word ratios. This gives users the flexibility to tailor the data to their needs. We have also performed **data contamination detection** to eliminate duplicates of benchmark test sets such as MATH and MMLU-STEM.

<div align="center">
<img src="./imgs/mathpile-overview.png" width=70%/>
</div>

## Dataset Details

Refer to Appendix A in [our paper](https://huggingface.co/papers/2312.17120) for the MathPile Dataset Sheet.

### How to download MathPile?

Currently, we recommend downloading the dataset locally from the command line (for example with `huggingface-cli`) instead of using the Python function `load_dataset("GAIR/MathPile")` (due to a possible network issue), then unpacking the gz files and loading the jsonl files. Some commands that might be helpful are as follows:

```
# Download the dataset repository into a local directory
$ huggingface-cli download --resume-download --repo-type dataset GAIR/MathPile --local-dir /your/path/ --local-dir-use-symlinks False
$ cd /your/path/
# Decompress every .gz file in place
$ find . -type f -name "*.gz" -exec gzip -d {} \;
```

Later we will also support loading the dataset via `load_dataset("GAIR/MathPile")`. Stay tuned.
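Once the archives are unpacked, the jsonl files can be read with the Python standard library alone. The following is a minimal sketch, not the authors' loader: it assumes the decompressed files keep a `.jsonl` extension and hold one JSON object per line, and the helper name `iter_documents` is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_documents(root: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of every .jsonl file under root.

    Assumption: the unpacked files end in ".jsonl" and store one document per line.
    """
    for path in sorted(Path(root).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines instead of failing on them
                    yield json.loads(line)
```

Calling `next(iter_documents("/your/path/"))` would then return the first document as a dictionary. Streaming the files line by line keeps memory use low, which matters for a corpus of this size.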
### Dataset Description

- **Curated by:** GAIR Lab, SJTU
- **Funded by [optional]:** GAIR Lab, SJTU
- **Language(s) (NLP):** English
- **License:** CC BY-SA 4.0

### Dataset Sources

- **Repository:** https://github.com/GAIR-NLP/MathPile
- **Paper [optional]:** https://huggingface.co/papers/2312.17120
- **Demo [optional]:** https://gair-nlp.github.io/MathPile/

## Uses

### Direct Use

To develop mathematical language models.

### Out-of-Scope Use

This dataset may not be suitable for scenarios unrelated to mathematics or reasoning.

## Dataset Structure

```
{
    "text": ...,
    "SubSet": "CommomCrawl" | "StackExchange" | "Textbooks" | "Wikipedia" | "ProofWiki" | "arXiv",
    "meta": {"language_detection_score": , "idx": , "contain_at_least_two_stop_words": , }
}
```

## Dataset Creation

### Curation Rationale

To create a diverse and high-quality math-centric corpus, thereby enhancing the mathematical reasoning abilities of language models.

### Source Data

#### Data Collection and Processing

We sourced data from textbooks, lecture notes, arXiv, Wikipedia, ProofWiki, StackExchange, and Common Crawl. Throughout the development of MathPile, we sourced and gathered data carefully, applying a rigorous, math-specific pipeline. This pipeline comprises preprocessing, prefiltering, language identification, cleaning and filtering, and deduplication, all aimed at maintaining the high quality of the corpus. Please see [our paper](https://arxiv.org/abs/2312.17120) for more details.

### Annotations

We provide *quality annotations* (such as language identification scores and the ratio of symbols to words) for documents from web pages (i.e., Common Crawl and Wikipedia). These annotations give future researchers and developers the flexibility to filter the data according to their own criteria.
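As an illustration of how these annotations might be used, the sketch below filters documents by subset and by language identification score. It relies only on the `SubSet` and `meta.language_detection_score` fields shown under Dataset Structure. The function name `filter_documents`, the threshold `0.5`, and the choice to keep documents that carry no score are assumptions made for the example, not recommendations from the authors. The subset label is spelled as it appears in the structure above.

```python
from typing import Iterable, Iterator, Set


def filter_documents(
    docs: Iterable[dict], subsets: Set[str], min_lang_score: float
) -> Iterator[dict]:
    """Keep documents from the chosen subsets whose language score meets the threshold.

    Assumption: documents without a language_detection_score are kept,
    since the card describes the annotations only for web-sourced documents.
    """
    for doc in docs:
        if doc.get("SubSet") not in subsets:
            continue
        score = doc.get("meta", {}).get("language_detection_score")
        if score is not None and score < min_lang_score:
            continue
        yield doc


# Hypothetical in-memory records that follow the documented structure.
sample = [
    {"text": "a", "SubSet": "CommomCrawl", "meta": {"language_detection_score": 0.95}},
    {"text": "b", "SubSet": "CommomCrawl", "meta": {"language_detection_score": 0.20}},
    {"text": "c", "SubSet": "arXiv", "meta": {}},
]
kept = list(filter_documents(sample, {"CommomCrawl", "arXiv"}, 0.5))
print([d["text"] for d in kept])  # prints ['a', 'c']
```

Because the function accepts any iterable, it can be chained onto a streaming reader so that filtering happens without loading the whole corpus into memory.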
#### Personal and Sensitive Information

The corpus may contain academic email addresses and author names, as found in papers from sources like arXiv. However, we view this as justifiable and within acceptable bounds.

## Bias, Risks, and Limitations

- The decisions made during the data collection and processing phases might not always be optimal.
- Some documents in MathPile may not be of the highest quality. We are committed to continually refining and optimizing this corpus.

### Recommendations

Users should be made aware of the risks, biases, and limitations of the dataset.

## Citation

If you find our work useful or use MathPile, please cite our paper:

```
@inproceedings{
wang2024mathpile,
title={MathPile: A Billion-Token-Scale Pretraining Corpus for Math},
author={Zengzhi Wang and Xuefeng Li and Rui Xia and Pengfei Liu},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=RSvhU69sbG}
}
```

## Dataset Card Authors

[Zengzhi Wang](https://scholar.google.com/citations?user=qLS4f-8AAAAJ&hl=en)

## Dataset Card Contact

stefanpengfei@gmail.com, zzwang.nlp@gmail.com


Provider:

maas

Created:

2025-02-08
