swallow-math

Name: swallow-math
Creator: maas
Published: 2025-11-27 16:50:47
License: 暂无描述

魔搭社区2025-11-27 更新2025-11-03 收录

下载链接：

https://modelscope.cn/datasets/tokyotech-llm/swallow-math

下载链接

链接失效反馈

官方服务：

资源简介：

# SwallowMath <img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/swallow-code-math-log.png" alt="SwallowMath Icon" width="600"> ### Resources - 🐙 **GitHub**: Explore the project repository, including pipeline code and prompts at [rioyokotalab/swallow-code-math](https://github.com/rioyokotalab/swallow-code-math). - 📑 **arXiv**: Read our paper for detailed methodology and results at [arXiv:2505.02881](https://arxiv.org/abs/2505.02881). - 🤗 **Sister Dataset**: Discover [SwallowCode](https://huggingface.co/datasets/tokyotech-llm/swallow-code), our companion dataset for code generation. ## What is it? SwallowMath is a high-quality mathematical dataset comprising approximately 2.3 billion tokens derived from the [FineMath-4+](https://huggingface.co/datasets/HuggingFaceTB/finemath) dataset through an LLM-driven rewriting pipeline. Using [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), we transform the original dataset by removing boilerplate, restoring missing context, and reformatting solutions into concise, step-by-step explanations. The pipeline prioritizes educational clarity and mathematical reasoning, making SwallowMath ideal for training large language models (LLMs) for mathematical tasks. More details are available in our paper: https://arxiv.org/abs/2505.02881. <img src="figures/finemath-rewriting.png" width="800"/> ## What is being released? The dataset is released as: **SwallowMath**: Approximately 2.3 billion tokens, derived from FineMath-4+ (9.6 billion tokens, 6.7M documents), containing rewritten mathematical content with concise, step-by-step explanations formatted in Markdown and LaTeX. All data is publicly available under the [Llama 3.3 Community License](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/resolve/main/LICENSE). ## Dataset curation SwallowMath builds on FineMath-4+, a high-quality subset of mathematical content filtered from CommonCrawl. We enhance this dataset through an LLM-driven rewriting pipeline tailored for mathematical reasoning, addressing limitations such as boilerplate, missing context, and verbose explanations. ### Rewriting Pipeline Using [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), the pipeline performs the following steps: 1. Remove Boilerplate: Eliminates residual web headers, footers, privacy notices, and extraneous metadata (e.g., question/answer timestamps). 2. Restore Context: Fills in missing information in incomplete questions or answers to ensure clarity and completeness. 3. Rewrite Explanations: Reformats solutions into concise, comprehensive, step-by-step explanations, enhancing educational value. The full rewriting prompt is available at https://github.com/rioyokotalab/swallow-code-math. ## Decontamination ## Results and Performance Continual pre-training of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) for approximately 50 billion tokens, substituting FineMath-4+ with SwallowMath in the math subset (4.79% of the mixture), yields significant improvements: - GSM8K: **+12.4** points accuracy. - MATH: **+7.6** points accuracy. These gains demonstrate SwallowMath’s superior quality for training models in mathematical reasoning. Detailed results are available in the paper (Tables 19 and 20). ## Considerations for Using the Data ### Social Impact of the Dataset SwallowMath aims to democratize access to high-quality mathematical training data, fostering advancements in LLM mathematical reasoning. By releasing an openly licensed dataset, we: - Enhance transparency in the dataset improvement pipeline. - Lower barriers for training mathematically proficient models. ### Discussion of Biases The dataset may inherit biases from FineMath-4+, including: - Focus on English-language content. - Potential over-representation of certain problem types (e.g., algebra vs. geometry). - Influence of Llama-3.3-70B-Instruct’s preferences in solution style and formatting. ## Licensing Information SwallowMath is released under the Llama 3.3 Community License. Usage is also subject to [CommonCrawl's Terms of Use](https://commoncrawl.org/terms-of-use). ## Future work Potential directions include: - Expanding to non-English mathematical content. - Exploring larger pre-training budgets to assess scalability. ## Citation information ``` @misc{fujii2025rewritingpretrainingdataboosts, title={Rewriting Pre-Training Data Boosts LLM Performance in Math and Code}, author={Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki}, year={2025}, eprint={2505.02881}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2505.02881}, } ```

# SwallowMath <img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/swallow-code-math-log.png" alt="SwallowMath Icon" width="600"> ### 资源 - 🐙 **GitHub**：探索项目仓库（含流水线代码与提示词），地址为[rioyokotalab/swallow-code-math](https://github.com/rioyokotalab/swallow-code-math)。 - 📑 **arXiv**：阅读我们的论文以获取详细方法与实验结果，地址为[arXiv:2505.02881](https://arxiv.org/abs/2505.02881)。 - 🤗 **姊妹数据集**：查看我们的代码生成配套数据集[SwallowCode](https://huggingface.co/datasets/tokyotech-llm/swallow-code)。 ## 数据集简介 SwallowMath是一款高质量数学数据集，包含约23亿个Token，源自[FineMath-4+](https://huggingface.co/datasets/HuggingFaceTB/finemath)数据集，通过大语言模型（LLM，Large Language Model）驱动的重写流水线生成。我们使用[Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)对原始数据集进行处理，移除冗余内容、补全缺失上下文，并将解题步骤重新格式化为简洁的分步解释。该流水线优先保障教学清晰度与数学推理逻辑，使SwallowMath非常适合用于训练面向数学任务的大语言模型（LLMs）。更多细节可参阅我们的论文：https://arxiv.org/abs/2505.02881。 <img src="figures/finemath-rewriting.png" width="800"/> ## 发布内容本次发布的数据集为： **SwallowMath**：包含约23亿个Token，源自FineMath-4+（原数据集含96亿个Token、670万个文档），其中的数学内容均经过重写，采用Markdown与LaTeX格式呈现为简洁的分步解释。所有数据均基于[Llama 3.3 Community License](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/resolve/main/LICENSE)开源协议公开可用。 ## 数据集构建 SwallowMath基于FineMath-4+构建，后者是从CommonCrawl中筛选出的高质量数学内容子集。我们通过专为数学推理设计的大语言模型驱动重写流水线，对该数据集进行优化，解决了原数据集存在的冗余内容、缺失上下文与解释冗长等问题。 ### 重写流水线使用[Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)，该流水线执行以下步骤： 1. **移除冗余内容**：剔除残留的网页页眉、页脚、隐私声明与无关元数据（例如问答时间戳）。 2. **补全上下文**：为不完整的问题或答案补充缺失信息，确保内容清晰且完整。 3. **重写解释内容**：将解题步骤重新格式化为简洁、全面的分步解释，提升教学价值。完整的重写提示词可访问：https://github.com/rioyokotalab/swallow-code-math。 ## 数据去污染 ## 实验结果与性能对[Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)进行约500亿个Token的持续预训练，将数学子集（占混合数据集的4.79%）中的FineMath-4+替换为SwallowMath后，模型性能获得显著提升： - GSM8K：准确率提升**+12.4**个百分点。 - MATH：准确率提升**+7.6**个百分点。这些提升证明了SwallowMath在训练数学推理模型方面的优异性能。详细实验结果可参阅论文中的表19与表20。 ## 数据集使用注意事项 ### 数据集的社会影响 SwallowMath旨在让高质量数学训练数据的获取更加普惠，推动大语言模型（LLMs）数学推理能力的发展。通过发布开源许可的数据集，我们： - 提升数据集优化流程的透明度。 - 降低训练具备数学能力的模型的门槛。 ### 偏差讨论本数据集可能继承自FineMath-4+的偏差，包括： - 仅包含英语内容。 - 部分题型（例如代数与几何）可能存在过度代表的情况。 - 解题风格与格式可能受到Llama-3.3-70B-Instruct的偏好影响。 ## 授权信息 SwallowMath基于Llama 3.3 Community License协议发布。使用本数据集还需遵守[CommonCrawl的使用条款](https://commoncrawl.org/terms-of-use)。 ## 未来研究方向未来的研究方向包括： - 拓展至非英语数学内容。 - 探索更大的预训练算力预算以评估模型的可扩展性。 ## 引用信息 @misc{fujii2025rewritingpretrainingdataboosts, title={Rewriting Pre-Training Data Boosts LLM Performance in Math and Code}, author={Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki}, year={2025}, eprint={2505.02881}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2505.02881}, }

提供机构：

maas

创建时间：

2025-10-03

搜集汇总

数据集介绍

背景与挑战

背景概述

SwallowMath是一个高质量的数学数据集，通过LLM驱动的重写流程从FineMath-4+数据集衍生而来，包含约23亿个标记，旨在提升大语言模型在数学推理任务上的性能。它采用Llama 3.3 Community License开源许可，并在GSM8K和MATH基准测试中表现出显著的准确性提升。

以上内容由遇见数据集搜集并总结生成