
swallow-math-v2

ModelScope community | updated 2026-01-06 | indexed 2025-11-15
Download link:
https://modelscope.cn/datasets/tokyotech-llm/swallow-math-v2
# SwallowMath-v2

<img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/swallow-code-math-log.png" alt="SwallowMath-v2 Icon" width="500">

### Resources

- 📑 **arXiv**: Read our paper for detailed methodology at [arXiv:2505.02881](https://arxiv.org/abs/2505.02881).
- 🤗 **Sister Dataset**: Discover [SwallowCode2](https://huggingface.co/datasets/tokyotech-llm/swallow-code-v2), our companion dataset for code generation.

## 🧮 What is it?

[SwallowMath-v2](https://huggingface.co/datasets/tokyotech-llm/swallow-math-v2) is a large-scale mathematical dataset containing **32 billion tokens**, developed as the successor to [SwallowMath-v1](https://huggingface.co/datasets/tokyotech-llm/swallow-math).
Building on the success of v1, this release aims to construct a **larger-scale and more permissively licensed** corpus to support open and reproducible research on mathematical reasoning for large language models (LLMs).

As in our previous dataset SwallowMath-v1, SwallowMath-v2 employs an **LLM-driven rewriting approach**: removing boilerplate, restoring missing context, and reformatting solutions into clear, step-by-step explanations.
Additionally, we explored multiple rewriting styles and adopted the two most effective ones, Textbook and Q&A, in the final synthesis stage, yielding higher consistency and reasoning quality.

Empirical evaluations demonstrate that models trained with SwallowMath-v2 achieve stronger performance on **GSM-Plus** and **BBH**, surpassing other open mathematical datasets.<br>
<sub>† On the MATH benchmark, the SwallowMath-v2 (Q&A) variant performs slightly below Nemotron-CC-Math-v1 4+.
However, SwallowMath-v2 offers a significantly more permissive Apache-2.0 license, providing clearer usage rights for both research and commercial applications.</sub>

<img src="./swallow_math-v2.jpg" width="800"/>

## 📊 Dataset Comparison

| Dataset | Token Count (Llama-3 Tokenizer) | License |
| :-------------------------------- | :-----------------------------: | :--------------------------------- |
| **[Nemotron-CC-Math-v1 4+](https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1)** | 51.4 B tokens | [NVIDIA Open Data License Agreement](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample/raw/main/LICENSE.md) |
| [MegaMathWeb-Pro](https://huggingface.co/datasets/LLM360/MegaMath) | 13.0 B tokens | Open Data Commons License Attribution family |
| **SwallowMath-v1 (our previous)** | 3.6 B tokens | Llama-3.3 Community License |
| **SwallowMath-v2 (this work)** | 32.0 B tokens | **Apache 2.0 License** |

## 📦 What is being released?

**SwallowMath-v2**: Approximately **32** billion tokens, derived from FineMath-3+, containing rewritten mathematical content with concise, step-by-step explanations formatted in Markdown and LaTeX.
All data is publicly available under the **Apache 2.0** license.
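As a rough sketch of how the two final subsets might be inspected, the snippet below records the sample and token counts that the Dataset structure list gives for `stage3-qa` and `stage3-textbook`, derives the mean number of tokens per sample, and shows one possible way to stream a few records with the Hugging Face `datasets` library. The `data_dir` layout, the `train` split name, and all helper names are assumptions made for illustration; the release does not document them here.

```python
from itertools import islice

REPO_ID = "tokyotech-llm/swallow-math-v2"

# Sample and token counts as stated in this README (Llama-3 tokenizer).
FINAL_SUBSETS = {
    "stage3-qa": {"samples": 12_635_739, "tokens": 13.6e9},
    "stage3-textbook": {"samples": 13_302_336, "tokens": 18.3e9},
}


def mean_tokens_per_sample(name: str) -> float:
    """Average token count per sample for one of the final subsets."""
    info = FINAL_SUBSETS[name]
    return info["tokens"] / info["samples"]


def total_tokens() -> float:
    """Combined token count of the Q&A and textbook subsets."""
    return sum(info["tokens"] for info in FINAL_SUBSETS.values())


def stream_first_records(name: str, limit: int = 3) -> list:
    """Stream a few records without downloading the whole subset.

    Assumption: each subset can be selected with data_dir=<folder name>
    and exposes a "train" split; adjust if the repository layout differs.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    stream = load_dataset(REPO_ID, data_dir=name, split="train", streaming=True)
    return list(islice(stream, limit))


if __name__ == "__main__":
    for name in FINAL_SUBSETS:
        print(f"{name}: ~{mean_tokens_per_sample(name):.0f} tokens per sample")
    print(f"total: {total_tokens() / 1e9:.1f} B tokens")
```

The two subsets together come to about 31.9 B tokens, in line with the approximately 32 B total reported for the release. Calling `stream_first_records("stage3-qa")` requires network access and the `datasets` package, so it is left out of the default run.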
### 🗂️ Dataset structure

- [stage1-length-filter](https://huggingface.co/datasets/tokyotech-llm/swallow-math-v2/tree/main/stage1-length-filter): Subset of FineMath-3+ filtered by text length
- [stage2-extract-math-text](https://huggingface.co/datasets/tokyotech-llm/swallow-math-v2/tree/main/stage2-extract-math-text): Stage 1 output refined by LLM-based extraction of mathematical text
- [stage3-ablations](https://huggingface.co/datasets/tokyotech-llm/swallow-math-v2/tree/main/stage3-ablations): Datasets for rewriting-style ablation experiments
- [stage3-qa](https://huggingface.co/datasets/tokyotech-llm/swallow-math-v2/tree/main/stage3-qa): SwallowMath-v2 (Q&A) dataset (12,635,739 samples, **13.6B** tokens)
- [stage3-textbook](https://huggingface.co/datasets/tokyotech-llm/swallow-math-v2/tree/main/stage3-textbook): SwallowMath-v2 (textbook) dataset (13,302,336 samples, **18.3B** tokens)

## 🧩 Dataset curation

SwallowMath-v2 builds on FineMath-3+, a high-quality subset of mathematical content filtered from CommonCrawl.
We enhance it through an **LLM-driven rewriting pipeline** tailored for mathematical reasoning, addressing key limitations such as boilerplate, missing context, and verbose explanations.

### ⚙️ Rewriting Pipeline

Using [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507), the pipeline proceeded as follows:

1. **Stage 1 - Length Filtering**: Remove over-long samples from FineMath-3+ to stay within model context limits.
2. **Stage 2 - Math Extraction**: Extract mathematical text segments from the Stage 1 output using an LLM ([Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)).
3. **Stage 3 - Rewriting**: Rewrite Stage 2 samples into Q&A and textbook styles.

### 🧪 Rewriting style ablation experiments

We designed and compared five rewriting styles:

1. Textbook: Structured presentation of definitions, worked examples, and solution procedures
2. Q&A: Single/Multi-turn question–answer format
3. Planning: Explicit plan description followed by step-wise reasoning
4. Socratic: Teacher–student dialogue, solving the problem interactively
5. Multiple Solution: Generation of multiple candidate solutions with justification for the optimal one

As shown in the figure below, the **Q&A** style yields the highest performance on *GSM8K* and *GSM-Plus*, the **Textbook** style performs best on *BBH*, and both are effective on *MATH*.
These findings motivated the adoption of the Textbook and Q&A styles in the final SwallowMath-v2 dataset.

<img src="rewriting-method.png" width="800"/>

### Q&A style example

Here is an example of a sample rewritten in the Q&A style.

````markdown
**Question 1**:
What is the length of the line segment connecting the points $(-2, 4)$ and $(-1, 1)$?

**Answer 1**:
The length of a line segment between two points $(x_1, y_1)$ and $(x_2, y_2)$ is calculated using the distance formula:
$$\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
For the points $(-2, 4)$ and $(-1, 1)$:
$$\sqrt{(-1 - (-2))^2 + (1 - 4)^2} = \sqrt{1^2 + (-3)^2} = \sqrt{1 + 9} = \sqrt{10}$$

**Code Implementation 1**:
```python
import math

# Calculate distance between (-2, 4) and (-1, 1)
x1, y1 = -2, 4
x2, y2 = -1, 1
distance = math.sqrt((x2 - x1)**2 + (y2 - y1)**2)
print(f"Distance = {distance:.4f} (exact value: √10 ≈ {math.sqrt(10):.4f})")
# Output: Distance = 3.1623 (exact value: √10 ≈ 3.1623)
```
````

### Textbook style example

Here is an example of a sample rewritten in the textbook style.

```markdown
## Background:
Arc length is a fundamental concept in calculus that measures the distance along a curved path. Unlike straight-line distances (which use the Pythagorean theorem), curved paths require integration because their slope continuously changes. The formula for arc length derives from approximating a curve with infinitesimally small straight-line segments and summing their lengths—a technique formalized in the 17th century with the development of calculus. This concept is essential in physics (e.g., calculating the path of a moving object), engineering (e.g., designing roads or bridges), and computer graphics (e.g., rendering smooth curves).

## Detailed Explanation:
The text explains the mathematical framework for computing the arc length of a curve defined by $y = f(x)$, using both general theory and a specific example. Below is a breakdown of each component, preserving all original formulas and data.

### General Formula
The arc length of a curve $y = f(x)$ is given by the integral:
$$
\int \sqrt{1 + \left(f'(x)\right)^2} \, dx
$$
This formula arises from the Pythagorean theorem applied to infinitesimal segments. For a tiny segment of the curve, the horizontal change is $dx$ and the vertical change is $dy = f'(x) \, dx$. The length of this segment is $\sqrt{(dx)^2 + (dy)^2} = \sqrt{(dx)^2 + (f'(x) \, dx)^2} = \sqrt{1 + (f'(x))^2} \, dx$. Summing these infinitesimal lengths via integration gives the total arc length.

### Example: Parabola $y = x^2$
For the parabola $y = x^2$ between $x = -2$ and $x = 2$, the arc length is computed as:
$$
\int_{-2}^{2} \sqrt{1 + (2x)^2} \, dx = \int_{-2}^{2} \sqrt{1 + 4x^2} \, dx
$$
**Derivation**:
- The derivative of $y = x^2$ is $f'(x) = 2x$.
- Substituting into the general formula gives $\sqrt{1 + (2x)^2} = \sqrt{1 + 4x^2}$.
- The limits $x = -2$ to $x = 2$ define the interval of integration.

This integral evaluates to approximately $9.7$ (as noted later in the text), though the exact value requires trigonometric substitution or numerical methods.

### Concept Check: Line Segment Length
The text verifies understanding with a simpler case: the straight-line distance between $(-2, 4)$ and $(-1, 1)$. The calculation is:
$$
\sqrt{(-1 - (-2))^2 + (1 - 4)^2} = \sqrt{1^2 + (-3)^2} = \sqrt{10}
$$
**Why this matters**:
- This is the discrete analog of the arc length formula. For a straight line, the derivative $f'(x)$ is constant, so the integral simplifies to the distance formula.
- Here, $\Delta x = 1$ and $\Delta y = -3$, matching the Pythagorean theorem $\sqrt{(\Delta x)^2 + (\Delta y)^2}$.

### Key Takeaways
1. **Approximation via line segments**:
   The arc length is approximated by summing tiny line segments:
   $$
   \sum \sqrt{(\Delta x)^2 + (\Delta y)^2}
   $$
   As $\Delta x \to 0$, this sum becomes the integral $\int \sqrt{1 + \left(\frac{dy}{dx}\right)^2} \, dx$. This is the foundation of the formula.

2. **Derivative's role**:
   The term $\frac{dy}{dx}$ (or $f'(x)$) accounts for the curve's slope. Steeper slopes increase the integrand, reflecting longer path lengths for the same horizontal distance.

3. **Specific application to $y = x^2$**:
   For $y = x^2$, $\frac{dy}{dx} = 2x$, so the integrand becomes $\sqrt{1 + (2x)^2} = \sqrt{1 + 4x^2}$. This shows how the derivative directly shapes the integral.

### Units of Arc Length
The text clarifies that **the unit of arc length matches the unit of the coordinate axes**. For example:
- If $x$ and $y$ are measured in inches, the arc length $\int_{-2}^{2} \sqrt{1 + 4x^2} \, dx \approx 9.7$ is also in inches.
- This holds because both $dx$ and $dy$ inherit the axis units, and the square root operation preserves dimensional consistency.

This principle ensures physical meaningfulness in real-world applications (e.g., calculating the length of a wire bent into a parabolic shape).
```

### 📈 Rewriting model scalability

We investigated whether **the scale of the rewriting model** influences the quality of the generated data.
Using identical prompts, we compared generations from **Qwen3-30B-A3B** and **Qwen3-235B-A22B**, observing the effect of model size on output quality.

Results (see figure below) indicate no significant improvement in downstream performance with larger rewriting models, suggesting that dataset quality is primarily governed by prompt design and rewriting style rather than model scale.<br>
<sub>† SwallowMath-v1, our previous dataset, was generated from FineMath-4+ using Llama-3.3-70B-Instruct.
It is therefore **not directly related** to the model scalability experiments presented here.
The dataset is relatively small (about 3.6 billion tokens, roughly one-tenth the size of SwallowMath-v2) and is shown only for reference.</sub>

<img src="swallow_math-v2-model-size.jpg" width="800"/>

## 📝 Considerations for Using the Data

### Social Impact of the Dataset

SwallowMath-v2 aims to democratize access to high-quality mathematical training data, fostering advancements in LLM mathematical reasoning.
By releasing an openly licensed dataset, we enhance transparency in the dataset improvement pipeline and lower barriers for training mathematically proficient models.

### Discussion of Biases

The dataset may inherit biases from FineMath-3+, including:

- Focus on English-language content.
- Potential over-representation of certain problem types (e.g., algebra relative to geometry).

## ⚖️ Licensing Information

SwallowMath-v2 is released under the **Apache-2.0** license.

## 👥 Contributors

The dataset was primarily developed by the following contributors:

- [Kazuki Fujii](https://www.linkedin.com/in/kazuki-fujii/): Designed the experiments, implemented the data pipeline, and conducted the experiments.
- [Yukito Tajima](https://www.linkedin.com/in/yukito-tajima-51bbb2299/): Implemented the data pipeline and optimized the inference pipeline (vLLM, TensorRT-LLM).
- [Masaki Kawamura](https://www.linkedin.com/in/masaki-kawamura-0806a7361/): Co-designed the experiments, evaluated the models, and performed visualization and analysis.

## 📖 Citation

```
@misc{fujii2025rewritingpretrainingdataboosts,
      title={Rewriting Pre-Training Data Boosts LLM Performance in Math and Code},
      author={Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
      year={2025},
      eprint={2505.02881},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.02881},
}
```

Provider:
maas
Created:
2025-11-07