finemath
Source: ModelScope community (updated 2026-05-02, indexed 2024-12-21)
Download link:
https://modelscope.cn/datasets/AI-ModelScope/finemath
# 📐 FineMath

## What is it?
📐 FineMath consists of **34B tokens** (FineMath-3+) and **54B tokens** (FineMath-3+ with InfiMM-WebMath-3+) of mathematical educational content filtered from CommonCrawl. To curate this dataset, we trained a mathematical content [classifier](https://huggingface.co/HuggingFaceTB/finemath-classifier) using annotations generated by Llama-3.1-70B-Instruct. We used the classifier to retain only the most educational mathematics content, focusing on clear explanations and step-by-step problem solving rather than advanced academic papers.
The [Dataset Curation](#dataset-curation) section details the process for creating the dataset. More details are in our paper: https://arxiv.org/abs/2502.02737v1.
<img src="assets/train_curves.png" width="800"/>
## What is being released?
The dataset is released in two versions:
- **FineMath-3+**: 34B tokens, 21.4M documents containing mathematical reasoning and problem solving, formatted with Markdown and LaTeX.
- **FineMath-4+** (a subset of FineMath-3+): 9.6B tokens, 6.7M documents of higher quality with detailed explanations. Models trained on this subset perform better on GSM8k and MATH than models trained on FineMath-3+.
We also release a filtered English text-only portion of the **[InfiMM-WebMath-40B](https://huggingface.co/datasets/Infi-MM/InfiMM-WebMath-40B)** dataset, classified using the same approach as FineMath:
- **InfiMM-WebMath-3+**: 20.5B tokens, 13.9M documents.
- **InfiMM-WebMath-4+** (a subset of InfiMM-WebMath-3+): 8.5B tokens, 6.3M documents.
## How to load the dataset
Use one of the available configs: `finemath-3plus`, `finemath-4plus`, `infiwebmath-3plus`, or `infiwebmath-4plus`.
```python
from datasets import load_dataset
# Load the high-quality subset
data = load_dataset("HuggingFaceTB/finemath", "finemath-4plus", split="train", num_proc=8)
# Or load the larger subset
data = load_dataset("HuggingFaceTB/finemath", "finemath-3plus", split="train", num_proc=8)
```
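Each row carries an `int_score` and a `token_count` field (see the Dataset Schema section), so a loaded subset can be trimmed further on the consumer side. The helper below is a minimal sketch of that idea and is not part of the release: the function name `take_token_budget` and its default threshold are hypothetical, and it only assumes rows are dictionaries with those two fields.

```python
from typing import Dict, Iterable, Iterator


def take_token_budget(
    rows: Iterable[Dict],
    max_tokens: int,
    min_int_score: int = 4,
) -> Iterator[Dict]:
    """Yield rows with int_score >= min_int_score until the token budget is spent.

    Hypothetical helper: `rows` is any iterable of dicts that expose the
    `int_score` and `token_count` fields listed in the dataset schema.
    """
    used = 0
    for row in rows:
        if row["int_score"] < min_int_score:
            continue  # skip documents below the requested quality level
        if used + row["token_count"] > max_tokens:
            break  # stop before exceeding the budget
        used += row["token_count"]
        yield row
```

Because the helper consumes rows lazily, it should also work on an iterable dataset obtained by passing `streaming=True` to `load_dataset`, which avoids downloading a full config before sampling.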
## Dataset curation
Recent language models like DeepSeekMath and MathStral have demonstrated strong mathematical capabilities, but they were trained on specialized datasets that aren't publicly available. We developed a pipeline to identify and extract high-quality mathematical content from CommonCrawl, with several iterations of refinement to improve quality.
### Phase 1: Initial content extraction and classification
We began by re-extracting pages from CommonCrawl WARCs using URLs from the FineWeb dataset, collecting both the latest and largest versions of each page to capture the evolution of pages across the years.
Unlike FineWeb, which uses Trafilatura, we employed Resiliparse for text extraction, as it better preserves forum discussions and Q&A answers that often contain crucial reasoning steps and solutions.
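Keeping both the latest and the largest capture of each URL can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: "largest" is measured here by `warc_record_length`, the labels `"latest"` and `"largest"` for `snapshot_type` are assumed values, and `pick_snapshots` is a hypothetical name.

```python
from typing import Dict, List


def pick_snapshots(captures: List[Dict]) -> List[Dict]:
    """Return the latest and the largest capture for every URL.

    Assumption: each capture is a dict with 'url', 'fetch_time' and
    'warc_record_length'; size is compared by WARC record length.
    """
    latest: Dict[str, Dict] = {}
    largest: Dict[str, Dict] = {}
    for cap in captures:
        url = cap["url"]
        if url not in latest or cap["fetch_time"] > latest[url]["fetch_time"]:
            latest[url] = cap
        if (
            url not in largest
            or cap["warc_record_length"] > largest[url]["warc_record_length"]
        ):
            largest[url] = cap

    selected = []
    for url, cap in latest.items():
        selected.append({**cap, "snapshot_type": "latest"})
        # Emit the largest capture only when it is a different capture.
        if largest[url] is not cap:
            selected.append({**largest[url], "snapshot_type": "largest"})
    return selected
```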
For initial quality assessment, we used [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) to generate annotations on a 3-point scale:
1. Contains general mathematical content
2. Shows logical reasoning in mathematical context
3. Contains clear step-by-step solutions at appropriate level
A `multilingual-e5-small`-based classifier finetuned on these annotations was used to score the initial corpus.
However, this first version performed below the OpenWebMath baseline, leading to several important refinements.
### Phase 2: Recalling more candidate pages
Analysis revealed that FineWeb's C4 filter removes pages containing '{' characters, inadvertently filtering out content with LaTeX notation. To address this and expand coverage, we:
1. Identified promising website domains by selecting those where at least 10% of pages received a classifier score ≥ 2
2. Added URLs from OpenWebMath and InfiMM-WebMath datasets
3. Recovered URLs of pages filtered by FineWeb's '{' rule from its rejection logs
4. Re-extracted all content from scratch using the [OpenWebMath pipeline](https://github.com/keirp/OpenWebMath), which properly handles mathematical notation across various HTML markup formats and standardizes them to LaTeX
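The domain-selection rule in step 1 can be sketched as below. This is a minimal illustration and not the original implementation: it assumes a "domain" is the lowercased network location of the URL and that each page arrives as a `(url, classifier_score)` pair; the function name `promising_domains` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple
from urllib.parse import urlparse


def promising_domains(
    scored_pages: Iterable[Tuple[str, float]],
    min_score: float = 2,
    min_fraction: float = 0.10,
) -> Set[str]:
    """Return domains where at least `min_fraction` of pages score >= `min_score`."""
    totals = defaultdict(int)  # pages seen per domain
    hits = defaultdict(int)    # pages at or above the score threshold
    for url, score in scored_pages:
        domain = urlparse(url).netloc.lower()  # assumption: domain == netloc
        totals[domain] += 1
        if score >= min_score:
            hits[domain] += 1
    return {d for d, n in totals.items() if hits[d] / n >= min_fraction}
```

All pages from the returned domains would then become candidates for re-extraction, including pages the first classifier scored low.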
### Phase 3: Refined quality assessment
The expanded corpus underwent a more fine-grained quality evaluation.
Once again, we used Llama-3.1-70B-Instruct to score a sample of newly extracted pages, this time on a 5-point scale (the full prompt is available [here](assets/prompt.txt)).
We finetuned a new [classifier](https://huggingface.co/HuggingFaceTB/finemath-classifier) on these annotations and scored the entire corpus.
After keeping only pages with a score of 3 or higher, and deduplicating the samples using simple single-band MinHash-LSH, we obtained FineMath-3+ with 34B tokens.
The same classifier was applied to InfiMM-WebMath's text content, focusing on reasoning rather than advanced mathematics.
Both datasets were additionally filtered using FineWeb's language classification pipeline to remove non-English content.
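Single-band MinHash-LSH deduplication can be illustrated with the standard library alone. In this sketch the whole MinHash signature serves as one bucket key, so two documents collide only when every hash minimum matches. The shingle size, the number of hash functions, and the use of salted BLAKE2b are assumptions made for illustration; the source does not state the parameters that were actually used.

```python
import hashlib
from typing import Iterable, List, Set, Tuple


def shingles(text: str, n: int = 5) -> Set[str]:
    """Lowercased word n-grams; short texts collapse to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_hashes: int = 8, n: int = 5) -> Tuple[int, ...]:
    """One minimum per salted hash function over the document's shingles."""
    grams = shingles(text, n)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return tuple(signature)


def dedup_single_band(docs: Iterable[str], num_hashes: int = 8, n: int = 5) -> List[str]:
    """Keep the first document seen in each bucket (bucket key = full signature)."""
    seen = set()
    kept = []
    for doc in docs:
        key = minhash_signature(doc, num_hashes, n)
        if key in seen:
            continue
        seen.add(key)
        kept.append(doc)
    return kept
```

With a single band, fewer hash functions make the match looser and more hash functions make it stricter, which is the main knob in this simplified setup.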
### Decontamination
Following Qwen2.5-Math's approach, we removed samples with 13-gram overlaps against test sets from GSM8k, MATH, MMLU and ARC. Decontamination logs are available at [HuggingFaceTB/finemath_contamination_report](https://huggingface.co/datasets/HuggingFaceTB/finemath_contamination_report).
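A 13-gram overlap check can be sketched as follows. The normalization used here (lowercasing and whitespace tokenization) is an assumption, since the text above does not specify it, and the function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 13) -> Set[NGram]:
    """Word n-grams after lowercasing and whitespace splitting (assumed normalization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(test_samples: Iterable[str], n: int = 13) -> Set[NGram]:
    """Union of all n-grams found in the benchmark test samples."""
    index: Set[NGram] = set()
    for sample in test_samples:
        index |= ngrams(sample, n)
    return index


def is_contaminated(document: str, index: Set[NGram], n: int = 13) -> bool:
    """True when the document shares at least one n-gram with the benchmark index."""
    return not ngrams(document, n).isdisjoint(index)
```

Documents for which `is_contaminated` returns `True` would be dropped from the training corpus. Texts shorter than 13 tokens produce no n-grams and therefore never match.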
## Results and Performance
<img src="assets/eval_bar.png" width="600"/>
Our evaluations show several key findings:
1. FineMath-3+ outperforms the base InfiWebMath on GSM8k and MATH benchmarks
2. FineMath-4+ demonstrates superior performance compared to both FineMath-3+ and InfiWebMath-4+ on GSM8k and MATH
3. Combining the datasets (50% FineMath-3+ with 50% InfiWebMath-3+) yields approximately 50B tokens while matching the performance of FineMath-3+
4. Deduplicating the pages repeated between FineMath and InfiWebMath reduces performance compared to a non-deduplicated combination
## Dataset Schema
```python
{
'url': string, # Source page URL
'fetch_time': int64, # Crawler timestamp
'content_mime_type': string, # MIME type
'warc_filename': string, # Common Crawl WARC source file
'warc_record_offset': int32, # WARC record offset, in bytes
'warc_record_length': int32, # WARC record size, in bytes
'text': string, # Page content
'token_count': int32, # Number of Llama tokens
'char_count': int32, # Character count
'metadata': string, # Additional OpenWebMath metadata
'score': float64, # Raw quality score
'int_score': int64, # Integer quality score
'crawl': string, # Common Crawl crawl identifier
'snapshot_type': string, # Whether the page is the latest or the largest for this URL
'language': string, # Document language
'language_score': float64 # LangID probability
}
```
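The three `warc_*` fields are enough to locate the original capture inside Common Crawl. The sketch below only builds the URL and the HTTP `Range` header for such a request and performs no network access. It assumes that `warc_filename` is a path relative to Common Crawl's public HTTP endpoint `https://data.commoncrawl.org/`; the function name and the sample row are hypothetical.

```python
from typing import Dict, Tuple


def warc_range_request(
    row: Dict,
    base_url: str = "https://data.commoncrawl.org/",  # assumed public endpoint
) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) addressing exactly one WARC record.

    HTTP byte ranges are inclusive on both ends, hence the `- 1`.
    """
    start = row["warc_record_offset"]
    end = start + row["warc_record_length"] - 1
    url = base_url + row["warc_filename"]
    return url, {"Range": f"bytes={start}-{end}"}


# Hypothetical row, shaped like the schema above.
example_row = {
    "warc_filename": "crawl-data/EXAMPLE/warc/example.warc.gz",
    "warc_record_offset": 1000,
    "warc_record_length": 500,
}
url, headers = warc_range_request(example_row)
```

The returned pair can be handed to any HTTP client; the response body would be a single compressed WARC record that still needs to be decompressed and parsed.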
## Considerations for Using the Data
### Social Impact of Dataset
With the release of this dataset, we aim to make high-quality mathematical educational content more accessible to the machine learning community. While multiple language models have demonstrated strong mathematical capabilities, the datasets used to train these capabilities are often not publicly available. By releasing FineMath, we hope to:
- Make the dataset creation process more transparent
- Reduce the barrier to entry for training models with strong mathematical capabilities
- Provide a benchmark for mathematical content quality filtering
### Discussion of Biases
The dataset may have certain inherent biases:
- Focus on English language content
- Emphasis on popular educational approaches to mathematics
- Bias towards certain types of mathematical notation and formatting
### Other Known Limitations
- The dataset is limited to English language content
- The filtering criteria may not capture advanced mathematical content (e.g. advanced research subjects)
- Some mathematical notation (e.g. image-based) may not be preserved
- Long-form content may have varying quality even within high-scoring documents
## Licensing Information
The dataset is released under the **Open Data Commons Attribution License (ODC-By) v1.0** [license](https://opendatacommons.org/licenses/by/1-0/). The use of this dataset is also subject to [CommonCrawl's Terms of Use](https://commoncrawl.org/terms-of-use).
## Future work
There are several avenues for future work:
- Expand language coverage beyond English
- Improve mathematical notation extraction and preservation
- Develop more sophisticated quality metrics
- Create specialized subsets for different educational levels
## Citation Information
```
@misc{allal2025smollm2smolgoesbig,
title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Zakka and Mathieu Morlon and Colin Raffel and Leandro von Werra and Thomas Wolf},
year={2025},
eprint={2502.02737},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.02737},
}
```
Provider: maas
Created: 2024-12-20



