InfiMM-WebMath-40B
Source: ModelScope community. Updated 2026-01-02; indexed 2024-09-28.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/InfiMM-WebMath-40B
Resource description:
# InfiMM-WebMath-40B Dataset
[ArXiv](https://arxiv.org/abs/2409.12568) | [PDF](https://arxiv.org/pdf/2409.12568)
This dataset is also discussed in the survey paper [A Survey of Deep Learning for Geometry Problem Solving](https://huggingface.co/papers/2507.11936).
The accompanying reading list/code for the survey can be found at: https://github.com/majianz/gps-survey
**InfiMM-WebMath-40B** is a large-scale, open-source multimodal dataset specifically designed for mathematical reasoning tasks. It incorporates both text and images, extracted from web documents, to advance the pre-training of Multimodal Large Language Models (MLLMs). The dataset is tailored to support sophisticated reasoning tasks that involve understanding both text and visual elements like diagrams, figures, and geometric plots.
## Dataset Overview
The **InfiMM-WebMath-40B** dataset includes:
- **24 million** web documents.
- **85 million** image URLs.
- **40 billion** text tokens.
These documents were sourced from **Common Crawl** data snapshots (2019–2023), filtered to focus on high-quality mathematical and scientific content in both English and Chinese.
## Data Structure
The dataset is organized in a format that captures both text and images in their original order, ensuring accurate interleaving between the two modalities. The structure is as follows:
```json
{
  "URL": "...",          # The URL of the source document.
  "text_list": [...],    # List of extracted text segments; None where the element is an image.
  "image_list": [...],   # List of image URLs; None where the element is a text segment.
  "metadata": {          # Information about the extraction process (e.g., processing details, timestamps).
    "ft_lang_label",             # Language label detected by fastText
    "ft_lang_prob",              # Probability of the language label detected by fastText
    "math_prob",                 # First-round math content detection with the high-recall FastText model
    "size",
    "snap",                      # Timestamp of the Common Crawl snapshot
    "text_gpt3_token_len",
    "char_repetition_ratio",
    "word_repetition_ratio",
    "special_character_ratio",
    "punctuation_ratio",
    "nsfw_num_words",            # Number of NSFW words
    "has_unicode_error",         # Whether the document contains any Unicode error
    "math_prob_llama3",          # Probability from second-round math detection with the high-precision FastText model
  }
}
```
### Interleaved Text and Images
The **text_list** and **image_list** are designed as parallel arrays, maintaining the sequence of the document. This interleaving structure allows models to reconstruct the flow of the original document:
- **If `text_list[i]` contains text**, then `image_list[i]` is `None`, indicating that the content at this position is text.
- **If `text_list[i]` is `None`**, then `image_list[i]` contains a URL to an image at that position in the document.
This interleaving of text and images ensures that models trained on this dataset can process the content in the same way a human would, following the logical flow between text explanations and accompanying visual aids.
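The parallel-array layout described above can be walked with a short helper. This is a minimal sketch, not part of the official tooling: the function name `iter_segments` and the sample document are illustrative assumptions, and the strict validation (raising on positions that do not match the documented layout) is a design choice rather than documented behavior.

```python
from typing import Iterator, Optional, Sequence, Tuple


def iter_segments(
    text_list: Sequence[Optional[str]],
    image_list: Sequence[Optional[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield ("text", content) or ("image", url) pairs in document order."""
    if len(text_list) != len(image_list):
        raise ValueError("text_list and image_list must have the same length")
    for index, (text, image_url) in enumerate(zip(text_list, image_list)):
        if text is not None and image_url is None:
            yield ("text", text)
        elif text is None and image_url is not None:
            yield ("image", image_url)
        else:
            # Both set or both None: outside the documented layout.
            raise ValueError(f"position {index} is neither pure text nor pure image")


# Hypothetical sample document following the documented structure.
sample_doc = {
    "text_list": ["Consider the triangle below.", None, "Its angles sum to 180 degrees."],
    "image_list": [None, "https://example.com/triangle.png", None],
}

segments = list(iter_segments(sample_doc["text_list"], sample_doc["image_list"]))
for kind, value in segments:
    print(kind, value)
```

Real records may contain edge cases this sketch rejects, so a more tolerant loader might skip malformed positions instead of raising.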
## Data Collection and Filtering Pipeline
The **InfiMM-WebMath-40B** dataset was created through a comprehensive multi-stage filtering and extraction process, starting with over 120 billion web pages from the Common Crawl repository. The key steps in this pipeline are outlined below:
1. **Language Filtering**: The first step involved filtering for English and Chinese content. We utilized **Trafilatura** to extract text from web pages, and **LangDetect** to efficiently identify the language, ensuring only content in these two languages was retained.
2. **High Recall Math Filtering**: To capture as much math-related content as possible, we employed a modified version of **Resiliparse** for HTML parsing. In conjunction with a FastText model optimized for high recall, this phase ensured that any potentially mathematical content was preserved.
3. **Deduplication**: MinHash was used for fuzzy text deduplication, and exact matching of web page URLs was used to remove duplicates across neighboring Common Crawl snapshots.
4. **Rule-Based Filtering**: This step applied specific filtering rules to remove irrelevant or low-quality content, such as documents containing NSFW material or boilerplate “lorem ipsum,” enhancing the dataset’s overall quality.
5. **High Precision Math Filtering**: A second pass was performed using a FastText model, this time tuned for high precision, to ensure only highly relevant mathematical content remained in the dataset. This refinement step further improved the dataset’s focus and relevance for mathematical reasoning tasks.
6. **Image Filtering**: Finally, rule-based filtering was applied to images, removing irrelevant or extraneous visuals (e.g., logos, banners) to ensure that the remaining images were aligned with the mathematical content.
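Step 3 above names MinHash for fuzzy text deduplication. The sketch below illustrates the general technique only, using the standard library; the word 3-gram shingles, the 64 hash functions, and the salted BLAKE2b hash are assumptions chosen for illustration and are not the settings used to build the dataset.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    if len(words) <= n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _seeded_hash(seed: int, item: str) -> int:
    """64-bit hash of `item`; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: Set[str], num_perm: int = 64) -> List[int]:
    """Keep the minimum hash value per hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(seed, item) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical documents: doc_b is a near-duplicate of doc_a, doc_c is unrelated.
doc_a = "the sum of the interior angles of a triangle is 180 degrees"
doc_b = "the sum of the interior angles of any triangle is 180 degrees"
doc_c = "a prime number has exactly two distinct positive divisors"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("a vs b:", estimated_jaccard(sig_a, sig_b))
print("a vs c:", estimated_jaccard(sig_a, sig_c))
```

In a deduplication pass, pairs whose estimated similarity exceeds a chosen threshold would be treated as duplicates; at the scale described here, signatures are typically bucketed with locality-sensitive hashing rather than compared pairwise.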
## How to Use the Dataset
1. **Base Text Download**: The dataset is available for download as a set of web documents with interleaved text and image URLs.
2. **Image Download**: Users need to download images according to the image URLs provided.
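Because only image URLs are distributed, each user needs a download step of their own. The following is a minimal standard-library sketch under stated assumptions: the helper names `local_image_path` and `download_images`, the hashed directory layout, and the extension allow-list are all hypothetical, and many URLs from older crawl snapshots may no longer resolve.

```python
import hashlib
import http.client
import os
import urllib.request
from typing import Dict, Optional, Sequence
from urllib.parse import urlparse

# Assumed allow-list of file extensions to preserve on disk.
KNOWN_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"}


def local_image_path(url: str, root: str = "images") -> str:
    """Map an image URL to a deterministic local path (hypothetical layout)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext not in KNOWN_EXTENSIONS:
        ext = ""
    return os.path.join(root, digest[:2], digest + ext)


def download_images(
    image_list: Sequence[Optional[str]],
    root: str = "images",
    timeout: float = 10.0,
) -> Dict[str, Optional[str]]:
    """Download every non-None URL; map each URL to its local path or None on failure."""
    results: Dict[str, Optional[str]] = {}
    for url in image_list:
        if url is None:  # text position in the interleaved layout
            continue
        path = local_image_path(url, root)
        if not os.path.exists(path):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as response:
                    data = response.read()
            except (OSError, ValueError, http.client.HTTPException):
                results[url] = None  # dead link, timeout, or malformed URL
                continue
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as out:
                out.write(data)
        results[url] = path
    return results


example_path = local_image_path("https://example.com/figs/triangle.PNG?size=large")
print(example_path)
```

Calling `download_images(doc["image_list"])` on a record would fetch its images sequentially; a production downloader would likely add concurrency, retries, and content-type checks.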
### Note
If you need a higher-precision subset, apply higher thresholds to the `math_prob` and `math_prob_llama3` fields in `metadata`. Raising the thresholds trades data volume for precision.
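The thresholding on `math_prob` and `math_prob_llama3` can be sketched as a simple filter. This assumes each record's `metadata` is a dictionary holding both fields as numeric scores; the function name, the sample records, and the threshold values are illustrative only and are not recommended settings.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_by_math_prob(
    docs: Iterable[Dict[str, Any]],
    min_math_prob: float = 0.0,
    min_math_prob_llama3: float = 0.0,
) -> Iterator[Dict[str, Any]]:
    """Yield documents whose two math scores both meet the given thresholds."""
    for doc in docs:
        meta = doc.get("metadata") or {}
        # Missing scores are treated as 0.0, so such documents fail any positive threshold.
        if (
            meta.get("math_prob", 0.0) >= min_math_prob
            and meta.get("math_prob_llama3", 0.0) >= min_math_prob_llama3
        ):
            yield doc


# Hypothetical records with made-up scores.
sample_docs = [
    {"URL": "https://example.com/a", "metadata": {"math_prob": 0.95, "math_prob_llama3": 0.90}},
    {"URL": "https://example.com/b", "metadata": {"math_prob": 0.80, "math_prob_llama3": 0.40}},
    {"URL": "https://example.com/c", "metadata": {"math_prob": 0.30, "math_prob_llama3": 0.85}},
]

kept = list(filter_by_math_prob(sample_docs, min_math_prob=0.5, min_math_prob_llama3=0.8))
print([doc["URL"] for doc in kept])
```

Both scores must pass, so the filter only ever shrinks the set; tightening either threshold yields a smaller subset.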
## License
**InfiMM-WebMath-40B** is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU: [https://commoncrawl.org/terms-of-use/](https://commoncrawl.org/terms-of-use/). We do not alter the license of any of the underlying data.
## Citation
```
@misc{han2024infimmwebmath40badvancingmultimodalpretraining,
title={InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning},
author={Xiaotian Han and Yiren Jian and Xuefeng Hu and Haogeng Liu and Yiqi Wang and Qihang Fan and Yuang Ai and Huaibo Huang and Ran He and Zhenheng Yang and Quanzeng You},
year={2024},
eprint={2409.12568},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2409.12568},
}
```
Provider:
maas
Created:
2024-09-22



