SynthFormulaNet
ModelScope community · Updated 2026-01-06 · Indexed 2025-08-02
Download link:
https://modelscope.cn/datasets/ds4sd/SynthFormulaNet
Resource description:
# SynthFormulaNet
<div style="display: flex; justify-content: center; align-items: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/663e1254887b6f5645a0399f/CDdqkx3VNpNTC7naBhweh.png" alt="Formula Example" style="width: 1000px; height: auto; margin-right: 20px;">
</div>
**SynthFormulaNet** is a multimodal dataset designed for training the **SmolDocling** model. It contains over **6.4 million** pairs of synthetically rendered images of mathematical formulas and their corresponding LaTeX representations. The LaTeX data was collected from permissively licensed sources, and the images were generated using LaTeX at 120 DPI with diverse rendering styles, fonts, and layout configurations to maximize visual variability. This dataset also includes the [MathWriting](https://arxiv.org/pdf/2404.10690) dataset rendered at 120 DPI.
---
## Dataset Statistics
* **Total samples**: 6,452,704
* **Training set**: 6,130,068
* **Validation set**: 161,317
* **Test set**: 161,319
* **Modalities**: Image, Text
* **Image Generation**: Synthetic (LaTeX)
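
As a quick sanity check on the figures above, the following sketch sums the three splits and computes each split's share of the total. The variable and function names are illustrative only; the numbers are copied from the list above.

```python
# Split sizes as listed in the statistics above.
splits = {"train": 6_130_068, "validation": 161_317, "test": 161_319}


def split_shares(sizes):
    """Return each split's share of the total, in percent, rounded to 2 decimals."""
    n = sum(sizes.values())
    return {name: round(100 * size / n, 2) for name, size in sizes.items()}


total = sum(splits.values())
shares = split_shares(splits)

print(total)   # sum of the three splits
print(shares)  # percentage share per split
```

The splits add up to the stated total, and the partition is roughly 95% training, 2.5% validation, and 2.5% test.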
---
## Data Format
Each dataset entry is structured as follows:
```json
{
  "images": [PIL Image],
  "texts": [
    {
      "assistant": "<loc_x0><loc_y0><loc_x1><loc_y1>FORMULA</formula>",
      "source": "SynthFormulaNet",
      "user": "<formula>"
    }
  ]
}
```
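
The `assistant` string in the entry structure above packs four location tokens and the formula into one field. Below is a minimal sketch of how it might be split apart. It assumes that each location token carries an integer value (for example `<loc_12>`), which the card does not state explicitly; `parse_assistant`, `FormulaTarget`, and the sample string are hypothetical and not part of the dataset or any library.

```python
import re
from typing import NamedTuple, Tuple


class FormulaTarget(NamedTuple):
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) taken from the four loc tokens
    latex: str                      # formula text between the loc tokens and </formula>


# Assumption: loc tokens look like <loc_123>; the card only shows placeholders.
_TARGET_RE = re.compile(
    r"^<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>(.*)</formula>$",
    re.DOTALL,
)


def parse_assistant(text: str) -> FormulaTarget:
    """Split an assistant string into its bounding box and LaTeX formula."""
    match = _TARGET_RE.match(text.strip())
    if match is None:
        raise ValueError("string does not follow the expected loc/formula layout")
    x0, y0, x1, y1 = (int(value) for value in match.groups()[:4])
    return FormulaTarget((x0, y0, x1, y1), match.group(5).strip())


# Hypothetical sample built from the template above, not a real dataset row.
sample = r"<loc_0><loc_0><loc_500><loc_500>C _ { G } ( \Phi , \mathcal { E } ) \leq</formula>"
target = parse_assistant(sample)
print(target.box)
print(target.latex)
```

If the real tokens use a different value format, only the regular expression needs to change.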
Each formula has been normalized so that each LaTeX symbol is separated by a space.<br>
Example: <br>
C _ { G } ( \Phi , \mathcal { E } ) \leq <br>
Note: Equation numbers (e.g., "(1)", "(2)", "(a)" etc.) that are visually rendered alongside certain formulas are not included in the ground-truth LaTeX representations.
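
The space-separated normalization described above can be approximated with a small tokenizer. This is a sketch under assumptions, not the dataset's actual normalizer (which the card does not describe): it treats a backslash followed by letters as one symbol, a backslash followed by any other character as one symbol, and every other non-space character as its own symbol. The name `space_tokens` is illustrative.

```python
import re

# One token is: a control word (\alpha), a control symbol (\{ or \,),
# or any single non-whitespace character.
_TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\S")


def space_tokens(latex: str) -> str:
    """Return the LaTeX string with every symbol separated by a single space."""
    return " ".join(_TOKEN_RE.findall(latex))


compact = r"C_{G}(\Phi,\mathcal{E})\leq"
print(space_tokens(compact))
```

Applied to the compact form of the example, this reproduces the spaced string shown above, and running it on already-normalized text leaves it unchanged. Note that this simple rule also splits multi-digit numbers into single digits, which may or may not match the dataset's convention.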
---
## Intended Use
* Training multimodal models for **document understanding**, specifically:
  * Formula snippet extraction and transcription to LaTeX
---
## Citation
If you use SynthFormulaNet, please cite:
```bibtex
@article{nassar2025smoldocling,
  title={SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion},
  author={Nassar, Ahmed and Marafioti, Andres and Omenetti, Matteo and Lysak, Maksym and Livathinos, Nikolaos and Auer, Christoph and Morin, Lucas and de Lima, Rafael Teixeira and Kim, Yusik and Gurbuz, A Said and others},
  journal={arXiv preprint arXiv:2503.11576},
  year={2025}
}
@article{gervais2024mathwriting,
  title={Mathwriting: A dataset for handwritten mathematical expression recognition},
  author={Gervais, Philippe and Fadeeva, Anastasiia and Maksai, Andrii},
  journal={arXiv preprint arXiv:2404.10690},
  year={2024}
}
```
Provider: maas
Created: 2025-08-01



