UniMER_Dataset
Source: ModelScope community (魔搭社区). Updated 2025-12-03; indexed 2025-08-16.
Download link: https://modelscope.cn/datasets/Virgo-Internal/UniMER_Dataset
Resource description:
# UniMER Dataset
For detailed instructions on using the dataset, please refer to the project homepage: [UniMERNet Homepage](https://github.com/opendatalab/UniMERNet/tree/main)
## Introduction
The UniMER dataset is a specialized collection curated to advance the field of Mathematical Expression Recognition (MER). It encompasses the comprehensive UniMER-1M training set, featuring over one million instances that represent a diverse and intricate range of mathematical expressions, coupled with the UniMER Test Set, meticulously designed to benchmark MER models against real-world scenarios. The dataset details are as follows:
- **UniMER-1M Training Set:**
  - Total Samples: 1,061,791 LaTeX-image pairs
  - Composition: A balanced mix of concise and complex, extended formula expressions
  - Aim: To train robust, high-accuracy MER models, enhancing recognition precision and generalization
- **UniMER Test Set:**
  - Total Samples: 23,757, categorized into four types of expressions:
    - Simple Printed Expressions (SPE): 6,762 samples
    - Complex Printed Expressions (CPE): 5,921 samples
    - Screen Capture Expressions (SCE): 4,742 samples
    - Handwritten Expressions (HWE): 6,332 samples
  - Purpose: To provide a thorough evaluation of MER models across a spectrum of real-world conditions
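
Because the test set is split into four expression types, results are most informative when reported per subset rather than as a single number. The following is a minimal sketch of that bookkeeping, under stated assumptions: the `(subset, predicted, reference)` record shape, the `exact_match_by_subset` helper, and the toy records are all hypothetical, and whitespace-normalized exact match is only a simple stand-in, not the benchmark's official metric.

```python
from collections import defaultdict

# The four test-set categories named in the dataset card.
SUBSETS = ("SPE", "CPE", "SCE", "HWE")


def normalize(latex: str) -> str:
    """Collapse all whitespace so spacing differences do not count as errors."""
    return " ".join(latex.split())


def exact_match_by_subset(records):
    """records: iterable of (subset, predicted_latex, reference_latex) tuples.

    Returns {subset: exact-match rate} for every subset that has samples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subset, predicted, reference in records:
        if subset not in SUBSETS:
            raise ValueError(f"unknown subset: {subset!r}")
        totals[subset] += 1
        # A bool adds as 0 or 1, so this counts matching pairs.
        hits[subset] += normalize(predicted) == normalize(reference)
    return {s: hits[s] / totals[s] for s in SUBSETS if totals[s]}


# Hypothetical toy records, not taken from the dataset.
demo = [
    ("SPE", "x^{2} + 1", "x^{2}  +  1"),
    ("SPE", "a + b", "a - b"),
    ("HWE", r"\frac{1}{2}", r"\frac{1}{2}"),
]
scores = exact_match_by_subset(demo)
print(scores)
```

Keeping the subsets separate makes it visible when a model does well on printed formulas but poorly on handwritten ones, which a pooled score would hide.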
## Data Statistics
| **Dataset** | **Sub** | **Source** | **Sample Size** |
|:-----------:|:-------:|:-------------------------------------------:|:---------------:|
| UniMER-1M | | Pix2tex Train | 158,303 |
| | | Arxiv † | 820,152 |
| | | CROHME Train | 8,834 |
| | | HME100K Train ‡ | 74,502 |
| UniMER-Test | SPE | Pix2tex Validation | 6,762 |
| | CPE | Arxiv † | 5,921 |
| | SCE | PDF Screenshot † | 4,742 |
| | HWE | CROHME & HME100K | 6,332 |

† Indicates data collected, processed, and annotated by our team.

‡ For copyright compliance, please manually download this dataset portion: [HME100K dataset](https://ai.100tal.com/dataset).
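
As a quick consistency check, the per-source counts in the table above can be summed and compared with the stated totals. This is a small sketch: the dictionary keys simply mirror the table rows, and the `shares` helper is an illustrative name rather than part of any released tooling.

```python
# Sample counts copied from the Data Statistics table.
TRAIN_SOURCES = {
    "Pix2tex Train": 158_303,
    "Arxiv": 820_152,
    "CROHME Train": 8_834,
    "HME100K Train": 74_502,
}
TEST_SUBSETS = {"SPE": 6_762, "CPE": 5_921, "SCE": 4_742, "HWE": 6_332}


def shares(counts):
    """Return (total, {name: percentage of the total, rounded to one decimal})."""
    total = sum(counts.values())
    return total, {name: round(100 * n / total, 1) for name, n in counts.items()}


train_total, train_shares = shares(TRAIN_SOURCES)
test_total, test_shares = shares(TEST_SUBSETS)

print(train_total, train_shares)
print(test_total, test_shares)
```

The sums match the stated totals of 1,061,791 training pairs and 23,757 test samples. The Arxiv portion accounts for roughly 77% of the training set, and 987,289 training samples remain if the manually downloaded HME100K portion is left out.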
## Acknowledgements
We would like to express our gratitude to the creators of the [Pix2tex](https://github.com/lukas-blecher/LaTeX-OCR), [CROHME](https://www.cs.rit.edu/~rlaz/files/CROHME+TFD%E2%80%932019.pdf), and [HME100K](https://github.com/tal-tech/SAN) datasets. Their foundational work has significantly contributed to the development of the UniMER dataset.
A new metric for evaluating this dataset is presented in [CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation](https://huggingface.co/papers/2409.03643).
## Citations
```text
@misc{wang2024unimernet,
title={UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition},
author={Bin Wang and Zhuangcheng Gu and Chao Xu and Bo Zhang and Botian Shi and Conghui He},
year={2024},
eprint={2404.15254},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{conghui2022opendatalab,
author={He, Conghui and Li, Wei and Jin, Zhenjiang and Wang, Bin and Xu, Chao and Lin, Dahua},
title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
howpublished = {\url{https://opendatalab.com}},
year={2022}
}
```
Provider: maas
Created: 2025-08-15
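Background overview:
UniMER_Dataset is a large-scale dataset for mathematical expression recognition (MER), with 1,061,791 LaTeX-image training pairs and 23,757 test samples. The test samples fall into four types: simple printed, complex printed, screen capture, and handwritten. The dataset is intended for training MER models and evaluating their recognition ability in real-world scenarios, and its data comes from several sources, including public datasets and data collected by the authors' team.
Summary collected and generated by 遇见数据集.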