UniMER_Dataset

Name: UniMER_Dataset
Creator: maas
Published: 2025-12-03 17:22:24
License: No description provided

ModelScope community: updated 2025-12-03, indexed 2025-08-16

Download link:

https://modelscope.cn/datasets/Virgo-Internal/UniMER_Dataset

Resource description:

# UniMER Dataset

For detailed instructions on using the dataset, please refer to the project homepage: [UniMERNet Homepage](https://github.com/opendatalab/UniMERNet/tree/main)

## Introduction

The UniMER dataset is a collection curated to advance Mathematical Expression Recognition (MER). It comprises the UniMER-1M training set, with over one million instances covering a diverse and intricate range of mathematical expressions, and the UniMER Test Set, designed to benchmark MER models against real-world scenarios. The dataset details are as follows:

- **UniMER-1M Training Set:**
  - Total Samples: 1,061,791 LaTeX-image pairs
  - Composition: A balanced mix of concise expressions and complex, extended formulas
  - Aim: To train robust, high-accuracy MER models, improving recognition precision and generalization
- **UniMER Test Set:**
  - Total Samples: 23,757, categorized into four types of expressions:
    - Simple Printed Expressions (SPE): 6,762 samples
    - Complex Printed Expressions (CPE): 5,921 samples
    - Screen Capture Expressions (SCE): 4,742 samples
    - Handwritten Expressions (HWE): 6,332 samples
  - Purpose: To provide a thorough evaluation of MER models across a spectrum of real-world conditions

## Visual Data Samples

![UniMER-Test](https://github.com/opendatalab/UniMERNet/assets/69186975/7301df68-e14c-4607-81bc-b6ee3ba1780b)

## Data Statistics

| **Dataset** | **Subset** | **Source** | **Sample Size** |
|:-----------:|:----------:|:----------:|:---------------:|
| UniMER-1M | | Pix2tex Train | 158,303 |
| | | arXiv † | 820,152 |
| | | CROHME Train | 8,834 |
| | | HME100K Train ‡ | 74,502 |
| UniMER-Test | SPE | Pix2tex Validation | 6,762 |
| | CPE | arXiv † | 5,921 |
| | SCE | PDF Screenshot † | 4,742 |
| | HWE | CROHME & HME100K | 6,332 |

† Indicates data collected, processed, and annotated by our team.

‡ For copyright compliance, please manually download this dataset portion: [HME100K dataset](https://ai.100tal.com/dataset).

## Acknowledgements

We would like to express our gratitude to the creators of the [Pix2tex](https://github.com/lukas-blecher/LaTeX-OCR), [CROHME](https://www.cs.rit.edu/~rlaz/files/CROHME+TFD%E2%80%932019.pdf), and [HME100K](https://github.com/tal-tech/SAN) datasets. Their foundational work has significantly contributed to the development of the UniMER dataset.

A new metric for evaluating this dataset is presented in [CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation](https://huggingface.co/papers/2409.03643).

## Citations

```text
@misc{wang2024unimernet,
  title={UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition},
  author={Bin Wang and Zhuangcheng Gu and Chao Xu and Bo Zhang and Botian Shi and Conghui He},
  year={2024},
  eprint={2404.15254},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{conghui2022opendatalab,
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Wang, Bin and Xu, Chao and Lin, Dahua},
  title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
  howpublished = {\url{https://opendatalab.com}},
  year={2022}
}
```
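The per-source counts in the Data Statistics table can be cross-checked against the stated totals. The following is a minimal sketch, not part of the UniMERNet tooling: the constant and function names (`TRAIN_SOURCES`, `TEST_SUBSETS`, `shares`) are hypothetical, and the numbers are simply copied from the table above.

```python
# Sample counts copied from the Data Statistics table.
# All names here are illustrative; they are not part of any UniMER API.
TRAIN_SOURCES = {
    "Pix2tex Train": 158_303,
    "Arxiv": 820_152,
    "CROHME Train": 8_834,
    "HME100K Train": 74_502,
}
TEST_SUBSETS = {
    "SPE": 6_762,  # Simple Printed Expressions
    "CPE": 5_921,  # Complex Printed Expressions
    "SCE": 4_742,  # Screen Capture Expressions
    "HWE": 6_332,  # Handwritten Expressions
}
TRAIN_TOTAL = 1_061_791  # stated total for UniMER-1M
TEST_TOTAL = 23_757      # stated total for the UniMER Test Set


def shares(counts):
    """Return each entry's fraction of the summed counts."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    # The per-source rows should add up to the stated totals.
    assert sum(TRAIN_SOURCES.values()) == TRAIN_TOTAL
    assert sum(TEST_SUBSETS.values()) == TEST_TOTAL

    for label, counts in (("train", TRAIN_SOURCES), ("test", TEST_SUBSETS)):
        for name, frac in shares(counts).items():
            print(f"{label:>5} {name:<14} {frac:6.1%}")
```

Running the check confirms that both sets of rows sum to the published totals. The computed shares also show that the arXiv-derived portion makes up a little over three quarters of the training set, which may matter when comparing results across the four test subsets.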


Provider:

maas

Created:

2025-08-15

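Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

UniMER_Dataset is a large-scale dataset for Mathematical Expression Recognition (MER). It contains 1,061,791 LaTeX-image training pairs and 23,757 test samples; the test samples fall into four types: simple printed, complex printed, screen capture, and handwritten. The dataset is intended for training and evaluating MER models under real-world conditions, and it draws on diverse sources, including public datasets and data collected by the team itself.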

The content above was collected and summarized by 遇见数据集.
