UniMER_Dataset

ModelScope Community · Updated 2025-12-03 · Indexed 2025-08-16
Download link:
https://modelscope.cn/datasets/Virgo-Internal/UniMER_Dataset
Resource description:
# UniMER Dataset

For detailed instructions on using the dataset, please refer to the project homepage: [UniMERNet Homepage](https://github.com/opendatalab/UniMERNet/tree/main)

## Introduction

The UniMER dataset is a specialized collection curated to advance the field of Mathematical Expression Recognition (MER). It encompasses the comprehensive UniMER-1M training set, featuring over one million instances that represent a diverse and intricate range of mathematical expressions, coupled with the UniMER Test Set, meticulously designed to benchmark MER models against real-world scenarios. The dataset details are as follows:

- **UniMER-1M Training Set:**
  - Total Samples: 1,061,791 LaTeX-image pairs
  - Composition: A balanced mix of concise and complex, extended formula expressions
  - Aim: To train robust, high-accuracy MER models, enhancing recognition precision and generalization
- **UniMER Test Set:**
  - Total Samples: 23,757, categorized into four types of expressions:
    - Simple Printed Expressions (SPE): 6,762 samples
    - Complex Printed Expressions (CPE): 5,921 samples
    - Screen Capture Expressions (SCE): 4,742 samples
    - Handwritten Expressions (HWE): 6,332 samples
  - Purpose: To provide a thorough evaluation of MER models across a spectrum of real-world conditions

## Visual Data Samples

![UniMER-Test](https://github.com/opendatalab/UniMERNet/assets/69186975/7301df68-e14c-4607-81bc-b6ee3ba1780b)

## Data Statistics

| **Dataset** | **Subset** | **Source** | **Sample Size** |
|:-----------:|:----------:|:------------------:|:---------------:|
| UniMER-1M   |            | Pix2tex Train      | 158,303         |
|             |            | Arxiv †            | 820,152         |
|             |            | CROHME Train       | 8,834           |
|             |            | HME100K Train ‡    | 74,502          |
| UniMER-Test | SPE        | Pix2tex Validation | 6,762           |
|             | CPE        | Arxiv †            | 5,921           |
|             | SCE        | PDF Screenshot †   | 4,742           |
|             | HWE        | CROHME & HME100K   | 6,332           |

† Indicates data collected, processed, and annotated by our team.

‡ For copyright compliance, please manually download this dataset portion: [HME100K dataset](https://ai.100tal.com/dataset).

## Acknowledgements

We would like to express our gratitude to the creators of the [Pix2tex](https://github.com/lukas-blecher/LaTeX-OCR), [CROHME](https://www.cs.rit.edu/~rlaz/files/CROHME+TFD%E2%80%932019.pdf), and [HME100K](https://github.com/tal-tech/SAN) datasets. Their foundational work has significantly contributed to the development of the UniMER dataset.

A new metric for evaluating this dataset is presented in [CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation](https://huggingface.co/papers/2409.03643).

## Citations

```text
@misc{wang2024unimernet,
  title={UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition},
  author={Bin Wang and Zhuangcheng Gu and Chao Xu and Bo Zhang and Botian Shi and Conghui He},
  year={2024},
  eprint={2404.15254},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{conghui2022opendatalab,
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Wang, Bin and Xu, Chao and Lin, Dahua},
  title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
  howpublished = {\url{https://opendatalab.com}},
  year={2022}
}
```

Provider:
maas
Created:
2025-08-15
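Collected Summary
Dataset Introduction
Background and Challenges
Background Overview
UniMER_Dataset is a large-scale dataset for Mathematical Expression Recognition (MER). It contains 1,061,791 LaTeX-image training pairs and 23,757 test samples; the test samples fall into four types: simple printed, complex printed, screen capture, and handwritten expressions. The dataset is intended for training and evaluating MER models under real-world conditions, and its data comes from diverse sources, including public datasets and data collected by the team itself.
The summary above was collected and generated by 遇见数据集.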
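The totals quoted in the summary can be cross-checked against the per-source counts in the Data Statistics table. The following is a minimal sketch of that arithmetic; the counts are copied from the table, while the variable names and the `shares` helper are illustrative assumptions and not part of any UniMER tooling.

```python
# Per-source sample counts copied from the Data Statistics table.
# All names below are hypothetical; they are not part of the dataset's tooling.
TRAIN_SOURCES = {
    "Pix2tex Train": 158_303,
    "Arxiv": 820_152,
    "CROHME Train": 8_834,
    "HME100K Train": 74_502,
}
TEST_SUBSETS = {
    "SPE": 6_762,
    "CPE": 5_921,
    "SCE": 4_742,
    "HWE": 6_332,
}

# Totals as reported in the dataset description.
REPORTED_TRAIN_TOTAL = 1_061_791
REPORTED_TEST_TOTAL = 23_757


def shares(counts):
    """Return each entry's percentage of the overall total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


train_total = sum(TRAIN_SOURCES.values())
test_total = sum(TEST_SUBSETS.values())
train_shares = shares(TRAIN_SOURCES)
test_shares = shares(TEST_SUBSETS)

if __name__ == "__main__":
    print("train total matches:", train_total == REPORTED_TRAIN_TOTAL)
    print("test total matches:", test_total == REPORTED_TEST_TOTAL)
    for name, pct in train_shares.items():
        print(f"{name}: {pct}% of training samples")
```

Both sums agree with the reported totals. The breakdown also shows that the Arxiv portion makes up roughly three quarters of UniMER-1M, while the four test subsets are of comparable size.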