MaxTex: A Large Scale Benchmark Dataset for Mathematical Formula Recognition

DataCite Commons, updated 2025-04-13, indexed 2024-11-06
Download link:
https://figshare.com/articles/dataset/MaxTex_A_Large_Scale_Benchmark_Dataset_for_Mathematical_Formula_Recognition/27321039/2
Description:
Mathematical formula recognition is an important component of document understanding and has broad application value in academic literature processing and intelligent education. However, existing research mainly focuses on improving model architectures to enhance recognition of relatively simple formulas, overlooking the limitations of existing benchmark datasets in scale, quality, and diversity, which holds back the development of complex formula recognition. This article makes two key contributions. First, it constructs two high-quality datasets: the printed MaxTex (P) and the handwritten, scanned MaxTex (H). MaxTex (P) contains 223,000 samples and avoids symbol redundancy by adopting a unified and efficient morpheme design. MaxTex (H) is moderate in scale, but it optimizes the morpheme space and covers complex mathematical expressions. Both datasets have been strictly controlled in terms of sample size, data quality, and annotation accuracy, providing a more reliable benchmark for model training and evaluation. Second, it designs a character sequence encoding and decoding scheme that addresses two problems, missing spaces in existing LaTeX label sequences and dictionary inflation caused by BPE encoding and decoding, while preserving the semantic information of the original character sequence.
Provider:
figshare
Created:
2024-10-29
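
The description mentions missing spaces in LaTeX label sequences without giving details of the proposed scheme. As a rough illustration of why spaces matter, and not a reproduction of the MaxTex encoding, the sketch below shows how stripping whitespace fuses `\alpha x` into the different command `\alphax`, while a plain character-level encoding with an explicit space symbol round-trips the label. The `<sp>` marker and the `encode`/`decode` helpers are hypothetical names chosen for this example.

```python
# Hypothetical marker for a literal space; not taken from the MaxTex release.
SPACE = "<sp>"


def encode(label: str) -> list[str]:
    """Split a LaTeX label into single characters, with spaces made explicit."""
    return [SPACE if ch == " " else ch for ch in label]


def decode(tokens: list[str]) -> str:
    """Invert encode(): join the characters and restore literal spaces."""
    return "".join(" " if tok == SPACE else tok for tok in tokens)


label = r"\alpha x + \frac{a}{b}"

# Dropping whitespace fuses "\alpha" and "x" into the different command "\alphax".
stripped = "".join(label.split())

tokens = encode(label)
# The vocabulary is bounded by the character set, not by learned merges.
vocabulary = sorted(set(tokens))

print(stripped)
print(decode(tokens) == label)
```

A character-level vocabulary stays small because it cannot grow beyond the set of distinct characters in the labels; the trade-off is longer token sequences than a merge-based tokenizer such as BPE would produce.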