PadishahIIIXXX/latex-ocr-dataset
Hugging Face · Updated 2025-12-09 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/PadishahIIIXXX/latex-ocr-dataset
Description:
---
dataset_info:
- config_name: plain
  splits:
  - name: train
  - name: validation
- config_name: styled
  splits:
  - name: train
  - name: validation
configs:
- config_name: plain
  data_files:
  - split: train
    path: plain/train-*
  - split: validation
    path: plain/validation-*
- config_name: styled
  data_files:
  - split: train
    path: styled/train-*
  - split: validation
    path: styled/validation-*
license: cc-by-4.0
task_categories:
- image-to-text
size_categories:
- 1M<n<10M
---
# Synthetic LaTeX OCR Dataset



## Overview
This dataset contains synthetically generated images of LaTeX formulas, intended to augment training data for LaTeX OCR models. It applies style enrichment to formulas drawn from existing real-world LaTeX datasets, producing varied visual renderings of the same mathematics through PDF rendering and font styling.
## Dataset Structure
```
synth/
├── plain/
│   ├── train/
│   │   ├── images/
│   │   │   ├── train_0000000.png
│   │   │   ├── train_0000001.png
│   │   │   └── ...
│   │   └── metadata.jsonl
│   └── validation/
│       ├── images/
│       └── metadata.jsonl
├── styled/
│   ├── train/
│   │   ├── images/
│   │   └── metadata.jsonl
│   └── validation/
│       ├── images/
│       └── metadata.jsonl
└── README.md
```
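The metadata at the top of this card declares two configs, `plain` and `styled`, each with a `train` and a `validation` split. The following is a minimal sketch of loading one of them with the Hugging Face `datasets` library. The helper name `load_split` is hypothetical, and the column names of the returned dataset are not documented in this card, so inspect `ds.features` before relying on them.

```python
VALID_CONFIGS = ("plain", "styled")
VALID_SPLITS = ("train", "validation")
REPO_ID = "PadishahIIIXXX/latex-ocr-dataset"


def load_split(config: str, split: str):
    """Load one config/split pair from the Hub (hypothetical helper).

    Requires the third-party `datasets` package and network access.
    """
    if config not in VALID_CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {VALID_CONFIGS}")
    if split not in VALID_SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {VALID_SPLITS}")
    # Imported lazily so the argument checks work without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

A call such as `ds = load_split("styled", "validation")` would then download only the files matched by that config and split.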
### Data Format
Each `metadata.jsonl` file contains one JSON object per line:
```json
{"text": "x^2 + y^2 = z^2", "file_name": "images/train_0000001.png"}
{"text": "\\frac{a}{b} + \\frac{c}{d}", "file_name": "images/train_0000002.png"}
{"text": "\\int_{0}^{\\infty} e^{-x} dx", "file_name": "images/train_0000003.png"}
```
**Fields:**
- `text` (str): LaTeX formula string
- `file_name` (str): Relative path to the image file
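As an illustration of the format, the sketch below reads a `metadata.jsonl` file and pairs each formula with its image path. It assumes a local copy laid out as in the directory tree above; the function name `iter_samples` is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_samples(split_dir: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (image_path, latex) pairs from <split_dir>/metadata.jsonl."""
    with open(split_dir / "metadata.jsonl", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # `file_name` is relative to the directory holding metadata.jsonl.
            yield split_dir / record["file_name"], record["text"]
```

Note that JSON escaping doubles the backslashes on disk (`"\\frac{a}{b}"`); after `json.loads` the `text` value is ordinary LaTeX (`\frac{a}{b}`).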
## Dataset Statistics
### Plain Dataset
| Split | Samples | Avg Length | Median Length | Min Length | Max Length | Std Length |
|-------|---------|------------|---------------|------------|------------|------------|
| Train | 1,035,010 | 146.6 chars | 69.0 chars | 5 chars | 10,916 chars | 225.8 chars |
| Validation | 23,868 | 259.2 chars | 140.0 chars | 5 chars | 5,744 chars | 324.1 chars |
| **Total** | **1,058,878** | **148.5 chars** | **70.0 chars** | **5 chars** | **10,916 chars** | **228.4 chars** |
### Styled Dataset
| Split | Samples | Avg Length | Median Length | Min Length | Max Length | Std Length |
|-------|---------|------------|---------------|------------|------------|------------|
| Train | 139,845 | 115.9 chars | 82.0 chars | 13 chars | 2,765 chars | 118.2 chars |
| Validation | 2,955 | 171.0 chars | 137.0 chars | 14 chars | 1,318 chars | 136.0 chars |
| **Total** | **142,800** | **117.1 chars** | **83.0 chars** | **13 chars** | **2,765 chars** | **118.9 chars** |
### Combined Statistics
- **Total Plain Dataset**: 1,058,878 samples
- **Total Styled Dataset**: 142,800 samples
- **Grand Total**: 1,201,678 samples
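For readers who want to recompute the length columns on their own subset, here is a minimal sketch using only the standard library. It measures length as the number of characters in the `text` field. The card does not say whether "Std Length" is a population or a sample standard deviation, so the use of `pstdev` here is an assumption.

```python
import statistics


def length_stats(formulas):
    """Summarize formula lengths (in characters) for a list of LaTeX strings."""
    lengths = [len(f) for f in formulas]
    return {
        "samples": len(lengths),
        "avg": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
        # Assumption: population standard deviation.
        "std": statistics.pstdev(lengths),
    }
```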
## Dataset Constitution
This synthetic dataset is generated from the following sources:
### 1. Source Datasets
#### UniMER-1M
- **Description**: Large-scale mathematical expression recognition dataset
- **Source**: UniMER-1M training set
- **Formulas**: `XXX,XXX` mathematical expressions
- **Coverage**: Diverse mathematical notation including algebra, calculus, geometry, and advanced mathematics
#### LaTeX-OCR Dataset
- **Description**: HuggingFace LaTeX OCR dataset
- **Source**: `lukbl/LaTeX-OCR-dataset`
- **Split**: Training split
- **Formulas**: `XXX,XXX` LaTeX expressions
- **Coverage**: Academic papers, textbooks, and research documents
### 2. Synthetic Plain Dataset
**Generation Method**: PDF-style rendering without font styling
- **Formula Source**: Combined UniMER-1M + LaTeX-OCR datasets
- **Processing**:
  - Normalized legacy LaTeX commands (`\bf` → `\mathbf`, etc.)
  - Rendered using XeLaTeX with high-fidelity PDF rendering
  - Rasterized at 150 DPI for screenshot-style images
  - RGB format with standard preprocessing
- **Train/Validation Split**: 90% / 10% (deterministic, seed=42)
- **Usage in Training**: 20% random subset used in mixed training dataset
**Characteristics:**
- Clean, PDF-quality rendering
- Consistent font style (default LaTeX fonts)
- Minimal visual variation
- Serves as baseline/anchor for style robustness
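The legacy-command normalization mentioned under Processing could look roughly like the sketch below. Only the `\bf` → `\mathbf` pair is stated in this card; the other pairs in `LEGACY_TO_MATH` are assumptions about what "etc." covers. The regex handles simple groups such as `{\bf x}` and leaves groups containing nested braces untouched, so it is an approximation, not the actual pipeline code.

```python
import re

# Only `\bf` -> `\mathbf` is stated in the card; the other pairs are assumed.
LEGACY_TO_MATH = {
    "bf": "mathbf",
    "it": "mathit",
    "rm": "mathrm",
    "sf": "mathsf",
    "tt": "mathtt",
    "cal": "mathcal",
}

# Matches `{\bf ...}` where the body has no nested braces. The lookahead keeps
# longer command names such as `\bfseries` from being mistaken for `\bf`.
_LEGACY_GROUP = re.compile(
    r"\{\s*\\(" + "|".join(LEGACY_TO_MATH) + r")(?![A-Za-z])\s*([^{}]*?)\s*\}"
)


def normalize_legacy(formula: str) -> str:
    """Rewrite `{\\bf x}`-style switches as `\\mathbf{x}`-style macros."""

    def repl(match):
        return "\\" + LEGACY_TO_MATH[match.group(1)] + "{" + match.group(2) + "}"

    return _LEGACY_GROUP.sub(repl, formula)
```

For example, `normalize_legacy(r"{\bf x} + {\cal L}")` returns `\mathbf{x} + \mathcal{L}`.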
### 3. Synthetic Styled Dataset
**Generation Method**: Style-enriched PDF rendering with font macros
- **Formula Source**: Combined UniMER-1M + LaTeX-OCR datasets
- **Style Injection Strategy** (Section 3.2):
  - Random injection of `\mathxx` font macros (`\mathbf`, `\mathbb`, `\mathcal`, `\mathit`, `\mathrm`, `\mathsf`, `\mathtt`, `\mathfrak`, `\mathscr`)
  - Semantic heuristics for variable types:
    - Sets (R, C, N, Z, Q): 35% `\mathbb`, 15% `\mathcal`, 50% plain
    - Vectors (x, y, z, u, v, w, A, B, M): 30% `\mathbf`, 10% `\mathit`, 60% plain
    - Operators (d, e, i): 20% `\mathrm`, 80% plain
    - Generic: 2% each specialty font, 90% plain
  - Global cap: ~40% of identifiers styled per formula
  - Consistency rule: Same variable gets same style throughout formula
- **Styled Formula Ratio**: ~10-50% of original formulas produce distinct styled variants
- **Rendering**: XeLaTeX with amsmath, amssymb, mathrsfs, amsfonts packages
- **Anti-hallucination**: Visual contrast verification between plain and styled versions
- **Train/Validation Split**: 90% / 10% (deterministic, seed=42)
- **Usage in Training**: 100% used in mixed training dataset
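The style injection strategy above can be sketched as follows. This is a simplified illustration under several assumptions, not the pipeline's code: it tokenizes with a regex instead of a `pylatexenc` AST (so it would also restyle letters inside arguments such as `\text{...}`), it leaves "generic" identifiers plain because the card does not list which specialty fonts get 2% each, and it applies the ~40% cap to distinct identifiers, rounded up. The names `HEURISTICS`, `pick_style` and `style_formula` are hypothetical.

```python
import math
import random
import re
import string

# (identifier set, [(macro, probability), ...]); leftover probability = plain.
HEURISTICS = [
    (set("RCNZQ"), [("\\mathbb", 0.35), ("\\mathcal", 0.15)]),     # sets
    (set("xyzuvwABM"), [("\\mathbf", 0.30), ("\\mathit", 0.10)]),  # vectors
    (set("dei"), [("\\mathrm", 0.20)]),                            # operators
]

# Commands, escaped characters, single letters, and runs of everything else.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\\|[A-Za-z]|[^\\A-Za-z]+", re.S)


def pick_style(ident, rng):
    """Draw a font macro for one identifier, or None to keep it plain."""
    for members, choices in HEURISTICS:
        if ident in members:
            roll = rng.random()
            for macro, prob in choices:
                if roll < prob:
                    return macro
                roll -= prob
            return None
    return None  # generic identifiers stay plain in this sketch


def style_formula(formula, rng, cap=0.40):
    """Return a styled variant of `formula` (possibly identical to the input)."""
    tokens = TOKEN_RE.findall(formula)
    idents = sorted({t for t in tokens if len(t) == 1 and t in string.ascii_letters})
    # Consistency rule: one decision per distinct identifier.
    chosen = {}
    for ident in idents:
        macro = pick_style(ident, rng)
        if macro is not None:
            chosen[ident] = macro
    # Global cap: randomly un-style identifiers until the limit is respected.
    limit = math.ceil(cap * len(idents))
    while len(chosen) > limit:
        del chosen[rng.choice(sorted(chosen))]
    return "".join(chosen[t] + "{" + t + "}" if t in chosen else t for t in tokens)
```

Calling `style_formula("x + y = x + R", random.Random(0))` returns either the input unchanged or a variant in which every occurrence of a styled letter carries the same macro. Formulas that come back unchanged would not contribute a distinct styled variant.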
**Characteristics:**
- Rich visual diversity in mathematical typography
- Realistic font variations from academic publications
- Maintains semantic correctness
- Improves model robustness to font styling
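Taken together, the two "Usage in Training" entries describe a mixed training set built from a 20% random subset of the plain data and all of the styled data. A minimal sketch of that mixing step is shown below. The function name `build_training_mix` and the use of seed 42 for the subset are assumptions; the card only states the proportions.

```python
import random


def build_training_mix(plain, styled, plain_fraction=0.20, seed=42):
    """Combine a random subset of `plain` with all of `styled`, then shuffle."""
    rng = random.Random(seed)
    k = round(plain_fraction * len(plain))
    mixed = rng.sample(plain, k) + list(styled)
    rng.shuffle(mixed)
    return mixed
```

If the 20% is taken from the plain train split, this would give about 0.20 × 1,035,010 = 207,002 plain samples alongside the 139,845 styled train samples, for roughly 346,847 mixed training samples.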
## Generation Pipeline
### Technical Details
1. **Formula Parsing**: Uses `pylatexenc` for safe LaTeX parsing and AST manipulation
2. **Style Injection**: Probability-based identifier selection with semantic heuristics
3. **Rendering**: XeLaTeX compilation with standalone document class
4. **Rasterization**: PDF to PNG conversion at 150 DPI via `pdf2image`
5. **Quality Control**: Skips formulas < 5 characters and failed renders
6. **Splitting**: Deterministic random split (seed=42) for reproducibility
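Steps 5 and 6 can be sketched with the standard library alone. The example below drops formulas shorter than 5 characters and then performs a seeded shuffle-and-slice split. It does not model the removal of failed renders, and the exact shuffling method used by the pipeline is not stated, so treat `filter_and_split` as a hypothetical illustration of a deterministic split with seed 42.

```python
import random


def filter_and_split(formulas, val_fraction=0.10, seed=42, min_chars=5):
    """Drop very short formulas, then split deterministically into train/val."""
    kept = [f for f in formulas if len(f) >= min_chars]
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    rng.shuffle(kept)
    n_val = int(len(kept) * val_fraction)
    return kept[n_val:], kept[:n_val]
```

Because the generator is seeded, running the function twice on the same input list yields the same train and validation lists.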
### Rendering Template
```latex
\documentclass[preview,border=2pt]{standalone}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathrsfs} % for \mathscr
\usepackage{amsfonts}
\begin{document}
$$
{formula}
$$
\end{document}
```
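A sketch of how this template might be filled and rendered is given below. It assumes `xelatex` and poppler are installed and uses the third-party `pdf2image` package named in the pipeline steps. The function names, the temporary-directory handling and the XeLaTeX flags are illustrative choices, not the pipeline's actual code.

```python
import subprocess
import tempfile
from pathlib import Path

TEMPLATE = r"""\documentclass[preview,border=2pt]{standalone}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathrsfs} % for \mathscr
\usepackage{amsfonts}
\begin{document}
$$
{formula}
$$
\end{document}
"""


def build_document(formula: str) -> str:
    """Insert the formula into the template."""
    # str.format would trip over the literal LaTeX braces, so use replace.
    return TEMPLATE.replace("{formula}", formula)


def render_png(formula: str, out_png: Path, dpi: int = 150) -> bool:
    """Compile with XeLaTeX and rasterize page 1; return False on failure."""
    from pdf2image import convert_from_path  # third-party; needs poppler

    with tempfile.TemporaryDirectory() as tmp:
        tex = Path(tmp) / "formula.tex"
        tex.write_text(build_document(formula), encoding="utf-8")
        result = subprocess.run(
            ["xelatex", "-interaction=nonstopmode", "-halt-on-error",
             f"-output-directory={tmp}", str(tex)],
            capture_output=True,
        )
        pdf = tex.with_suffix(".pdf")
        if result.returncode != 0 or not pdf.exists():
            return False  # failed renders are skipped, as in the QC step
        pages = convert_from_path(str(pdf), dpi=dpi)
        pages[0].convert("RGB").save(out_png)
        return True
```

A call such as `render_png(r"\frac{a}{b}", Path("sample.png"))` would write an RGB PNG at 150 DPI when compilation succeeds.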
## Citation
If you use this synthetic dataset in your research, please cite:
```bibtex
@dataset{synthetic_latex_ocr_2024,
title={Synthetic LaTeX OCR Dataset with Style Enrichment},
author={Your Name},
year={2024},
publisher={GitHub},
howpublished={\url{https://github.com/your-repo}},
note={Generated from UniMER-1M and LaTeX-OCR datasets}
}
```
### Source Dataset Citations
```bibtex
@inproceedings{unimer2024,
title={UniMER: Universal Mathematical Expression Recognition},
author={UniMER Authors},
booktitle={Conference},
year={2024}
}
@dataset{latex_ocr_dataset,
title={LaTeX-OCR Dataset},
author={lukbl},
year={2023},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/datasets/lukbl/LaTeX-OCR-dataset}}
}
```
## License
This dataset is released under the CC BY 4.0 license, as declared in the card metadata (`license: cc-by-4.0`). See source dataset licenses for additional restrictions:
- UniMER-1M: [Original License]
- LaTeX-OCR Dataset: [Original License]
The synthetic generation code and methodology are provided under the MIT License.
## Changelog
### Version 1.0.0 (Initial Release)
- ✨ Initial release with plain and styled variants
- ✨ HuggingFace JSONL format
- ✨ Train/validation splits (90/10)
- ✨ Quality-controlled generation pipeline
- 📊 Total samples: 1,201,678 (Plain: 1,058,878 | Styled: 142,800)
Provider:
PadishahIIIXXX



