
PadishahIIIXXX/latex-ocr-dataset

Hugging Face | Updated 2025-12-09 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/PadishahIIIXXX/latex-ocr-dataset
Description:
---
dataset_info:
- config_name: plain
  splits:
  - name: train
  - name: validation
- config_name: styled
  splits:
  - name: train
  - name: validation
configs:
- config_name: plain
  data_files:
  - split: train
    path: plain/train-*
  - split: validation
    path: plain/validation-*
- config_name: styled
  data_files:
  - split: train
    path: styled/train-*
  - split: validation
    path: styled/validation-*
license: cc-by-4.0
task_categories:
- image-to-text
size_categories:
- 1M<n<10M
---

# Synthetic LaTeX OCR Dataset

![Dataset Type](https://img.shields.io/badge/type-synthetic-blue) ![Format](https://img.shields.io/badge/format-HuggingFace%20JSONL-green) ![License](https://img.shields.io/badge/license-MIT-yellow)

## Overview

This dataset contains synthetically generated LaTeX formula images designed to augment training data for LaTeX OCR models. The dataset applies style enrichment techniques to existing real-world LaTeX datasets, creating diverse visual representations of mathematical formulas through PDF-rendering and font styling.

## Dataset Structure

```
synth/
├── plain/
│   ├── train/
│   │   ├── images/
│   │   │   ├── train_0000000.png
│   │   │   ├── train_0000001.png
│   │   │   └── ...
│   │   └── metadata.jsonl
│   └── validation/
│       ├── images/
│       └── metadata.jsonl
├── styled/
│   ├── train/
│   │   ├── images/
│   │   └── metadata.jsonl
│   └── validation/
│       ├── images/
│       └── metadata.jsonl
└── README.md
```

### Data Format

Each `metadata.jsonl` file contains one JSON object per line:

```json
{"text": "x^2 + y^2 = z^2", "file_name": "images/train_0000001.png"}
{"text": "\\frac{a}{b} + \\frac{c}{d}", "file_name": "images/train_0000002.png"}
{"text": "\\int_{0}^{\\infty} e^{-x} dx", "file_name": "images/train_0000003.png"}
```

**Fields:**
- `text` (str): LaTeX formula string
- `file_name` (str): Relative path to the image file

## Dataset Statistics

### Plain Dataset

| Split | Samples | Avg Length | Median Length | Min Length | Max Length | Std Length |
|-------|---------|------------|---------------|------------|------------|------------|
| Train | 1,035,010 | 146.6 chars | 69.0 chars | 5 chars | 10,916 chars | 225.8 chars |
| Validation | 23,868 | 259.2 chars | 140.0 chars | 5 chars | 5,744 chars | 324.1 chars |
| **Total** | **1,058,878** | **148.5 chars** | **70.0 chars** | **5 chars** | **10,916 chars** | **228.4 chars** |

### Styled Dataset

| Split | Samples | Avg Length | Median Length | Min Length | Max Length | Std Length |
|-------|---------|------------|---------------|------------|------------|------------|
| Train | 139,845 | 115.9 chars | 82.0 chars | 13 chars | 2,765 chars | 118.2 chars |
| Validation | 2,955 | 171.0 chars | 137.0 chars | 14 chars | 1,318 chars | 136.0 chars |
| **Total** | **142,800** | **117.1 chars** | **83.0 chars** | **13 chars** | **2,765 chars** | **118.9 chars** |

### Combined Statistics

- **Total Plain Dataset**: 1,058,878 samples
- **Total Styled Dataset**: 142,800 samples
- **Grand Total**: 1,201,678 samples

## Dataset Constitution

This synthetic dataset is generated from the following sources:

### 1. Source Datasets

#### UniMER-1M

- **Description**: Large-scale mathematical expression recognition dataset
- **Source**: UniMER-1M training set
- **Formulas**: `XXX,XXX` mathematical expressions
- **Coverage**: Diverse mathematical notation including algebra, calculus, geometry, and advanced mathematics

#### LaTeX-OCR Dataset

- **Description**: HuggingFace LaTeX OCR dataset
- **Source**: `lukbl/LaTeX-OCR-dataset`
- **Split**: Training split
- **Formulas**: `XXX,XXX` LaTeX expressions
- **Coverage**: Academic papers, textbooks, and research documents

### 2. Synthetic Plain Dataset

**Generation Method**: PDF-style rendering without font styling

- **Formula Source**: Combined UniMER-1M + LaTeX-OCR datasets
- **Processing**:
  - Normalized legacy LaTeX commands (`\bf` → `\mathbf`, etc.)
  - Rendered using XeLaTeX with high-fidelity PDF rendering
  - Rasterized at 150 DPI for screenshot-style images
  - RGB format with standard preprocessing
- **Train/Validation Split**: 90% / 10% (deterministic, seed=42)
- **Usage in Training**: 20% random subset used in mixed training dataset

**Characteristics:**
- Clean, PDF-quality rendering
- Consistent font style (default LaTeX fonts)
- Minimal visual variation
- Serves as baseline/anchor for style robustness

### 3. Synthetic Styled Dataset

**Generation Method**: Style-enriched PDF rendering with font macros

- **Formula Source**: Combined UniMER-1M + LaTeX-OCR datasets
- **Style Injection Strategy** (Section 3.2):
  - Random injection of `\mathxx` font macros (`\mathbf`, `\mathbb`, `\mathcal`, `\mathit`, `\mathrm`, `\mathsf`, `\mathtt`, `\mathfrak`, `\mathscr`)
  - Semantic heuristics for variable types:
    - Sets (R, C, N, Z, Q): 35% `\mathbb`, 15% `\mathcal`, 50% plain
    - Vectors (x, y, z, u, v, w, A, B, M): 30% `\mathbf`, 10% `\mathit`, 60% plain
    - Operators (d, e, i): 20% `\mathrm`, 80% plain
    - Generic: 2% each specialty font, 90% plain
  - Global cap: ~40% of identifiers styled per formula
  - Consistency rule: Same variable gets same style throughout formula
- **Styled Formula Ratio**: ~10-50% of original formulas produce distinct styled variants
- **Rendering**: XeLaTeX with amsmath, amssymb, mathrsfs, amsfonts packages
- **Anti-hallucination**: Visual contrast verification between plain and styled versions
- **Train/Validation Split**: 90% / 10% (deterministic, seed=42)
- **Usage in Training**: 100% used in mixed training dataset

**Characteristics:**
- Rich visual diversity in mathematical typography
- Realistic font variations from academic publications
- Maintains semantic correctness
- Improves model robustness to font styling

## Generation Pipeline

### Technical Details

1. **Formula Parsing**: Uses `pylatexenc` for safe LaTeX parsing and AST manipulation
2. **Style Injection**: Probability-based identifier selection with semantic heuristics
3. **Rendering**: XeLaTeX compilation with standalone document class
4. **Rasterization**: PDF to PNG conversion at 150 DPI via `pdf2image`
5. **Quality Control**: Skips formulas < 5 characters and failed renders
6. **Splitting**: Deterministic random split (seed=42) for reproducibility

### Rendering Template

```latex
\documentclass[preview,border=2pt]{standalone}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathrsfs} % for \mathscr
\usepackage{amsfonts}
\begin{document}
$$ {formula} $$
\end{document}
```

## Citation

If you use this synthetic dataset in your research, please cite:

```bibtex
@dataset{synthetic_latex_ocr_2024,
  title={Synthetic LaTeX OCR Dataset with Style Enrichment},
  author={Your Name},
  year={2024},
  publisher={GitHub},
  howpublished={\url{https://github.com/your-repo}},
  note={Generated from UniMER-1M and LaTeX-OCR datasets}
}
```

### Source Dataset Citations

```bibtex
@inproceedings{unimer2024,
  title={UniMER: Universal Mathematical Expression Recognition},
  author={UniMER Authors},
  booktitle={Conference},
  year={2024}
}

@dataset{latex_ocr_dataset,
  title={LaTeX-OCR Dataset},
  author={lukbl},
  year={2023},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/datasets/lukbl/LaTeX-OCR-dataset}}
}
```

## License

This dataset is released under the MIT License. See source dataset licenses for additional restrictions:

- UniMER-1M: [Original License]
- LaTeX-OCR Dataset: [Original License]

The synthetic generation code and methodology are provided under MIT License.

## Changelog

### Version 1.0.0 (Initial Release)

- ✨ Initial release with plain and styled variants
- ✨ HuggingFace JSONL format
- ✨ Train/validation splits (90/10)
- ✨ Quality-controlled generation pipeline
- 📊 Total samples: 1,201,678 (Plain: 1,058,878 | Styled: 142,800)
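
The Style Injection Strategy listed under the Synthetic Styled Dataset (per-category probabilities, the ~40% global cap, and the consistency rule) can be illustrated with a small sketch. This is not the dataset's generation code. It assumes the formula has already been split into tokens (the pipeline itself is described as using `pylatexenc`), it counts the cap over distinct identifiers, and it leaves "Generic" identifiers plain for simplicity. The names `categorize`, `pick_style`, `assign_styles`, and `apply_styles` are hypothetical.

```python
import random

# Identifier groups and probabilities quoted from the Style Injection Strategy list.
# Any probability mass not listed for a category means "leave the identifier plain".
SETS = {"R", "C", "N", "Z", "Q"}
VECTORS = {"x", "y", "z", "u", "v", "w", "A", "B", "M"}
OPERATORS = {"d", "e", "i"}

CATEGORY_STYLES = {
    "set": [("\\mathbb", 0.35), ("\\mathcal", 0.15)],
    "vector": [("\\mathbf", 0.30), ("\\mathit", 0.10)],
    "operator": [("\\mathrm", 0.20)],
}


def categorize(identifier):
    """Map an identifier to one of the semantic categories."""
    if identifier in SETS:
        return "set"
    if identifier in VECTORS:
        return "vector"
    if identifier in OPERATORS:
        return "operator"
    return "generic"  # simplification: generic identifiers stay plain here


def pick_style(identifier, rng):
    """Return a font macro for the identifier, or None to leave it plain."""
    roll = rng.random()
    cumulative = 0.0
    for macro, prob in CATEGORY_STYLES.get(categorize(identifier), []):
        cumulative += prob
        if roll < cumulative:
            return macro
    return None


def assign_styles(identifiers, rng, cap=0.40):
    """Choose one macro per distinct identifier (consistency rule) and style
    at most `cap` of the distinct identifiers (assumed reading of the global cap)."""
    distinct = list(dict.fromkeys(identifiers))  # de-duplicate, keep first-seen order
    budget = int(cap * len(distinct))
    styles = {}
    for name in rng.sample(distinct, len(distinct)):  # visit in random order
        if len(styles) >= budget:
            break
        macro = pick_style(name, rng)
        if macro is not None:
            styles[name] = macro
    return styles


def apply_styles(tokens, styles):
    """Wrap every occurrence of a styled identifier in its macro."""
    return [f"{styles[t]}{{{t}}}" if t in styles else t for t in tokens]


rng = random.Random(0)
tokens = ["x", "+", "y", "=", "x", "\\in", "R"]
identifiers = [t for t in tokens if len(t) == 1 and t.isalpha()]
styles = assign_styles(identifiers, rng)
print(" ".join(apply_styles(tokens, styles)))
```

Because the style is looked up per identifier rather than per occurrence, both occurrences of `x` always render the same way, which is the behavior the consistency rule describes.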
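
As a usage illustration for the Data Format section, the following sketch parses `metadata.jsonl` content into image-path and formula pairs using only the standard library. The helper name `read_metadata` and the directory `synth/plain/train` passed to it are assumptions based on the directory tree shown in the README. The sample records are the ones quoted there.

```python
import json
from pathlib import Path


def read_metadata(jsonl_text, split_dir):
    """Parse metadata.jsonl content into (image_path, latex) pairs.

    `file_name` is relative to the split directory, so it is joined onto it.
    """
    pairs = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        pairs.append((Path(split_dir) / record["file_name"], record["text"]))
    return pairs


# Raw strings keep the JSON escapes (\\frac) exactly as they appear in the file.
sample = "\n".join([
    r'{"text": "x^2 + y^2 = z^2", "file_name": "images/train_0000001.png"}',
    r'{"text": "\\frac{a}{b} + \\frac{c}{d}", "file_name": "images/train_0000002.png"}',
])

pairs = read_metadata(sample, "synth/plain/train")
for image_path, latex in pairs:
    print(image_path.as_posix(), "->", latex)
```

On a real copy of the dataset, the same function could be fed `Path(...).read_text(encoding="utf-8")` for each split's `metadata.jsonl`.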
Provider:
PadishahIIIXXX