harryrobert/latex-ocr-aug
Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/harryrobert/latex-ocr-aug
Description:
---
license: mit
task_categories:
- image-to-text
language:
- en
tags:
- latex
- ocr
- math
- formula-recognition
- augmentation
size_categories:
- 1M<n<10M
---
# latex-ocr-aug
A large-scale LaTeX OCR dataset with multiple augmentation variants, designed for training image-to-LaTeX models. The training split comes in five subsets (one unaugmented, four augmented), each containing **1,389,527** samples (about 1.39M). The validation and test splits contain 77,195 samples each.
## Dataset Summary
| Split | Subset | Samples | Shards |
|------------|--------------|-----------|--------|
| train | raw | 1,389,527 | 28 |
| train | light | 1,389,527 | 28 |
| train | heavy | 1,389,527 | 28 |
| train | light_text | 1,389,527 | 56 |
| train | heavy_text | 1,389,527 | 56 |
| validation | — | 77,195 | 2 |
| test | — | 77,195 | 2 |
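The counts in the table above can be cross-checked with a little arithmetic. The sketch below uses only the numbers from the table; the variable names are illustrative. Reading the five training subsets as different renderings of the same 1,389,527 formulas is an assumption based on the identical counts, not something the card states.

```python
# Row and shard counts copied from the summary table.
TRAIN_ROWS_PER_SUBSET = 1_389_527
TRAIN_SHARDS = {"raw": 28, "light": 28, "heavy": 28, "light_text": 56, "heavy_text": 56}
EVAL_ROWS = {"validation": 77_195, "test": 77_195}

# Total rows stored across all training subsets plus validation and test.
total_rows = TRAIN_ROWS_PER_SUBSET * len(TRAIN_SHARDS) + sum(EVAL_ROWS.values())

# Average number of rows per shard for each training subset.
rows_per_shard = {name: TRAIN_ROWS_PER_SUBSET / n for name, n in TRAIN_SHARDS.items()}

# Validation share relative to one training subset plus both held-out splits
# (assumes the five subsets share the same underlying formulas).
val_fraction = EVAL_ROWS["validation"] / (TRAIN_ROWS_PER_SUBSET + sum(EVAL_ROWS.values()))

print(f"total rows: {total_rows:,}")
print(f"validation fraction: {val_fraction:.3f}")
for name, avg in rows_per_shard.items():
    print(f"{name}: about {avg:,.0f} rows per shard")
```

The total comes to 7,102,025 rows, which is consistent with the `1M<n<10M` size category in the metadata, and the held-out splits work out to roughly 5% each under the stated assumption.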
## Dataset Structure
```
latex-ocr-aug/
├── train/
│ ├── raw/ # No augmentation — original rendered formula images
│ ├── light/ # Light augmentation (mild noise, slight blur, small rotation)
│ ├── heavy/ # Heavy augmentation (strong distortion, shadow, perspective)
│ ├── light_text/ # Light augmentation + surrounding text context
│ └── heavy_text/ # Heavy augmentation + surrounding text context
├── validation/ # Held-out validation split
└── test/ # Held-out test split
```
Each parquet file contains the following columns:
| Column | Type | Description |
|-----------|--------|------------------------------------------|
| `image` | bytes | PNG image of the rendered LaTeX formula |
| `latex` | string | Ground-truth LaTeX source string |
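Since the `image` column is described as PNG bytes, image dimensions can be read from the PNG header without decoding the full image, which may be handy for filtering by size. The following is a minimal sketch using only the standard library; `png_size` is a hypothetical helper name, and it assumes the column value is a raw `bytes` object (if your loader returns a decoded image object instead, this helper does not apply).

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> Tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of a PNG byte string."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type,
    # then the IHDR data, which starts with width and height as big-endian uint32.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG byte string")
    if data[12:16] != b"IHDR":
        raise ValueError("PNG does not start with an IHDR chunk")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


# Exercise the function on a hand-built header for a 320x64 image
# (signature + IHDR length + chunk type + width/height only).
header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 320, 64)
print(png_size(header))  # (320, 64)
```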
## Augmentation Levels
- **raw**: Clean renders with no augmentation. Use for baseline evaluation.
- **light**: Mild augmentations — slight blur, small brightness/contrast jitter, minimal rotation. Suitable for general training.
- **heavy**: Strong augmentations — heavy distortion, shadows, perspective warp, ink simulation. Designed for robustness.
- **light_text / heavy_text**: Same as light/heavy but the formula image is embedded inside a larger document-like context with surrounding text, simulating real-world document scanning.
## Usage
### Load a specific subset
```python
from datasets import load_dataset
# Load raw train split
ds = load_dataset("harryrobert/latex-ocr-aug", data_dir="train/raw", split="train")
# Load heavy augmentation
ds = load_dataset("harryrobert/latex-ocr-aug", data_dir="train/heavy", split="train")
# Load validation (data_dir selects the held-out data; split="train" is kept as in the calls above)
ds = load_dataset("harryrobert/latex-ocr-aug", data_dir="validation", split="train")
```
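The calls above differ only in `data_dir`, so they can be wrapped in a small helper. This is a sketch rather than part of the dataset's tooling: `data_dir_for` and `load_subset` are hypothetical names, the directory names are taken from the structure listing, and `split="train"` simply mirrors the calls shown above. Passing `streaming=True` to `load_dataset` iterates over shards lazily instead of downloading every shard first, which may be preferable at this dataset size.

```python
REPO_ID = "harryrobert/latex-ocr-aug"
TRAIN_SUBSETS = ("raw", "light", "heavy", "light_text", "heavy_text")
EVAL_SPLITS = ("validation", "test")


def data_dir_for(split, subset=None):
    """Map a split (and, for train, a subset) to its directory in the repo."""
    if split == "train":
        if subset not in TRAIN_SUBSETS:
            raise ValueError(f"train subset must be one of {TRAIN_SUBSETS}, got {subset!r}")
        return f"train/{subset}"
    if split in EVAL_SPLITS:
        if subset is not None:
            raise ValueError(f"{split} has no subsets")
        return split
    raise ValueError(f"unknown split {split!r}")


def load_subset(split, subset=None, streaming=True):
    """Load one directory of the dataset, streaming by default."""
    # Imported lazily so the path helper above has no third-party dependency.
    from datasets import load_dataset

    # split="train" mirrors the usage shown in the card for every directory.
    return load_dataset(
        REPO_ID,
        data_dir=data_dir_for(split, subset),
        split="train",
        streaming=streaming,
    )
```

For example, `load_subset("train", "light_text")` would stream the light-augmentation subset with text context, and `load_subset("test")` the test split.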
### Iterate samples
```python
for sample in ds:
    image = sample["image"]  # PIL image or bytes
    latex = sample["latex"]  # LaTeX string
```
## Intended Use
This dataset is intended for training and evaluating sequence-to-sequence models that convert formula images to LaTeX, such as:
- Encoder-decoder transformers (e.g., TrOCR, Donut, custom ViT + decoder)
- Autoregressive decoder models fine-tuned on formula recognition
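For sequence-to-sequence training, the `latex` string has to be turned into target tokens. Pretrained models such as TrOCR and Donut ship their own subword tokenizers; for a custom decoder, one common alternative is a simple LaTeX-aware tokenizer. The sketch below is an illustration under that assumption, not the tokenization used by the dataset author, and `tokenize_latex` is a hypothetical name.

```python
import re

# One token is either a control word (\frac, \alpha), an escaped single
# character (\{, \\), or any other single non-whitespace character.
_TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\S")


def tokenize_latex(source):
    """Split a LaTeX string into a list of tokens, discarding whitespace."""
    return _TOKEN_RE.findall(source)


print(tokenize_latex(r"\frac{a}{b} + x^{2}"))
# ['\\frac', '{', 'a', '}', '{', 'b', '}', '+', 'x', '^', '{', '2', '}']
```

Keeping control words such as `\frac` as single tokens keeps target sequences shorter than character-level splitting would, at the cost of a larger vocabulary.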
The multiple augmentation variants allow training with curriculum learning (start on `raw` or `light`, gradually introduce `heavy`) or multi-task sampling across subsets.
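One way to realise such a curriculum is to change the sampling weights over subsets as training progresses. The schedule below is a minimal sketch with made-up numbers: `curriculum_weights`, the linear ramp, and the `max_heavy=0.6` cap are all illustrative assumptions, not a recipe from the dataset author.

```python
def curriculum_weights(epoch, total_epochs, max_heavy=0.6):
    """Return sampling weights over raw/light/heavy for a given epoch.

    Training starts on an even raw/light mix and shifts linearly toward
    `heavy`, which reaches `max_heavy` in the final epoch.
    """
    if total_epochs < 1 or not 0 <= epoch < total_epochs:
        raise ValueError("epoch must satisfy 0 <= epoch < total_epochs")
    # progress runs from 0.0 (first epoch) to 1.0 (last epoch).
    progress = epoch / (total_epochs - 1) if total_epochs > 1 else 1.0
    heavy = max_heavy * progress
    # The clean subset is phased out as progress increases.
    raw = (1.0 - heavy) * (1.0 - progress) * 0.5
    light = 1.0 - heavy - raw
    return {"raw": raw, "light": light, "heavy": heavy}


for epoch in range(5):
    weights = curriculum_weights(epoch, total_epochs=5)
    print(epoch, {name: round(w, 3) for name, w in weights.items()})
```

The resulting weights, listed in the same order as the loaded subsets, can be passed as `probabilities` to `datasets.interleave_datasets` to draw a mixed stream for each epoch. The same idea extends to `light_text` and `heavy_text` by adding entries to the dictionary.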
## License
MIT
Provider:
harryrobert



