
harryrobert/latex-ocr-aug

Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/harryrobert/latex-ocr-aug
Description:
---
license: mit
task_categories:
- image-to-text
language:
- en
tags:
- latex
- ocr
- math
- formula-recognition
- augmentation
size_categories:
- 1M<n<10M
---

# latex-ocr-aug

A large-scale LaTeX OCR dataset with multiple augmentation variants, designed for training image-to-LaTeX models. Contains over **1.38M** training samples in each of five augmentation subsets, plus validation and test splits.

## Dataset Summary

| Split      | Subset       | Samples   | Shards |
|------------|--------------|-----------|--------|
| train      | raw          | 1,389,527 | 28     |
| train      | light        | 1,389,527 | 28     |
| train      | heavy        | 1,389,527 | 28     |
| train      | light_text   | 1,389,527 | 56     |
| train      | heavy_text   | 1,389,527 | 56     |
| validation | -            | 77,195    | 2      |
| test       | -            | 77,195    | 2      |

## Dataset Structure

```
latex-ocr-aug/
├── train/
│   ├── raw/          # No augmentation: original rendered formula images
│   ├── light/        # Light augmentation (mild noise, slight blur, small rotation)
│   ├── heavy/        # Heavy augmentation (strong distortion, shadow, perspective)
│   ├── light_text/   # Light augmentation + surrounding text context
│   └── heavy_text/   # Heavy augmentation + surrounding text context
├── validation/       # Held-out validation split
└── test/             # Held-out test split
```

Each parquet file contains the following columns:

| Column    | Type   | Description                              |
|-----------|--------|------------------------------------------|
| `image`   | bytes  | PNG image of the rendered LaTeX formula  |
| `latex`   | string | Ground-truth LaTeX source string         |

## Augmentation Levels

- **raw**: Clean renders with no augmentation. Use for baseline evaluation.
- **light**: Mild augmentations: slight blur, small brightness/contrast jitter, minimal rotation. Suitable for general training.
- **heavy**: Strong augmentations: heavy distortion, shadows, perspective warp, ink simulation. Designed for robustness.
- **light_text / heavy_text**: Same as light/heavy, but the formula image is embedded inside a larger document-like context with surrounding text, simulating real-world document scanning.

## Usage

### Load a specific subset

```python
from datasets import load_dataset

# Load raw train split
ds = load_dataset("harryrobert/latex-ocr-aug", data_dir="train/raw", split="train")

# Load heavy augmentation
ds = load_dataset("harryrobert/latex-ocr-aug", data_dir="train/heavy", split="train")

# Load validation
ds = load_dataset("harryrobert/latex-ocr-aug", data_dir="validation", split="train")
```

### Iterate samples

```python
for sample in ds:
    image = sample["image"]  # PIL image or bytes
    latex = sample["latex"]  # LaTeX string
```

## Intended Use

This dataset is intended for training and evaluating sequence-to-sequence models that convert formula images to LaTeX, such as:

- Encoder-decoder transformers (e.g., TrOCR, Donut, custom ViT + decoder)
- Autoregressive decoder models fine-tuned on formula recognition

The multiple augmentation variants allow training with curriculum learning (start on `raw` or `light`, gradually introduce `heavy`) or multi-task sampling across subsets.

## License

MIT
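
## Example: curriculum sampling across subsets

The Intended Use section mentions curriculum learning (start on `raw` or `light`, gradually introduce the harder subsets). The following is a minimal sketch of one way to do that, not a schedule prescribed by the dataset. The function names `curriculum_weights` and `pick_subset`, the linear phase-in, and the choice to treat both `*_text` subsets as "hard" are all assumptions made for illustration.

```python
import random
from typing import Dict

# Subset names as listed in the dataset card.
SUBSETS = ["raw", "light", "heavy", "light_text", "heavy_text"]

# Assumption: raw and light count as "easy"; everything else is phased in.
EASY = {"raw", "light"}


def curriculum_weights(progress: float) -> Dict[str, float]:
    """Return per-subset sampling probabilities for training progress in [0, 1].

    Hypothetical schedule: easy subsets always have unnormalized weight 1,
    hard subsets have weight equal to `progress`, so sampling is restricted
    to raw/light at the start and becomes uniform over all five at the end.
    """
    progress = min(max(progress, 0.0), 1.0)  # clamp out-of-range input
    raw_weights = {name: (1.0 if name in EASY else progress) for name in SUBSETS}
    total = sum(raw_weights.values())
    return {name: w / total for name, w in raw_weights.items()}


def pick_subset(progress: float, rng: random.Random) -> str:
    """Draw one subset name according to the current curriculum weights."""
    weights = curriculum_weights(progress)
    return rng.choices(SUBSETS, weights=[weights[s] for s in SUBSETS], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for progress in (0.0, 0.5, 1.0):
        print(progress, curriculum_weights(progress), pick_subset(progress, rng))
```

The chosen name can be passed as `data_dir=f"train/{name}"` to the `load_dataset` call shown under Usage. Alternatively, the probabilities could be handed to `datasets.interleave_datasets` via its `probabilities` argument to mix already-loaded subsets; whether per-batch picking or interleaving works better depends on the training setup.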
Provided by:
harryrobert