lightonai/LightOnOCR-mix-0126
Source: Hugging Face. Updated: 2026-01-26. Indexed: 2026-02-07.
Download link:
https://hf-mirror.com/datasets/lightonai/LightOnOCR-mix-0126
Description:
LightOnOCR-mix-0126 is a large-scale OCR training dataset built via distillation, using a strong vision–language model to produce naturally ordered full-page transcriptions (Markdown with LaTeX math spans and HTML tables). The dataset is designed as supervision for end-to-end OCR / document-understanding models that aim to output clean, human-readable text in a consistent format. Each row corresponds to a single page with text transcription, markup for structure (headers, lists, tables) and math (LaTeX inside math spans), and lightweight metadata. The dataset supports multiple languages and does not include the source PDFs.
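To make the transcription format concrete, here is a minimal sketch of how one page's text might be split into its structural parts (Markdown headers, HTML tables, LaTeX math spans). The `$...$` and `$$...$$` math delimiters, the helper name `summarize_page`, and the sample page are all assumptions made for illustration; none of them is taken from the dataset card, so check actual rows before relying on them.

```python
import re

# Assumed conventions (not confirmed by the dataset card): math spans use
# $...$ or $$...$$ delimiters, and tables appear as literal <table>...</table>.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)
MATH_RE = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$", re.DOTALL)
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$", re.MULTILINE)


def summarize_page(transcription: str) -> dict:
    """Return the headers, HTML tables, and LaTeX math spans found in one page."""
    tables = TABLE_RE.findall(transcription)
    # Strip tables first so a "$" inside a table cell is not counted as math.
    without_tables = TABLE_RE.sub("", transcription)
    # Each match is a (display, inline) tuple; exactly one side is non-empty.
    math = [display or inline for display, inline in MATH_RE.findall(without_tables)]
    headers = [title.strip() for _, title in HEADER_RE.findall(without_tables)]
    return {"headers": headers, "tables": tables, "math": math}


# Hypothetical sample page, written by hand for illustration only.
sample_page = (
    "# Results\n\n"
    "The loss is $L = -\\log p$ per token.\n\n"
    "<table><tr><td>A</td><td>1</td></tr></table>\n"
)
summary = summarize_page(sample_page)
```

A regex pass like this is only suitable for quick inspection of rows; nested tables or escaped dollar signs would need a real Markdown/HTML parser.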
Provider:
lightonai



