lightonai/LightOnOCR-bbox-mix-0126
Source: Hugging Face. Updated 2026-01-21; indexed 2026-02-07.
Download link:
https://hf-mirror.com/datasets/lightonai/LightOnOCR-bbox-mix-0126
Description:
LightOnOCR-bbox-mix-0126 is a large-scale OCR training dataset that includes layout information. It was built via distillation: a strong vision-language model is prompted to produce naturally ordered full-page transcriptions (Markdown with LaTeX math spans and HTML tables) from rendered document pages. The dataset is designed as supervision for end-to-end OCR and document-understanding models that aim to output clean, human-readable text in a consistent format. It contains text transcriptions in natural reading order, markup for structure and math, and lightweight metadata. It does not include the source PDFs, only the text targets and associated metadata. Intended uses are training OCR models, studying OCR robustness to scientific markup, and benchmarking formatting techniques; it is not intended for reconstructing the original PDFs or for high-stakes applications without further validation. Known limitations include occasional hallucinations or formatting errors, and performance that varies across scripts.
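Because the targets are Markdown with LaTeX math spans and HTML tables, and occasional formatting errors are a stated limitation, a cheap structural check on each transcription can help flag suspicious samples before training. The sketch below is only an illustration under assumptions: the function name `check_transcription`, the sample string, and the use of `$` as the math delimiter are hypothetical and are not taken from the dataset card.

```python
import re


def check_transcription(text: str) -> dict:
    """Run cheap structural checks on one Markdown transcription.

    Assumption: math spans are delimited by `$` or `$$`; the dataset
    card does not state the delimiter, so adjust this if it differs.
    """
    # Count opening and closing HTML table tags; the counts should match.
    table_open = len(re.findall(r"<table\b", text, flags=re.IGNORECASE))
    table_close = len(re.findall(r"</table>", text, flags=re.IGNORECASE))
    # Count unescaped dollar signs; an odd count suggests an unclosed span.
    dollars = len(re.findall(r"(?<!\\)\$", text))
    return {
        "tables": table_open,
        "tables_balanced": table_open == table_close,
        "math_delimiters_balanced": dollars % 2 == 0,
    }


# Hypothetical sample, written for this example (not a dataset record).
sample = (
    "# Results\n\n"
    "The energy is $E = mc^2$.\n\n"
    "<table><tr><td>a</td><td>1</td></tr></table>\n"
)
report = check_transcription(sample)
print(report)
```

Checks like these only catch structural slips such as an unclosed table or math span; they cannot detect hallucinated content, which needs comparison against the page image.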
Provider:
lightonai



