andynoodles/omnidoc-ocr-correction-bench
收藏Hugging Face2026-04-07 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/andynoodles/omnidoc-ocr-correction-bench
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: prompt
dtype: string
- name: image
dtype: image
num_examples: 1355
license: cc-by-4.0
task_categories:
- image-to-text
tags:
- ocr
- document-understanding
- markdown
- paddleocr
- omnidocbench
size_categories:
- 1K<n<10K
---
# OmniDoc OCR Correction Bench
A benchmark dataset for evaluating VLMs on OCR error correction and document-to-markdown formatting.
## Overview
Each sample pairs a document image from [OmniDocBench v1.5](https://github.com/opendatalab/OmniDocBench) with a prompt containing PaddleOCR-extracted markdown text. The task is to correct OCR errors and restore proper formatting using the source image as reference.
## Dataset Structure
| Field | Type | Description |
|-------|------|-------------|
| `prompt` | `string` | System prompt with OCR-extracted markdown text to correct |
| `image` | `image` | Source document image |
## Sources
- **Images**: [OmniDocBench v1.5](https://github.com/opendatalab/OmniDocBench) — covers books, papers, exams, newspapers, magazines, PPTs, notes, textbooks, and financial reports
- **OCR extraction**: PaddleOCR with markdown output
- **Prompts**: Custom correction prompts instructing the model to fix OCR errors while preserving document structure
提供机构:
andynoodles



