prithivMLmods/OCR-Markdown-Dense-200x

Name: prithivMLmods/OCR-Markdown-Dense-200x
Creator: prithivMLmods
Published: 2026-04-21 13:54:30
License: Apache 2.0

Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/prithivMLmods/OCR-Markdown-Dense-200x


Description:

---
license: apache-2.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: response
    dtype: string
  splits:
  - name: train
    num_bytes: 231290589
    num_examples: 200
  download_size: 231146700
  dataset_size: 231290589
task_categories:
- image-to-text
language:
- en
tags:
- ocr
- markdown
- image
size_categories:
- n<1K
---

# **OCR-Markdown-Dense-200x**

## Overview

**OCR-Markdown-Dense-200x** is a synthetic dataset designed for dense document OCR tasks. It focuses on extracting structured **HTML/Markdown representations** from densely packed document pages.

The dataset is generated using outputs from open multimodal models, making it suitable for training and evaluating:

* Image-to-Text models
* Image-to-Markdown/HTML models
* Document understanding systems
* OCR post-processing pipelines

## Dataset Details

* **Task Types**: Image-to-Text, Image-Text-to-Text
* **Format**: Image + HTML/Markdown response
* **Language**: English
* **Size**: ~200 samples
* **License**: Apache 2.0

Each sample contains:

* `image`: A dense document page
* `response`: Corresponding OCR output in HTML/Markdown format

## Usage

```python
from datasets import load_dataset

# Login using: huggingface-cli login
ds = load_dataset("prithivMLmods/OCR-Markdown-Dense-200x")
```

## Clone Repository

```bash
# When prompted for a password, use your Hugging Face access token
git clone https://huggingface.co/datasets/prithivMLmods/OCR-Markdown-Dense-200x
```

Generate an access token from: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

## Applications

This dataset can be used for:

* Training OCR models for structured output
* Improving Markdown/HTML reconstruction from images
* Benchmarking multimodal document models
* Fine-tuning LLMs on document parsing tasks

## Notes

* The dataset is synthetic and generated using multimodal models
* Outputs may contain minor inconsistencies typical of OCR systems
* Suitable for experimentation and research purposes

## License

This dataset is released under the Apache 2.0 License.
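
## Example: Inspecting Responses

The card describes each `response` as "HTML/Markdown" without saying which markup a given sample uses. The sketch below is one possible way to get a rough breakdown before training or evaluation. The helper names (`guess_markup`, `summarize`, `load_responses`) and the regex heuristics are illustrative assumptions, not part of the dataset or of any library; only `load_dataset` and `select_columns` come from the `datasets` package.

```python
import re
from collections import Counter

# Hypothetical heuristics: the dataset card does not define how to tell
# HTML from Markdown, so these patterns are only a rough guess.
_HTML_TAG = re.compile(
    r"</?(table|tr|td|th|div|p|h[1-6]|ul|ol|li|span|br)\b[^>]*>",
    re.IGNORECASE,
)
_MD_MARK = re.compile(r"^(#{1,6}\s|\s*[-*+]\s|\s*\d+\.\s|\|.*\|)", re.MULTILINE)


def guess_markup(response: str) -> str:
    """Label a response as 'html', 'markdown', 'mixed', or 'plain'."""
    has_html = bool(_HTML_TAG.search(response))
    has_md = bool(_MD_MARK.search(response))
    if has_html and has_md:
        return "mixed"
    if has_html:
        return "html"
    if has_md:
        return "markdown"
    return "plain"


def summarize(responses) -> Counter:
    """Count how many responses fall under each guessed markup label."""
    return Counter(guess_markup(r) for r in responses)


def load_responses(limit=None):
    """Fetch the `response` strings from the train split (needs network access)."""
    from datasets import load_dataset  # imported lazily so the helpers above work offline

    ds = load_dataset("prithivMLmods/OCR-Markdown-Dense-200x", split="train")
    # Keep only the text column so the page images are not decoded.
    ds = ds.select_columns(["response"])
    if limit is not None:
        ds = ds.select(range(min(limit, len(ds))))
    return [row["response"] for row in ds]
```

With network access and the `datasets` package installed, `summarize(load_responses(limit=20))` would return a `Counter` of guessed labels for the first 20 samples. Because the outputs are synthetic and may contain inconsistencies, the counts should be treated as a sanity check rather than ground truth.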

Provided by:

prithivMLmods
