prithivMLmods/OCR-Markdown-Dense-200x
Source: Hugging Face. Updated 2026-04-21, indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/prithivMLmods/OCR-Markdown-Dense-200x
Description:
---
license: apache-2.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: response
    dtype: string
  splits:
  - name: train
    num_bytes: 231290589
    num_examples: 200
  download_size: 231146700
  dataset_size: 231290589
task_categories:
- image-to-text
language:
- en
tags:
- ocr
- markdown
- image
size_categories:
- n<1K
---
# **OCR-Markdown-Dense-200x**
## Overview
**OCR-Markdown-Dense-200x** is a synthetic dataset designed for dense document OCR tasks. It focuses on extracting structured **HTML/Markdown representations** from densely packed document pages.
The dataset is generated using outputs from open multimodal models, making it suitable for training and evaluating:
* Image-to-Text models
* Image-to-Markdown/HTML models
* Document understanding systems
* OCR post-processing pipelines
## Dataset Details
* **Task Types**: Image-to-Text, Image-Text-to-Text
* **Format**: Image + HTML/Markdown response
* **Language**: English
* **Size**: 200 samples in a single `train` split (about 231 MB)
* **License**: Apache 2.0
Each sample contains:
* `image`: A dense document page
* `response`: Corresponding OCR output in HTML/Markdown format
## Usage
```python
from datasets import load_dataset
# Login using: huggingface-cli login
ds = load_dataset("prithivMLmods/OCR-Markdown-Dense-200x")
```
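As a rough sketch of how a loaded record might be inspected, the helper below summarizes the `image` and `response` fields described above. `summarize_sample` and `load_first_sample` are hypothetical names introduced here for illustration; they are not part of the dataset or of the `datasets` library. The sketch also assumes the `image` column decodes to a PIL image, which is the default behavior of the `datasets` image feature.

```python
from typing import Any, Dict


def summarize_sample(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Return a small summary of one record with `image` and `response` fields."""
    image = sample["image"]
    response = sample["response"]
    return {
        # PIL images expose (width, height) via .size; fall back to None otherwise.
        "image_size": getattr(image, "size", None),
        "response_chars": len(response),
        "response_lines": response.count("\n") + 1 if response else 0,
        # First 80 characters, enough to see whether the text looks like HTML or Markdown.
        "preview": response[:80],
    }


def load_first_sample() -> Dict[str, Any]:
    """Download the dataset and return its first training record (needs network access)."""
    # Imported lazily so summarize_sample can be used without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset("prithivMLmods/OCR-Markdown-Dense-200x", split="train")
    return ds[0]
```

Calling `summarize_sample(load_first_sample())` downloads the data (roughly 231 MB) on first use, so it is best run once and cached.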
## Clone Repository
```bash
# When prompted for a password, use your Hugging Face access token
git clone https://huggingface.co/datasets/prithivMLmods/OCR-Markdown-Dense-200x
```
Generate an access token from:
[https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
## Applications
This dataset can be used for:
* Training OCR models for structured output
* Improving Markdown/HTML reconstruction from images
* Benchmarking multimodal document models
* Fine-tuning LLMs on document parsing tasks
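For the benchmarking use case, one possible way to score a model is to compare its output against the `response` field with a character error rate (CER). The card does not prescribe any metric, so this is only a minimal sketch under that assumption; `levenshtein` and `character_error_rate` are hypothetical helper names written in plain Python.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute, each with cost 1) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(prediction: str, reference: str) -> float:
    """Edit distance divided by reference length; values above 1.0 are possible."""
    if not reference:
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, reference) / len(reference)
```

CER treats markup characters like any other text, so a model that reads the words correctly but emits different table or heading syntax is still penalized. A structure-aware metric may be a better fit when layout fidelity matters more than exact markup.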
## Notes
* The dataset is synthetic and generated using multimodal models
* Outputs may contain minor inconsistencies typical of OCR systems
* Suitable for experimentation and research purposes
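Because the responses are model-generated, a quick sanity pass before training can help. The card does not say which inconsistencies occur, so the checks below (empty text, an odd number of code fences, mismatched `<table>` tags) are illustrative guesses rather than a documented list of known problems. `find_issues` is a hypothetical helper name.

```python
import re
from typing import List

# A fence is a line that starts with three backticks (optionally indented).
_FENCE = re.compile(r"^\s*`{3}", re.MULTILINE)
_TABLE_OPEN = re.compile(r"<table\b", re.IGNORECASE)
_TABLE_CLOSE = re.compile(r"</table\s*>", re.IGNORECASE)


def find_issues(response: str) -> List[str]:
    """Return heuristic warnings for one `response` string (empty list if none)."""
    issues: List[str] = []
    if not response.strip():
        issues.append("empty response")
    # An odd number of fence lines means a code block was opened but never closed.
    if len(_FENCE.findall(response)) % 2 == 1:
        issues.append("unbalanced code fence")
    if len(_TABLE_OPEN.findall(response)) != len(_TABLE_CLOSE.findall(response)):
        issues.append("unbalanced <table> tags")
    return issues
```

These are counting heuristics, not a parser: they can miss malformed markup and can flag valid edge cases, so flagged samples are worth reviewing by hand rather than dropping automatically.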
## License
This dataset is released under the Apache 2.0 License.
Provider:
prithivMLmods



