
prithivMLmods/OCR-Markdown-Dense-200x

Source: Hugging Face. Updated 2026-04-21. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/prithivMLmods/OCR-Markdown-Dense-200x
Resource overview:
---
license: apache-2.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: response
    dtype: string
  splits:
  - name: train
    num_bytes: 231290589
    num_examples: 200
  download_size: 231146700
  dataset_size: 231290589
task_categories:
- image-to-text
language:
- en
tags:
- ocr
- markdown
- image
size_categories:
- n<1K
---

# **OCR-Markdown-Dense-200x**

## Overview

**OCR-Markdown-Dense-200x** is a synthetic dataset designed for dense document OCR tasks. It focuses on extracting structured **HTML/Markdown representations** from densely packed document pages. The dataset is generated using outputs from open multimodal models, making it suitable for training and evaluating:

* Image-to-Text models
* Image-to-Markdown/HTML models
* Document understanding systems
* OCR post-processing pipelines

## Dataset Details

* **Task Types**: Image-to-Text, Image-Text-to-Text
* **Format**: Image + HTML/Markdown response
* **Language**: English
* **Size**: ~200 samples
* **License**: Apache 2.0

Each sample contains:

* `image`: A dense document page
* `response`: Corresponding OCR output in HTML/Markdown format

## Usage

```python
from datasets import load_dataset

# Login using: huggingface-cli login
ds = load_dataset("prithivMLmods/OCR-Markdown-Dense-200x")
```

## Clone Repository

```bash
# When prompted for a password, use your Hugging Face access token
git clone https://huggingface.co/datasets/prithivMLmods/OCR-Markdown-Dense-200x
```

Generate an access token from: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

## Applications

This dataset can be used for:

* Training OCR models for structured output
* Improving Markdown/HTML reconstruction from images
* Benchmarking multimodal document models
* Fine-tuning LLMs on document parsing tasks

## Notes

* The dataset is synthetic and generated using multimodal models
* Outputs may contain minor inconsistencies typical of OCR systems
* Suitable for experimentation and research purposes

## License

This dataset is released under the Apache 2.0 License.
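
Since each sample pairs an `image` with a `response` string in HTML/Markdown format, it can help to check what a few records actually look like before training on them. The following is a minimal sketch, not part of the original dataset card: the helper names `classify_response`, `summarize_sample`, and `load_and_summarize` are hypothetical, and the regex-based format check is a rough heuristic rather than a real parser. It assumes the `image` column decodes to an object with a `size` attribute (as a PIL image does).

```python
import re


def classify_response(text: str) -> str:
    """Heuristically label a response as 'html', 'markdown', or 'plain'.

    Assumption: a handful of common tags or Markdown line prefixes is
    enough to tell the formats apart; this is not a full parser.
    """
    html_tag = r"</?(table|tr|td|th|div|p|h[1-6]|ul|ol|li|br)\b[^>]*>"
    if re.search(html_tag, text, re.IGNORECASE):
        return "html"
    markdown_line = r"^(#{1,6} |[-*] |\d+\. |\|.*\|)"
    if re.search(markdown_line, text, re.MULTILINE):
        return "markdown"
    return "plain"


def summarize_sample(sample: dict) -> dict:
    """Summarize one record using the `image` and `response` fields."""
    response = sample["response"]
    image = sample["image"]
    return {
        # PIL images expose .size as (width, height); None if unavailable.
        "image_size": getattr(image, "size", None),
        "response_chars": len(response),
        "response_kind": classify_response(response),
    }


def load_and_summarize(n: int = 3) -> list:
    """Download the train split and summarize the first n samples.

    Requires network access and the `datasets` package.
    """
    from datasets import load_dataset

    ds = load_dataset("prithivMLmods/OCR-Markdown-Dense-200x", split="train")
    count = min(n, len(ds))
    return [summarize_sample(ds[i]) for i in range(count)]
```

Calling `load_and_summarize()` downloads the data (roughly 231 MB according to the card metadata), so the two pure helpers are kept separate and can be tried on any dictionary with `image` and `response` keys first.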
Provider:
prithivMLmods