OpenDoc-Pdf-Preview
ModelScope community | Updated: 2026-01-06 | Indexed: 2025-06-28
Download link:
https://modelscope.cn/datasets/prithivMLmods/OpenDoc-Pdf-Preview
Description:
# **OpenDoc-Pdf-Preview**
**OpenDoc-Pdf-Preview** is a compact visual preview dataset containing 6,000 high-resolution document images extracted from PDFs. This dataset is designed for **Image-to-Text** tasks such as document OCR pretraining, layout understanding, and multimodal document analysis.
## Dataset Summary
* **Modality:** Image-to-Text
* **Content Type:** PDF-based document previews
* **Number of Samples:** 6,000
* **Language:** English
* **Format:** Parquet
* **Split:** `train` only
* **Size:** 606 MB
* **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
*Each entry consists of:*
* A **preview image** of a PDF page
* A placeholder column named `pdf` (currently appears to be empty, or is reserved for future metadata)
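
Because the card does not name the image column and describes `pdf` as possibly empty, it can help to check a record's fields before building a pipeline on top of them. The following is a minimal sketch: `summarize_record` is a hypothetical helper, not part of the `datasets` library, and it only assumes that a record behaves like a dictionary mapping column names to values.

```python
from typing import Any, Dict, Mapping


def summarize_record(record: Mapping[str, Any]) -> Dict[str, str]:
    """Map each column name to its value's type name, or "empty" if nothing is stored.

    Hypothetical helper: it makes no assumption about column names, since the
    dataset card only confirms a column called `pdf`.
    """
    summary = {}
    for name, value in record.items():
        # Treat None and zero-length strings/containers as empty placeholders.
        is_empty = value is None or (hasattr(value, "__len__") and len(value) == 0)
        summary[name] = "empty" if is_empty else type(value).__name__
    return summary
```

With the dataset loaded as in the How to Use section, `summarize_record(dataset[0])` would list the columns actually present in the first record and flag any that hold no data.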
## Use Cases
* Pretraining OCR or Document Layout models
* PDF snapshot-based search indexing
* Few-shot document vision evaluation
* Visual prompt tuning for vision-language models (VLMs)
## How to Use
```python
from datasets import load_dataset

# Download (or reuse the locally cached copy of) the single `train` split.
dataset = load_dataset("prithivMLmods/OpenDoc-Pdf-Preview", split="train")
```
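
Since the full download is about 606 MB, it may be convenient to look at a few records first. The sketch below assumes the dataset can be opened with `streaming=True`, which makes `load_dataset` return an iterable that fetches records lazily; this has not been verified against this particular dataset. `take` is a hypothetical helper built on `itertools.islice`.

```python
from itertools import islice
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def take(samples: Iterable[T], n: int) -> List[T]:
    """Return at most the first n items, reading no further than needed."""
    return list(islice(samples, n))


# Intended use (needs network access, so it is left commented out here):
# from datasets import load_dataset
# stream = load_dataset(
#     "prithivMLmods/OpenDoc-Pdf-Preview", split="train", streaming=True
# )
# first_three = take(stream, 3)
```

The helper works on any iterable, so the same call can be tried on a small in-memory list before pointing it at the streamed dataset.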
## Notes
* The column `pdf` may be extended in future versions with associated metadata or textual content.
* Each sample preview is rendered from the original PDF file, representing various real-world layouts and formats.
Provider:
maas
Created:
2025-06-26



