Source: ModelScope community (updated 2025-07-03, indexed 2025-05-31)
Download link: https://modelscope.cn/datasets/prithivMLmods/Openpdf-Blank-v2.0-Sample
# Openpdf-Blank-v2.0-Sample
**Openpdf-Blank-v2.0-Sample** is a sample dataset of blank or near-blank invoice and receipt documents. It contains 255 high-resolution scanned images extracted and cleaned from document PDFs. This dataset is intended to support training and evaluation of OCR, document classification, and layout-based filtering models where blank or structurally minimal pages must be identified and processed.
## Dataset Summary
* **Format**: Parquet (auto-converted)
* **Modality**: Image
* **Size**: 84.8 MB
* **Number of Samples**: 255
* **Split**:
  * `train`: 255 images
* **Image Dimensions**: Approximately 1690 x 1690 px
* **License**: Apache 2.0
## Features
* Contains scanned images of documents with minimal content or structural layout only.
* Suitable for:
  * Blank page detection
  * Document filtering
  * Pre-processing pipeline validation
  * Background noise training for OCR tasks
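As an illustration of the blank page detection task listed above, here is a minimal sketch of a pixel-counting baseline. It is not part of the dataset or its tooling: the function names `ink_fraction` and `is_blank`, the darkness threshold of 200, and the 0.5% ink limit are all assumptions chosen for the example, and a real pipeline would likely need them tuned against scanner noise.

```python
def ink_fraction(gray_pixels: bytes, dark_threshold: int = 200) -> float:
    """Return the share of pixels darker than dark_threshold (0 = black, 255 = white)."""
    if not gray_pixels:
        raise ValueError("empty image")
    dark = sum(1 for value in gray_pixels if value < dark_threshold)
    return dark / len(gray_pixels)


def is_blank(gray_pixels: bytes, dark_threshold: int = 200, max_ink: float = 0.005) -> bool:
    """Flag a page as blank when at most max_ink of its pixels are dark.

    Both thresholds are illustrative assumptions, not values from the dataset.
    """
    return ink_fraction(gray_pixels, dark_threshold) <= max_ink


# Synthetic 100 x 100 grayscale pages: one pure white, one with a dark band.
white_page = bytes([255]) * 10_000
banded_page = bytes([255]) * 9_000 + bytes([0]) * 1_000

print(is_blank(white_page))   # True
print(is_blank(banded_page))  # False
```

With Pillow, the 8-bit grayscale bytes for a loaded image can be obtained with `image.convert("L").tobytes()`.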
## How to Use
You can load the dataset using the Hugging Face `datasets` library:
```python
from datasets import load_dataset
dataset = load_dataset("prithivMLmods/Openpdf-Blank-v2.0-Sample")
# Access the first image
image = dataset["train"][0]["image"]
image.show()
```
Each record in the dataset contains:
* `image`: A PIL.Image object of the scanned blank/near-blank page.
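Because each record exposes only an `image` field, a quick sanity check is to tally image dimensions across records. The sketch below assumes rows behave like dictionaries with an `image` value that has a `.size` attribute, as `PIL.Image` objects do. The helper name `size_counts` is hypothetical, and the rows are stand-ins so the example runs without downloading anything.

```python
from collections import Counter
from types import SimpleNamespace


def size_counts(records):
    """Count how often each (width, height) pair occurs across records."""
    return Counter(record["image"].size for record in records)


# Stand-ins for dataset rows; real rows would hold PIL.Image objects,
# which expose the same .size attribute as a (width, height) tuple.
fake_rows = [
    {"image": SimpleNamespace(size=(1690, 1690))},
    {"image": SimpleNamespace(size=(1690, 1690))},
    {"image": SimpleNamespace(size=(1700, 1690))},
]

counts = size_counts(fake_rows)
print(counts.most_common(1))  # [((1690, 1690), 2)]
```

On the loaded dataset, the same helper could be called as `size_counts(dataset["train"])`, which decodes every image once.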
## Use Cases
* Training models to detect and discard blank or non-informative pages in document workflows.
* Evaluating the robustness of OCR pipelines to blank document noise.
* Dataset balancing for invoice or receipt classifiers.
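The dataset balancing use case can be sketched as mixing a sample of blank pages into an existing labelled set. Everything here is an assumption for illustration: the helper name `mix_in_blanks`, the `"blank"` label, the 10% target share, and the placeholder items standing in for real invoice images.

```python
import random


def mix_in_blanks(labelled, blanks, blank_share=0.1, seed=0):
    """Append sampled blank pages so they make up about blank_share of the result."""
    if not 0 <= blank_share < 1:
        raise ValueError("blank_share must be in [0, 1)")
    # Solve n_blank / (len(labelled) + n_blank) = blank_share for n_blank.
    wanted = round(len(labelled) * blank_share / (1 - blank_share))
    n_blank = min(wanted, len(blanks))  # cannot use more blanks than exist
    rng = random.Random(seed)  # seeded for reproducible sampling
    mixed = list(labelled) + [(page, "blank") for page in rng.sample(blanks, n_blank)]
    rng.shuffle(mixed)
    return mixed


# Placeholder items; in practice these would be images and their labels.
labelled = [(f"invoice_{i}", "invoice") for i in range(90)]
blanks = [f"blank_{i}" for i in range(255)]

mixed = mix_in_blanks(labelled, blanks, blank_share=0.1)
print(len(mixed))  # 100
```

Capping the sample at `len(blanks)` matters here because the dataset provides only 255 blank pages.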
Provider: maas
Created: 2025-05-26



