Openpdf-Blank-v2.0
Source: ModelScope community · Updated: 2025-07-03 · Indexed: 2025-05-31
Download link:
https://modelscope.cn/datasets/prithivMLmods/Openpdf-Blank-v2.0
Description:
# Openpdf-Blank-v2.0
Openpdf-Blank-v2.0 is a small dataset of blank or near-blank PDF page images. It is intended primarily for training and evaluating document-processing models on tasks such as:
* Identifying and filtering blank or noise-filled documents.
* Preprocessing stages for OCR pipelines.
* Receipt/document classification tasks.
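As a point of reference for the first task, a simple non-learned baseline flags a page as blank when very few of its pixels are dark. The sketch below is only an illustration and is not part of the dataset: the names `ink_ratio` and `is_blank_page`, the grayscale threshold of 200, and the 0.5% ink cutoff are all hypothetical choices. A page is represented here as a plain list of rows of 0-255 grayscale values.

```python
from typing import Sequence

Page = Sequence[Sequence[int]]  # rows of grayscale pixels, 0 = black, 255 = white


def ink_ratio(page: Page, dark_threshold: int = 200) -> float:
    """Fraction of pixels darker than dark_threshold (hypothetical default)."""
    total = sum(len(row) for row in page)
    if total == 0:
        return 0.0  # treat an empty page as having no ink
    dark = sum(1 for row in page for px in row if px < dark_threshold)
    return dark / total


def is_blank_page(page: Page, dark_threshold: int = 200,
                  max_ink_ratio: float = 0.005) -> bool:
    """Flag a page as blank when its ink ratio is at or below max_ink_ratio."""
    return ink_ratio(page, dark_threshold) <= max_ink_ratio


# Synthetic 100x100 pages: one pure white, one with ten fully dark rows.
white = [[255] * 100 for _ in range(100)]
texty = [[0] * 100 if r < 10 else [255] * 100 for r in range(100)]
print(is_blank_page(white), is_blank_page(texty))  # prints: True False
```

A fixed threshold like this tends to break down on scans with heavy noise or shadows, which is where a model trained on near-blank samples may do better.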
## Dataset Structure
* **Modality**: Image
* **Languages**: English (if applicable)
* **Size**: Fewer than 1,000 samples
* **License**: Apache-2.0
## Usage
```python
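# Requires the Hugging Face `datasets` library (pip install datasets).
from datasets import load_dataset

# Download the dataset and load it into memory.
dataset = load_dataset("prithivMLmods/Openpdf-Blank-v2.0")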
```
## Intended Use
The dataset is suited to training or benchmarking models that must recognize and filter out blank pages in document images, for example in:
* Invoice preprocessing pipelines
* Bulk document OCR systems
* Smart scanning tools
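For benchmarking, a blank-page detector can be scored with precision and recall, treating "blank" as the positive class. The following is a minimal sketch under stated assumptions: the function name `blank_detection_metrics` is hypothetical, and labels are assumed to be booleans where `True` means blank. The dataset description does not state its label schema, so any mapping from its fields to these booleans is left to the reader.

```python
from typing import Dict, Iterable


def blank_detection_metrics(y_true: Iterable[bool],
                            y_pred: Iterable[bool]) -> Dict[str, float]:
    """Precision, recall, and accuracy with True (= blank) as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1   # blank page correctly filtered out
        elif pred:
            fp += 1   # real page wrongly discarded
        elif truth:
            fn += 1   # blank page that slipped through
        else:
            tn += 1   # real page correctly kept
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Hypothetical ground truth and predictions for five pages.
truth = [True, True, False, False, True]
preds = [True, False, False, True, True]
print(blank_detection_metrics(truth, preds))
```

In pipelines such as invoice preprocessing, a false positive discards a real page, so precision on the blank class is often the metric to watch first.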
Provider: maas
Created: 2025-05-26



