Openpdf-Blank-v2.0-Sample

Name: Openpdf-Blank-v2.0-Sample
Creator: maas
Published: 2025-07-03 16:28:57
License: No description provided

ModelScope Community, updated 2025-07-03, indexed 2025-05-31

Download link:

https://modelscope.cn/datasets/prithivMLmods/Openpdf-Blank-v2.0-Sample

Resource description:

# Openpdf-Blank-v2.0-Sample

**Openpdf-Blank-v2.0-Sample** is a sample dataset of blank or near-blank invoice and receipt documents. It contains 255 high-resolution scanned images extracted and cleaned from document PDFs. This dataset is intended to support training and evaluation of OCR, document classification, and layout-based filtering models where blank or structurally minimal pages must be identified and processed.

## Dataset Summary

* **Format**: Parquet (auto-converted)
* **Modality**: Image
* **Size**: 84.8 MB
* **Number of Samples**: 255
* **Split**:
  * `train`: 255 images
* **Image Dimensions**: Approximately 1690 x 1690 px
* **License**: Apache 2.0

## Features

* Contains scanned images of documents with minimal content or structural layout only.
* Suitable for:
  * Blank page detection
  * Document filtering
  * Pre-processing pipeline validation
  * Background noise training for OCR tasks

## How to Use

You can load the dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Openpdf-Blank-v2.0-Sample")

# Access the first image
image = dataset["train"][0]["image"]
image.show()
```

Each record in the dataset contains:

* `image`: A PIL.Image object of the scanned blank/near-blank page.

## Use Cases

* Training models to detect and discard blank or non-informative pages in document workflows.
* Evaluating the robustness of OCR pipelines to blank document noise.
* Dataset balancing for invoice or receipt classifiers.
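As an illustration of the blank page detection use case, the following is a minimal sketch of a threshold-based baseline. It is not part of the dataset or its tooling: the function names `ink_fraction` and `is_near_blank`, the darkness threshold of 128, and the 1% ink cutoff are all assumptions chosen for the example, and real scans with noise or faint layout lines would likely need tuned values. The functions take a flat sequence of 8-bit grayscale values so that they run without any imaging library.

```python
from typing import Iterable


def ink_fraction(pixels: Iterable[int], dark_threshold: int = 128) -> float:
    """Return the share of pixels darker than dark_threshold.

    Pixels are 8-bit grayscale values (0 = black, 255 = white).
    The default threshold is an assumption, not a value from the dataset card.
    """
    total = 0
    dark = 0
    for value in pixels:
        total += 1
        if value < dark_threshold:
            dark += 1
    if total == 0:
        raise ValueError("no pixels supplied")
    return dark / total


def is_near_blank(pixels: Iterable[int], max_ink: float = 0.01,
                  dark_threshold: int = 128) -> bool:
    """Treat a page as near-blank when at most max_ink of its pixels are dark."""
    return ink_fraction(pixels, dark_threshold) <= max_ink


# Synthetic 100-pixel "pages" stand in for real scans here.
white_page = bytes([255] * 100)            # entirely white
text_page = bytes([255] * 80 + [0] * 20)   # 20% dark pixels

print(is_near_blank(white_page))
print(is_near_blank(text_page))
```

For a record loaded as shown above, the grayscale bytes can be obtained with `image.convert("L").tobytes()` and passed directly to `is_near_blank`. Because every page in this dataset is described as blank or near-blank, such a heuristic would mainly serve as a sanity check or as a baseline against which a trained classifier is compared.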

Provider:

maas

Created:

2025-05-26
