
ocr-pdf-degraded

ModelScope community · updated 2026-01-06 · indexed 2025-07-05
Download link:
https://modelscope.cn/datasets/racineai/ocr-pdf-degraded
Description:
# OCR-PDF-Degraded Dataset

## Overview

This dataset contains synthetically degraded document images paired with their ground truth OCR text. It addresses a gap in OCR model training by providing realistic document degradations that simulate conditions encountered in production environments.

## Purpose

Most OCR models are trained on relatively clean, well-scanned documents. In real-world applications, however, and especially in the military/defense sector, documents may be poorly scanned, photographed in suboptimal lighting, or degraded by environmental factors. This dataset aims to:

1. Enable the training of more robust OCR models that can handle imperfect document inputs
2. Establish a standardized benchmark for evaluating OCR performance under various degradation conditions
3. Bridge the gap between lab performance and real-world deployment for document processing systems

## Domain Focus

This first iteration focuses specifically on military/defense sector documents. These documents:

- Contain specialized terminology and formatting
- Often include tables, diagrams, and structured information
- May include mission-critical information where accurate OCR is essential
- Represent a sector where document digitization processes may not always be ideal

## Dataset Creation Process

The dataset was created by systematically degrading clean PDF documents:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65bd1f3530ed309cb8cba833/zmyfiIIHJCUGBceBAZ3oh.png)

The process includes:

1. Starting with clean military/defense PDF documents
2. Extracting individual pages
3. Performing OCR on the clean pages to establish ground truth text
4. Applying various degradation effects to simulate real-world conditions
5. Recording both the degraded images and the corresponding degradation parameters

## Degradation Parameters

The dataset includes the following degradation types:

- **Noise**: Random pixel noise at different intensities
- **Lighting**: Uneven illumination effects with varying intensities and positions
- **Perspective**: Distortions simulating non-flat document captures
- **Artifacts**: Lines, spots, and other common scanner/camera artifacts
- **Image Quality**: Variations in blur, brightness, contrast, and JPEG compression

Each image in the dataset is stored with its specific parameter values, allowing for targeted evaluation and training.

## Usage Examples

The snippet below loads the dataset with the `datasets` library and reads the image, ground truth text, and degradation parameters of one sample:

```python
# Example: Loading and using the dataset
from datasets import load_dataset
import json

dataset = load_dataset("racineai/ocr-pdf-degraded", split="train")

# Access a sample
sample = dataset[0]

# Get the degraded image
image = sample["image"]

# Get the ground truth OCR text
text = sample["ocr_text"]

# Access degradation parameters (for targeted training/evaluation).
# The parameters are stored as a JSON string, so decode them first.
params = json.loads(sample["params"])
noise_level = params["noise_level"]
print(noise_level)
```

## Limitations and Future Work

- The current iteration covers only military/defense documents
- Expansion to the legal, medical, and financial sectors is planned
- Future versions may include handwritten text degradations
- Support for multi-page document context is in progress

## Citation

If you use this dataset in your research, please cite:

```
@misc{racineai_ocr_pdf_degraded,
  author = {RacineAI},
  title = {OCR-PDF-Degraded: Synthetically Degraded Documents for Robust OCR},
  year = {2025},
  url = {https://huggingface.co/datasets/racineai/ocr-pdf-degraded}
}
```

## License

Apache 2.0
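
## Supplementary Sketch: Scoring OCR Output by Degradation Level

The benchmarking goal stated under Purpose can be illustrated with a small, dependency-free sketch that computes character error rate (CER) and averages it per value of one degradation parameter. This is not part of the dataset's tooling. The `prediction` field, the toy records, and the helper names `edit_distance`, `cer`, and `cer_by_bucket` are assumptions made for illustration; only `ocr_text`, `params`, and `noise_level` come from the usage example above, and `params` is assumed to be already decoded from JSON.

```python
from collections import defaultdict


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        # Convention chosen here: empty reference scores 0 only if hyp is empty too.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def cer_by_bucket(records, key="noise_level"):
    """Mean CER for each distinct value of one degradation parameter.

    Each record is a dict with 'ocr_text' (ground truth), 'prediction'
    (hypothetical model output), and 'params' (decoded parameter dict).
    """
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["params"][key]].append(cer(rec["ocr_text"], rec["prediction"]))
    return {level: sum(scores) / len(scores) for level, scores in buckets.items()}


# Toy records standing in for dataset samples plus model predictions.
# The text and noise values are invented for this sketch.
records = [
    {"ocr_text": "MISSION BRIEF", "prediction": "MISSION BRIEF",
     "params": {"noise_level": 0.1}},
    {"ocr_text": "MISSION BRIEF", "prediction": "MISSI0N BR1EF",
     "params": {"noise_level": 0.5}},
]
print(cer_by_bucket(records))
```

In practice the records would be built by running an OCR model over `sample["image"]` and pairing its output with `sample["ocr_text"]`. Continuous parameters would likely need to be binned into ranges before grouping, since exact values may rarely repeat.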

Provider: maas
Created: 2025-07-04