kpurkayastha/OPRB

Name: kpurkayastha/OPRB
Creator: kpurkayastha
Published: 2026-04-10 07:35:27
License: not described in the listing (the dataset card specifies CC BY 4.0)

Source: Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/kpurkayastha/OPRB

Description:

---
license: cc-by-4.0
language:
- en
task_categories:
- other
tags:
- document-restoration
- occlusion
- document-understanding
- docbank
- layout-analysis
- token-classification
pretty_name: OPRB (Occluded Pages Restoration Benchmark)
size_categories:
- 10K<n<100K
dataset_info:
  splits:
  - name: train
    num_examples: 21078
  - name: test
    num_examples: 9000
---

# OPRB: Occluded Pages Restoration Benchmark

OPRB is a large-scale benchmark dataset for evaluating document page restoration from physical occlusion artifacts. Given a document page that has been partially obscured (by stamps, ink, whitener fluid, dust, scribbles, etc.), the task is to recover the original token-level annotations, that is, to restore the positions, fonts, and semantic categories of words that have been corrupted or hidden by occlusion.

The dataset is derived from [DocBank](https://doc-analysis.github.io/docbank-page/index.html), a weakly-supervised dataset of arXiv papers with fine-grained token-level layout annotations. OPRB adds synthetic occlusion to DocBank pages, providing paired (occluded ↔ ground-truth) annotation files for training and evaluating restoration models.
---

## Dataset Structure

Each sample consists of **four paired files** sharing the same base name:

| File type | Location | Format | Description |
|-----------|----------|--------|-------------|
| Occluded image | `{split}/{category}/img/` | `.jpg` | Page image **after** occlusion is applied |
| Occluded annotation | `{split}/{category}/annots/` | `.txt` | Token-level annotation of the occluded page |
| Ground truth image | `{split}/{category}/img_gt/` | `.jpg` | Original **clean** page image |
| Ground truth annotation | `{split}/{category}/gt/` | `.txt` | Token-level annotation of the clean page |

The two occlusion categories are:

- **`text/`**: occlusions that primarily overlap with text regions
- **`nontext/`**: occlusions that primarily overlap with non-text (figure, table, equation) regions

### Repository layout

Each directory is stored as a gzip-compressed tar archive to keep the repository compact. All archives together total approximately **9 GB** (images are stored as JPEG, so they compress minimally; annotations compress ~10:1).

```
OPRB/
├── README.md
├── train/
│   ├── text/
│   │   ├── img.tar.gz
│   │   ├── img_gt.tar.gz
│   │   ├── annots.tar.gz
│   │   └── gt.tar.gz
│   └── nontext/
│       ├── img.tar.gz
│       ├── img_gt.tar.gz
│       ├── annots.tar.gz
│       └── gt.tar.gz
└── test/
    ├── text/
    │   ├── img.tar.gz
    │   ├── img_gt.tar.gz
    │   ├── annots.tar.gz
    │   └── gt.tar.gz
    └── nontext/
        ├── img.tar.gz
        ├── img_gt.tar.gz
        ├── annots.tar.gz
        └── gt.tar.gz
```

To extract an archive (run from the repository root, using the archive path shown in the layout above):

```bash
mkdir -p train/text/img
tar -xzf train/text/img.tar.gz -C train/text/img/
```

### Split statistics

| Split | text samples | nontext samples | Total samples |
|-------|-------------|-----------------|---------------|
| train | 17,212 | 3,866 | 21,078 |
| test | 6,000 | 3,000 | 9,000 |
| **Total** | **23,212** | **6,866** | **30,078** |

Each sample = 1 occluded image + 1 clean image + 1 occluded annotation + 1 clean annotation.
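The pairing rule above (four files sharing one base name) can be checked once the archives are extracted. The following is a minimal sketch, assuming the four sub-directories `img/`, `annots/`, `img_gt/`, and `gt/` sit under one category directory such as `data/train/text`. `PARTS` and `index_samples` are hypothetical names, not part of any OPRB tooling, and the recursive search is only there because the internal folder layout of the archives is not documented in this card.

```python
from pathlib import Path

# Sub-directory name -> file extension, taken from the file-type table above.
PARTS = {"img": ".jpg", "annots": ".txt", "img_gt": ".jpg", "gt": ".txt"}


def index_samples(category_dir):
    """Group extracted files by base name.

    Returns (complete, incomplete): two dicts mapping a base name to
    {part: path}. A sample is complete when all four parts are present.
    """
    category_dir = Path(category_dir)
    by_part = {}
    for part, ext in PARTS.items():
        part_dir = category_dir / part
        # rglob tolerates an extra top-level folder inside an archive (assumption).
        files = part_dir.rglob(f"*{ext}") if part_dir.is_dir() else []
        # Strip only the final extension; base names contain other dots.
        by_part[part] = {p.name[: -len(ext)]: p for p in files}

    complete, incomplete = {}, {}
    for base in sorted(set().union(*by_part.values())):
        found = {part: files[base] for part, files in by_part.items() if base in files}
        target = complete if len(found) == len(PARTS) else incomplete
        target[base] = found
    return complete, incomplete
```

Calling `index_samples("data/train/text")` and comparing `len(complete)` with the 17,212 text samples listed for the train split is one way to spot a partial download or a failed extraction; any entries in `incomplete` name the parts that were found.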
---

## File Format

Each annotation file is a **tab-separated text file** in which every row represents one token (word or character) on the page.

### Columns

| # | Field | Description |
|---|-------|-------------|
| 1 | `token` | The text token (word, character, or symbol) |
| 2–9 | `x1 y1 x2 y2 x3 y3 x4 y4` | Eight bounding-box coordinates describing the **rotated quadrilateral** enclosing the token (in page pixel space). For axis-aligned boxes, the four corners form a rectangle. |
| 10 | `font_name` | Name of the font used to render the token (e.g., `CMSY10`, `NimbusRomNo9L-Regu`) |
| 11 | `category` | Semantic document-element category. One of: `title`, `author`, `abstract`, `paragraph`, `section`, `equation`, `figure`, `table`, `caption`, `reference`, `footer`, `date`, `email`, `list` |

## File Naming Convention

```
{OcclusionType}__{density}__{batchId}.tar_{arXivId}.gz_{paperName}_{pageIndex}_ori.txt
```

| Part | Example | Meaning |
|------|---------|---------|
| `OcclusionType` | `White_Whitener` | Type of occlusion applied (see table below) |
| `density` | `2p0pct` | Area fraction of the page that is occluded (`2p0pct` = 2.0%) |
| `batchId` | `105` | DocBank source batch identifier |
| `arXivId` | `1804.06143` | arXiv paper ID (followed by a literal `.gz` in the file name) |
| `paperName` | `MassiveMIMO` | Shortened paper/file name |
| `pageIndex` | `10` | Zero-based page index within the paper |

---

## Occlusion Types

| Type | Category | Description |
|------|----------|-------------|
| `Black_Scribble` | Dark | Handwritten-style black scribble marks overlaid on the page |
| `Black_Ink` | Dark | Solid black ink blotches or strokes |
| `Through_Stamp` | See Through | Stamp-like markings that partially show through (semi-transparent) |
| `Through_Dust` | See Through | Fine dust or speckle texture simulating aged/dirty documents |
| `White_Burnt` | Light | Burnt / bleached white patches obscuring text |
| `White_Whitener` | Light | White correction-fluid (liquid paper) covering text |
| `Sim` | Synthetic Scribble | Simulated scribble artifacts |
| `Mixed` | Combined | A combination of two or more occlusion types on the same page |

---

## How to Load

### Step 1: Download and extract archives

```python
import os
import tarfile

from huggingface_hub import hf_hub_download


def download_and_extract(repo_id, archive_repo_path, extract_to):
    """Download a tar.gz archive from the Hub and extract it locally."""
    os.makedirs(extract_to, exist_ok=True)
    # hf_hub_download caches the file and returns its local path.
    local_path = hf_hub_download(
        repo_id=repo_id,
        filename=archive_repo_path,
        repo_type="dataset",
    )
    with tarfile.open(local_path, "r:gz") as tar:
        tar.extractall(extract_to)


REPO = "kpurkayastha/OPRB"

# Download train text split (images + annotations)
download_and_extract(REPO, "train/text/img.tar.gz", "data/train/text/img")
download_and_extract(REPO, "train/text/img_gt.tar.gz", "data/train/text/img_gt")
download_and_extract(REPO, "train/text/annots.tar.gz", "data/train/text/annots")
download_and_extract(REPO, "train/text/gt.tar.gz", "data/train/text/gt")
```

### Step 2: Load a paired sample

```python
from pathlib import Path

from PIL import Image


def load_annotation(filepath):
    """Load an OPRB annotation file into a list of token dicts."""
    tokens = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            # Skip blank or malformed rows that lack the 11 expected columns.
            if len(parts) < 11:
                continue
            tokens.append({
                "token": parts[0],
                "bbox": [int(x) for x in parts[1:9]],  # 8 rotated-bbox coords
                "font": parts[9],
                "category": parts[10],
            })
    return tokens


base = "White_Whitener__2p0pct__105.tar_1804.06143.gz_MassiveMIMO_10_ori"

# Occluded inputs
occluded_img = Image.open(f"data/train/text/img/{base}.jpg")
occluded_annot = load_annotation(f"data/train/text/annots/{base}.txt")

# Ground truth targets
clean_img = Image.open(f"data/train/text/img_gt/{base}.jpg")
clean_annot = load_annotation(f"data/train/text/gt/{base}.txt")
```

---

## Provenance

OPRB is built on top of **DocBank** ([Li et al., 2020](https://arxiv.org/abs/2006.01038)), a large-scale dataset for document layout analysis containing 500K document pages from arXiv papers with token-level bounding-box and semantic category annotations. Synthetic occlusions were applied programmatically to DocBank pages to produce the OPRB training and test splits.

```bibtex
@dataset{oprb2026,
  title  = {OPRB: Occluded Pages Restoration Benchmark},
  author = {Purkayastha, Kunal},
  year   = {2026},
  url    = {https://huggingface.co/datasets/kpurkayastha/OPRB},
}

@inproceedings{li2020docbank,
  title     = {DocBank: A Benchmark Dataset for Document Layout Analysis},
  author    = {Li, Minghao and Xu, Yiheng and Cui, Lei and Huang, Shaohan and Wei, Furu and Li, Zhoujun and Zhou, Ming},
  booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
  year      = {2020},
}
```

---

## License

This dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license. You are free to share and adapt the material for any purpose, provided appropriate credit is given. The underlying DocBank annotations are also available under CC BY 4.0.
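As a supplement to the File Naming Convention section, the parts of a file name can be recovered with a regular expression, which is handy for grouping results by occlusion type or density. This is a minimal sketch built from the single example name given in this card: the density format (`2p0pct`) and the digit-only batch ID and page index are assumptions, and `NAME_RE` and `parse_sample_name` are hypothetical names rather than part of any OPRB tooling.

```python
import re

# Assumption: density always looks like "2p0pct", and the batch ID and page
# index are plain digits. Only one example name is documented in the card.
NAME_RE = re.compile(
    r"^(?P<occlusion_type>.+?)__(?P<density>\d+p\d+pct)__(?P<batch_id>\d+)\.tar_"
    r"(?P<arxiv_id>.+?)\.gz_(?P<paper_name>.+)_(?P<page_index>\d+)_ori$"
)


def parse_sample_name(name):
    """Split an OPRB file name into its documented parts, or return None."""
    # Drop a trailing .txt/.jpg by hand; the name contains other dots.
    for ext in (".txt", ".jpg"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    match = NAME_RE.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    # "2p0pct" -> 2.0 (percent of the page area that is occluded)
    parts["density_pct"] = float(parts.pop("density")[:-3].replace("p", "."))
    parts["page_index"] = int(parts["page_index"])
    return parts
```

Names that do not follow the pattern return `None` instead of raising, so unexpected files can be logged and skipped.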
