
kpurkayastha/OPRB

Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/kpurkayastha/OPRB
Resource description:
---
license: cc-by-4.0
language:
- en
task_categories:
- other
tags:
- document-restoration
- occlusion
- document-understanding
- docbank
- layout-analysis
- token-classification
pretty_name: OPRB (Occluded Pages Restoration Benchmark)
size_categories:
- 10K<n<100K
dataset_info:
  splits:
  - name: train
    num_examples: 21078
  - name: test
    num_examples: 9000
---

# OPRB: Occluded Pages Restoration Benchmark

OPRB is a large-scale benchmark dataset for evaluating document page restoration from physical occlusion artifacts. Given a document page that has been partially obscured (by stamps, ink, whitener fluid, dust, scribbles, etc.), the task is to recover the original token-level annotations: restoring the positions, fonts, and semantic categories of words that have been corrupted or hidden by occlusion.

The dataset is derived from [DocBank](https://doc-analysis.github.io/docbank-page/index.html), a weakly-supervised dataset of arXiv papers with fine-grained token-level layout annotations. OPRB adds synthetic occlusion to DocBank pages, providing paired (occluded ↔ ground-truth) annotation files for training and evaluating restoration models.

---

## Dataset Structure

Each sample consists of **four paired files** sharing the same base name:

| File type | Location | Format | Description |
|-----------|----------|--------|-------------|
| Occluded image | `{split}/{category}/img/` | `.jpg` | Page image **after** occlusion is applied |
| Occluded annotation | `{split}/{category}/annots/` | `.txt` | Token-level annotation of the occluded page |
| Ground truth image | `{split}/{category}/img_gt/` | `.jpg` | Original **clean** page image |
| Ground truth annotation | `{split}/{category}/gt/` | `.txt` | Token-level annotation of the clean page |

The two occlusion categories are:

- **`text/`**: occlusions that primarily overlap with text regions
- **`nontext/`**: occlusions that primarily overlap with non-text (figure, table, equation) regions

### Repository layout

Each directory is stored as a gzip-compressed tar archive to keep the repository compact. All archives together total approximately **9 GB** (the images are already JPEG-compressed, so gzip shrinks them very little; annotations compress about 10:1).

```
OPRB/
├── README.md
├── train/
│   ├── text/
│   │   ├── img.tar.gz
│   │   ├── img_gt.tar.gz
│   │   ├── annots.tar.gz
│   │   └── gt.tar.gz
│   └── nontext/
│       ├── img.tar.gz
│       ├── img_gt.tar.gz
│       ├── annots.tar.gz
│       └── gt.tar.gz
└── test/
    ├── text/
    │   ├── img.tar.gz
    │   ├── img_gt.tar.gz
    │   ├── annots.tar.gz
    │   └── gt.tar.gz
    └── nontext/
        ├── img.tar.gz
        ├── img_gt.tar.gz
        ├── annots.tar.gz
        └── gt.tar.gz
```

To extract an archive:

```bash
mkdir -p train/text/img
tar -xzf train/text/img.tar.gz -C train/text/img/
```

### Split statistics

| Split | text samples | nontext samples | Total samples |
|-------|-------------|-----------------|---------------|
| train | 17,212 | 3,866 | 21,078 |
| test | 6,000 | 3,000 | 9,000 |
| **Total** | **23,212** | **6,866** | **30,078** |

Each sample = 1 occluded image + 1 clean image + 1 occluded annotation + 1 clean annotation.

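Because the four files of a sample share one base name, a sample can be located from that name alone. The sketch below is one possible way to do this, assuming the archives have already been extracted into a local tree that mirrors `{split}/{category}/{img,annots,img_gt,gt}/`. The helper names `sample_paths` and `complete_samples` and the `PARTS` mapping are illustrative and not part of the dataset.

```python
from pathlib import Path

# Sub-directory name -> file extension, as listed in the Dataset Structure table.
PARTS = {"img": ".jpg", "annots": ".txt", "img_gt": ".jpg", "gt": ".txt"}


def sample_paths(root, split, category, base):
    """Return the four paths (hypothetical helper) that make up one sample."""
    folder = Path(root) / split / category
    return {part: folder / part / f"{base}{ext}" for part, ext in PARTS.items()}


def complete_samples(root, split, category):
    """Return the sorted base names that are present in all four sub-directories."""
    folder = Path(root) / split / category
    stems = [
        {p.stem for p in (folder / part).glob(f"*{ext}")}
        for part, ext in PARTS.items()
    ]
    return sorted(set.intersection(*stems))
```

Running `complete_samples("data", "train", "text")` after extraction and comparing its length with the split statistics is a quick way to spot an archive that was only partially extracted.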
---

## File Format

Each annotation file is a **tab-separated text file** where every row represents one token (word or character) on the page.

### Columns

| # | Field | Description |
|---|-------|-------------|
| 1 | `token` | The text token (word, character, or symbol) |
| 2–9 | `x1 y1 x2 y2 x3 y3 x4 y4` | Eight bounding-box coordinates describing the **rotated quadrilateral** enclosing the token (in page pixel space). For axis-aligned boxes, the four corners form a rectangle. |
| 10 | `font_name` | Name of the font used to render the token (e.g., `CMSY10`, `NimbusRomNo9L-Regu`) |
| 11 | `category` | Semantic document-element category. One of: `title`, `author`, `abstract`, `paragraph`, `section`, `equation`, `figure`, `table`, `caption`, `reference`, `footer`, `date`, `email`, `list` |

## File Naming Convention

```
{OcclusionType}__{density}__{batchId}.tar_{arXivId}.gz_{paperName}_{pageIndex}_ori.txt
```

| Part | Example | Meaning |
|------|---------|---------|
| `OcclusionType` | `White_Whitener` | Type of occlusion applied (see table below) |
| `density` | `2p0pct` | Area fraction of the page that is occluded (`2p0pct` = 2.0%) |
| `batchId` | `105` | DocBank source batch identifier |
| `arXivId` | `1804.06143` | arXiv paper ID |
| `paperName` | `MassiveMIMO` | Shortened paper/file name |
| `pageIndex` | `10` | Zero-based page index within the paper |

---

## Occlusion Types

| Type | Category | Description |
|------|----------|-------------|
| `Black_Scribble` | Dark | Handwritten-style black scribble marks overlaid on the page |
| `Black_Ink` | Dark | Solid black ink blotches or strokes |
| `Through_Stamp` | See Through | Stamp-like markings that partially show through (semi-transparent) |
| `Through_Dust` | See Through | Fine dust or speckle texture simulating aged/dirty documents |
| `White_Burnt` | Light | Burnt / bleached white patches obscuring text |
| `White_Whitener` | Light | White correction-fluid (liquid paper) covering text |
| `Sim` | Synthetic Scribble | Simulated scribble artifacts |
| `Mixed` | Combined | A combination of two or more occlusion types on the same page |

---

## How to Load

### Step 1: Download and extract archives

```python
import os
import tarfile

from huggingface_hub import hf_hub_download


def download_and_extract(repo_id, archive_repo_path, extract_to):
    """Download a tar.gz archive from the Hub and extract it locally."""
    os.makedirs(extract_to, exist_ok=True)
    local_path = hf_hub_download(
        repo_id=repo_id,
        filename=archive_repo_path,
        repo_type="dataset",
    )
    with tarfile.open(local_path, "r:gz") as tar:
        tar.extractall(extract_to)


REPO = "kpurkayastha/OPRB"

# Download train text split (images + annotations)
download_and_extract(REPO, "train/text/img.tar.gz", "data/train/text/img")
download_and_extract(REPO, "train/text/img_gt.tar.gz", "data/train/text/img_gt")
download_and_extract(REPO, "train/text/annots.tar.gz", "data/train/text/annots")
download_and_extract(REPO, "train/text/gt.tar.gz", "data/train/text/gt")
```

### Step 2: Load a paired sample

```python
from pathlib import Path
from PIL import Image


def load_annotation(filepath):
    """Load an OPRB annotation file into a list of token dicts."""
    tokens = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 11:
                continue  # skip rows that do not have all 11 columns
            tokens.append({
                "token": parts[0],
                "bbox": [int(x) for x in parts[1:9]],  # 8 rotated-bbox coords
                "font": parts[9],
                "category": parts[10],
            })
    return tokens


base = "White_Whitener__2p0pct__105.tar_1804.06143.gz_MassiveMIMO_10_ori"

# Occluded inputs
occluded_img = Image.open(f"data/train/text/img/{base}.jpg")
occluded_annot = load_annotation(f"data/train/text/annots/{base}.txt")

# Ground truth targets
clean_img = Image.open(f"data/train/text/img_gt/{base}.jpg")
clean_annot = load_annotation(f"data/train/text/gt/{base}.txt")
```

---

## Provenance

OPRB is built on top of **DocBank** ([Li et al., 2020](https://arxiv.org/abs/2006.01038)), a large-scale dataset for document layout analysis containing 500K document pages from arXiv papers with token-level bounding-box and semantic category annotations. Synthetic occlusions were applied programmatically to DocBank pages to produce the OPRB training and test splits.

```bibtex
@dataset{oprb2026,
  title  = {OPRB: Occluded Pages Restoration Benchmark},
  author = {Purkayastha, Kunal},
  year   = {2026},
  url    = {https://huggingface.co/datasets/kpurkayastha/OPRB},
}

@inproceedings{li2020docbank,
  title     = {DocBank: A Benchmark Dataset for Document Layout Analysis},
  author    = {Li, Minghao and Xu, Yiheng and Cui, Lei and Huang, Shaohan and Wei, Furu and Li, Zhe and Zhou, Ming},
  booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
  year      = {2020},
}
```

---

## License

This dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license. You are free to share and adapt the material for any purpose, provided appropriate credit is given. The underlying DocBank annotations are also available under CC BY 4.0.
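
The file naming convention described earlier encodes the occlusion type, density, and source page, so these fields can be recovered from a file name without opening the file. The following is a minimal sketch, assuming every file follows the documented template and that densities always look like the single documented example (`2p0pct`). The regular expression, the `SampleName` tuple, and `parse_sample_name` are illustrative and do not ship with the dataset.

```python
import re
from typing import NamedTuple


class SampleName(NamedTuple):
    occlusion_type: str
    density_pct: float
    batch_id: str
    arxiv_id: str
    paper_name: str
    page_index: int


# Assumed pattern, inferred from the naming template and its one example.
_NAME_RE = re.compile(
    r"^(?P<occ>.+?)__(?P<density>\d+p\d+pct)__(?P<batch>\d+)\.tar_"
    r"(?P<arxiv>.+?)\.gz_(?P<paper>.+)_(?P<page>\d+)_ori(?:\.(?:txt|jpg))?$"
)


def parse_sample_name(name: str) -> SampleName:
    """Split an OPRB base name or file name into its documented parts."""
    m = _NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized OPRB file name: {name!r}")
    # "2p0pct" -> "2p0" -> "2.0" -> 2.0
    density = float(m["density"][:-3].replace("p", "."))
    return SampleName(
        m["occ"], density, m["batch"], m["arxiv"], m["paper"], int(m["page"])
    )
```

Parsed names make it straightforward to group samples by `occlusion_type` or `density_pct`, for example to report restoration quality separately per occlusion type.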
Provided by:
kpurkayastha