kpurkayastha/OPRB
Source: Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link: https://hf-mirror.com/datasets/kpurkayastha/OPRB
Description:
---
license: cc-by-4.0
language:
- en
task_categories:
- other
tags:
- document-restoration
- occlusion
- document-understanding
- docbank
- layout-analysis
- token-classification
pretty_name: OPRB (Occluded Pages Restoration Benchmark)
size_categories:
- 10K<n<100K
dataset_info:
  splits:
  - name: train
    num_examples: 21078
  - name: test
    num_examples: 9000
---
# OPRB: Occluded Pages Restoration Benchmark
OPRB is a large-scale benchmark dataset for evaluating document page restoration from physical occlusion artifacts. Given a document page that has been partially obscured (by stamps, ink, whitener fluid, dust, scribbles, etc.), the task is to recover the original token-level annotations — restoring the positions, fonts, and semantic categories of words that have been corrupted or hidden by occlusion.
The dataset is derived from [DocBank](https://doc-analysis.github.io/docbank-page/index.html), a weakly-supervised dataset of arXiv papers with fine-grained token-level layout annotations. OPRB adds synthetic occlusion to DocBank pages, providing paired (occluded ↔ ground-truth) annotation files for training and evaluating restoration models.
---
## Dataset Structure
Each sample consists of **four paired files** sharing the same base name:
| File type | Location | Format | Description |
|-----------|----------|--------|-------------|
| Occluded image | `{split}/{category}/img/` | `.jpg` | Page image **after** occlusion is applied |
| Occluded annotation | `{split}/{category}/annots/` | `.txt` | Token-level annotation of the occluded page |
| Ground truth image | `{split}/{category}/img_gt/` | `.jpg` | Original **clean** page image |
| Ground truth annotation | `{split}/{category}/gt/` | `.txt` | Token-level annotation of the clean page |
The two occlusion categories are:
- **`text/`** — occlusions that primarily overlap with text regions
- **`nontext/`** — occlusions that primarily overlap with non-text (figure, table, equation) regions
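
The pairing described above can be sketched as a small path helper. This is a minimal sketch, assuming the archives have already been extracted into directories that mirror the `{split}/{category}/{subdir}/` layout with files placed directly inside; the function names `sample_paths` and `list_base_names` and the dictionary keys are hypothetical, not part of the dataset.

```python
from pathlib import Path

# Sub-directory and extension for each of the four paired files
# (taken from the table above; the dictionary keys are illustrative names).
SUBDIRS = {
    "image": ("img", ".jpg"),
    "annotation": ("annots", ".txt"),
    "gt_image": ("img_gt", ".jpg"),
    "gt_annotation": ("gt", ".txt"),
}


def sample_paths(root, split, category, base_name):
    """Return the four file paths that make up one sample."""
    base = Path(root) / split / category
    return {
        key: base / subdir / f"{base_name}{ext}"
        for key, (subdir, ext) in SUBDIRS.items()
    }


def list_base_names(root, split, category):
    """List base names found in the extracted occluded-image directory."""
    img_dir = Path(root) / split / category / "img"
    # Path.stem drops only the final ".jpg", so dots inside the name survive.
    return sorted(p.stem for p in img_dir.glob("*.jpg"))
```

Iterating over `list_base_names(...)` and calling `sample_paths(...)` for each name gives one dictionary of four paths per sample.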
### Repository layout
Each directory is stored as a gzip-compressed tar archive to keep the repository compact. All archives together total approximately **9 GB** (images are stored as JPEG so they compress minimally; annotations compress ~10:1).
```
OPRB/
├── README.md
├── train/
│   ├── text/
│   │   ├── img.tar.gz
│   │   ├── img_gt.tar.gz
│   │   ├── annots.tar.gz
│   │   └── gt.tar.gz
│   └── nontext/
│       ├── img.tar.gz
│       ├── img_gt.tar.gz
│       ├── annots.tar.gz
│       └── gt.tar.gz
└── test/
    ├── text/
    │   ├── img.tar.gz
    │   ├── img_gt.tar.gz
    │   ├── annots.tar.gz
    │   └── gt.tar.gz
    └── nontext/
        ├── img.tar.gz
        ├── img_gt.tar.gz
        ├── annots.tar.gz
        └── gt.tar.gz
```
To extract an archive:
```bash
mkdir -p train/text/img
tar -xzf train/text/img.tar.gz -C train/text/img/
```
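If the whole repository has been downloaded locally, the same extraction can be applied to every archive in one pass. The following is a minimal sketch, assuming each `*.tar.gz` should be unpacked into a sibling directory named after the archive (for example `train/text/img.tar.gz` into `train/text/img/`); the helper name `extract_all` is hypothetical. It uses the `data` extraction filter when the running Python version provides it.

```python
import tarfile
from pathlib import Path


def extract_all(root):
    """Extract every *.tar.gz under root into a sibling directory."""
    extracted = []
    for archive in sorted(Path(root).rglob("*.tar.gz")):
        # "img.tar.gz" -> directory "img" next to the archive
        target = archive.with_name(archive.name[: -len(".tar.gz")])
        target.mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive, "r:gz") as tar:
            if hasattr(tarfile, "data_filter"):
                # Safer extraction on Python versions that support filters
                tar.extractall(target, filter="data")
            else:
                tar.extractall(target)
        extracted.append(target)
    return extracted
```

Calling `extract_all("OPRB")` returns the list of directories that were populated.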
### Split statistics
| Split | text samples | nontext samples | Total samples |
|-------|-------------|-----------------|---------------|
| train | 17,212 | 3,866 | 21,078 |
| test | 6,000 | 3,000 | 9,000 |
| **Total** | **23,212** | **6,866** | **30,078** |
Each sample = 1 occluded image + 1 clean image + 1 occluded annotation + 1 clean annotation.
---
## File Format
Each annotation file is a **tab-separated text file** where every row represents one token (word or character) on the page.
### Columns
| # | Field | Description |
|---|-------|-------------|
| 1 | `token` | The text token (word, character, or symbol) |
| 2–9 | `x1 y1 x2 y2 x3 y3 x4 y4` | Eight coordinates giving the four corners `(x1, y1)` to `(x4, y4)` of the **rotated quadrilateral** enclosing the token (in page pixel space). For axis-aligned boxes, the four corners form a rectangle. |
| 10 | `font_name` | Name of the font used to render the token (e.g., `CMSY10`, `NimbusRomNo9L-Regu`) |
| 11 | `category` | Semantic document-element category. One of: `title`, `author`, `abstract`, `paragraph`, `section`, `equation`, `figure`, `table`, `caption`, `reference`, `footer`, `date`, `email`, `list` |
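
Many layout tools expect axis-aligned boxes rather than quadrilaterals. A minimal sketch of the conversion follows, taking the smallest axis-aligned rectangle that contains all four corners; the function name `quad_to_aabb` is hypothetical. Because it only uses minima and maxima, it does not depend on the corner order, which this card does not specify.

```python
def quad_to_aabb(bbox):
    """Convert [x1, y1, x2, y2, x3, y3, x4, y4] to (x_min, y_min, x_max, y_max)."""
    if len(bbox) != 8:
        raise ValueError("expected 8 coordinates")
    xs = bbox[0::2]  # x1, x2, x3, x4
    ys = bbox[1::2]  # y1, y2, y3, y4
    return min(xs), min(ys), max(xs), max(ys)
```

For a box that is already axis-aligned the result is simply its own corners; for a rotated box it is the enclosing rectangle.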
## File Naming Convention
```
{OcclusionType}__{density}__{batchId}.tar_{arXivId}.gz_{paperName}_{pageIndex}_ori.txt
```
| Part | Example | Meaning |
|------|---------|---------|
| `OcclusionType` | `White_Whitener` | Type of occlusion applied (see table below) |
| `density` | `2p0pct` | Area fraction of the page that is occluded (`2p0pct` = 2.0%) |
| `batchId` | `105` | DocBank source batch identifier |
| `arXivId` | `1804.06143` | arXiv paper ID (the `.gz` that follows it is a literal part of the file name) |
| `paperName` | `MassiveMIMO` | Shortened paper/file name |
| `pageIndex` | `10` | Zero-based page index within the paper |
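
The naming convention above can be parsed with a regular expression. This is a minimal sketch, assuming every file name follows the template exactly and that the density always has the form `<digits>p<digits>pct`; both assumptions rest on the template and the single example shown here. The names `NAME_RE` and `parse_name` are hypothetical.

```python
import re

# Assumed pattern, derived only from the naming template and its one example.
NAME_RE = re.compile(
    r"^(?P<occlusion_type>.+?)__(?P<density>\d+p\d+pct)__(?P<batch_id>[^.]+)\.tar_"
    r"(?P<arxiv_id>.+?)\.gz_(?P<paper_name>.+)_(?P<page_index>\d+)_ori(?:\.\w+)?$"
)


def parse_name(file_name):
    """Split an OPRB file name (with or without extension) into its parts."""
    match = NAME_RE.match(file_name)
    if match is None:
        raise ValueError(f"unrecognised file name: {file_name!r}")
    info = match.groupdict()
    # "2p0pct" -> 2.0 (percent of the page area that is occluded)
    info["density_pct"] = float(info["density"][:-3].replace("p", "."))
    info["page_index"] = int(info["page_index"])
    return info
```

Names that do not fit the assumed pattern raise `ValueError` rather than being silently mis-parsed.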
---
## Occlusion Types
| Type | Category | Description |
|------|----------|-------------|
| `Black_Scribble` | Dark | Handwritten-style black scribble marks overlaid on the page |
| `Black_Ink` | Dark | Solid black ink blotches or strokes |
| `Through_Stamp` | See Through | Stamp-like markings that partially show through (semi-transparent) |
| `Through_Dust` | See Through | Fine dust or speckle texture simulating aged/dirty documents |
| `White_Burnt` | Light | Burnt / bleached white patches obscuring text |
| `White_Whitener` | Light | White correction-fluid (liquid paper) covering text |
| `Sim` | Synthetic Scribble | Simulated scribble artifacts |
| `Mixed` | Combined | A combination of two or more occlusion types on the same page |
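
For per-category evaluation it can help to group files by the occlusion category in the table above. The sketch below encodes that table as a dictionary and counts files per category, assuming the occlusion type is the part of the file name before the first `__` as in the naming convention; `OCCLUSION_GROUPS`, `occlusion_type_of`, and `count_by_group` are hypothetical names.

```python
from collections import Counter

# Occlusion type -> category, copied from the table above.
OCCLUSION_GROUPS = {
    "Black_Scribble": "Dark",
    "Black_Ink": "Dark",
    "Through_Stamp": "See Through",
    "Through_Dust": "See Through",
    "White_Burnt": "Light",
    "White_Whitener": "Light",
    "Sim": "Synthetic Scribble",
    "Mixed": "Combined",
}


def occlusion_type_of(file_name):
    """Return the part of the file name before the first double underscore."""
    return file_name.split("__", 1)[0]


def count_by_group(file_names):
    """Count files per occlusion category; unrecognised types map to 'Unknown'."""
    return Counter(
        OCCLUSION_GROUPS.get(occlusion_type_of(name), "Unknown")
        for name in file_names
    )
```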
---
## How to Load
### Step 1 — Download and extract archives
```python
from huggingface_hub import hf_hub_download
import os
import tarfile


def download_and_extract(repo_id, archive_repo_path, extract_to):
    """Download a tar.gz archive from the Hub and extract it locally."""
    os.makedirs(extract_to, exist_ok=True)
    local_path = hf_hub_download(
        repo_id=repo_id,
        filename=archive_repo_path,
        repo_type="dataset",
    )
    with tarfile.open(local_path, "r:gz") as tar:
        tar.extractall(extract_to)

REPO = "kpurkayastha/OPRB"
# Download train text split (images + annotations)
download_and_extract(REPO, "train/text/img.tar.gz", "data/train/text/img")
download_and_extract(REPO, "train/text/img_gt.tar.gz", "data/train/text/img_gt")
download_and_extract(REPO, "train/text/annots.tar.gz", "data/train/text/annots")
download_and_extract(REPO, "train/text/gt.tar.gz", "data/train/text/gt")
```
### Step 2 — Load a paired sample
```python
from pathlib import Path
from PIL import Image


def load_annotation(filepath):
    """Load an OPRB annotation file into a list of token dicts."""
    tokens = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 11:
                continue  # skip blank or malformed rows
            tokens.append({
                "token": parts[0],
                "bbox": [int(x) for x in parts[1:9]],  # 8 rotated-bbox coords
                "font": parts[9],
                "category": parts[10],
            })
    return tokens

base = "White_Whitener__2p0pct__105.tar_1804.06143.gz_MassiveMIMO_10_ori"
# Occluded inputs
occluded_img = Image.open(f"data/train/text/img/{base}.jpg")
occluded_annot = load_annotation(f"data/train/text/annots/{base}.txt")
# Ground truth targets
clean_img = Image.open(f"data/train/text/img_gt/{base}.jpg")
clean_annot = load_annotation(f"data/train/text/gt/{base}.txt")
```
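Once both annotation files are loaded, a quick diagnostic is to compare how many tokens of each category appear in the clean page versus the occluded page. This is a minimal sketch that works on the token dicts produced by `load_annotation`; the function name `category_delta` is hypothetical. This card does not state how occluded annotations differ from the ground truth, so the result is only a rough indicator and not an evaluation metric.

```python
from collections import Counter


def category_delta(occluded_tokens, clean_tokens):
    """Per-category token count of the clean page minus the occluded page."""
    occluded = Counter(t["category"] for t in occluded_tokens)
    clean = Counter(t["category"] for t in clean_tokens)
    categories = sorted(set(occluded) | set(clean))
    # Keep only the categories whose counts differ
    return {
        c: clean[c] - occluded[c]
        for c in categories
        if clean[c] != occluded[c]
    }
```

With the variables from the snippet above, `category_delta(occluded_annot, clean_annot)` returns a dictionary such as `{"paragraph": 3}` when three more paragraph tokens appear in the ground truth than in the occluded annotation.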
---
## Provenance
OPRB is built on top of **DocBank** ([Li et al., 2020](https://arxiv.org/abs/2006.01038)), a large-scale dataset for document layout analysis containing 500K document pages from arXiv papers with token-level bounding-box and semantic category annotations. Synthetic occlusions were applied programmatically to DocBank pages to produce the OPRB training and test splits.
```bibtex
@dataset{oprb2026,
  title  = {OPRB: Occluded Pages Restoration Benchmark},
  author = {Purkayastha, Kunal},
  year   = {2026},
  url    = {https://huggingface.co/datasets/kpurkayastha/OPRB},
}
@inproceedings{li2020docbank,
  title     = {DocBank: A Benchmark Dataset for Document Layout Analysis},
  author    = {Li, Minghao and Xu, Yiheng and Cui, Lei and Huang, Shaohan and Wei, Furu and Li, Zhe and Zhou, Ming},
  booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
  year      = {2020},
}
```
---
## License
This dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license. You are free to share and adapt the material for any purpose, provided appropriate credit is given.
The underlying DocBank annotations are also available under CC BY 4.0.
Provider:
kpurkayastha



