obswork/arxiv-ocr-benchmark-corpus
Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/obswork/arxiv-ocr-benchmark-corpus
Resource description:
---
license: cc-by-4.0
task_categories:
- image-to-text
pretty_name: ArXiv AI/ML OCR benchmark images
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
---
# arxiv-ai-ml-images
Rasterized page images for the arXiv AI/ML OCR benchmark corpus. Pages are
rendered at 144 DPI, encoded as WebP (quality=85,
method=6), and packed into parquet shards with the Hugging
Face `Image` feature so `datasets.load_dataset` decodes them automatically.
Source PDFs: the curated page-bounded dataset `obswork/arxiv-ai-ml-100k-pages`.
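
As a rough guide to the image sizes this implies: PDF page dimensions are measured in points (72 per inch), so rendering at 144 DPI scales each dimension by 2. The sketch below works that out for a US Letter page. The helper name `page_pixels` and the use of `round` are assumptions for illustration; the actual renderer may round differently, so the `width` and `height` columns remain the authoritative values.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def page_pixels(width_pt: float, height_pt: float, dpi: int = 144) -> tuple[int, int]:
    """Estimate the raster size of a PDF page rendered at the given DPI.

    Hypothetical helper: the rounding rule here is an assumption, not the
    corpus's documented behavior.
    """
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points.
print(page_pixels(612, 792))  # (1224, 1584)
```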
## Stats
- **100,056** pages across **4,866** papers
- Categories:
  - **cs.AI**: 25,015 pages
  - **cs.CV**: 25,011 pages
  - **cs.LG**: 25,029 pages
  - **stat.ML**: 25,001 pages
## Usage
```python
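from datasets import load_dataset

# Stream the train split instead of downloading every shard up front.
ds = load_dataset("obswork/arxiv-ocr-benchmark-corpus", split="train", streaming=True)
# Each row is a dict; the `image` column is decoded to a PIL image.
row = next(iter(ds))
print(row["arxiv_id"], row["page_no"], row["image"].size)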
```
Each row has: `arxiv_id`, `primary_code`, `yymm`, `page_no`, `width`, `height`,
`sha256`, `image`. Rows are sorted by
`(primary_code, yymm, arxiv_id, page_no)` across shards.
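
The row schema and sort order above can be checked with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the sample rows are made up for illustration, the column types (strings for `yymm` and `arxiv_id`, integer `page_no`) are guesses, and the card does not say which bytes `sha256` covers, so `matches_sha256` simply compares a digest of whatever bytes you pass in (for example the encoded WebP bytes, if that is what the column records).

```python
import hashlib
from typing import Iterable, Mapping

# Sort order stated in the dataset card.
SORT_FIELDS = ("primary_code", "yymm", "arxiv_id", "page_no")


def sort_key(row: Mapping) -> tuple:
    """Build the comparison key for one row."""
    return tuple(row[field] for field in SORT_FIELDS)


def is_sorted(rows: Iterable[Mapping]) -> bool:
    """Return True if the rows appear in non-decreasing sort-key order."""
    previous = None
    for row in rows:
        key = sort_key(row)
        if previous is not None and key < previous:
            return False
        previous = key
    return True


def matches_sha256(data: bytes, expected_hex: str) -> bool:
    """Compare the SHA-256 of `data` with a hex digest, ignoring case."""
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()


# Made-up rows (not real corpus entries) with only the sort columns filled in.
rows = [
    {"primary_code": "cs.AI", "yymm": "2301", "arxiv_id": "2301.00001", "page_no": 1},
    {"primary_code": "cs.AI", "yymm": "2301", "arxiv_id": "2301.00001", "page_no": 2},
    {"primary_code": "cs.CV", "yymm": "2212", "arxiv_id": "2212.00002", "page_no": 1},
]
print(is_sorted(rows))
```

To reach the undecoded image bytes in `datasets`, the column can be cast with `ds.cast_column("image", Image(decode=False))`, after which `row["image"]["bytes"]` holds the stored file content.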
Root files: `manifest.json` (flat page index with `relpath` back to the
volume-native layout), `manifest_meta.json` (counts + manifest sha256), and
`image-plan.json` (paper-level render plan).
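
One way to fetch a root file and compute its digest is sketched below. It assumes `huggingface_hub` is installed and that the "manifest sha256" in `manifest_meta.json` is a digest of the whole `manifest.json` file; neither the key names nor the hashing scope are documented here, so inspect the JSON before relying on this. `fetch_root_file` and `file_sha256` are hypothetical helper names.

```python
import hashlib
from pathlib import Path

REPO_ID = "obswork/arxiv-ocr-benchmark-corpus"


def fetch_root_file(filename: str) -> Path:
    """Download one root-level file from the dataset repo (needs network)."""
    # Imported lazily so the hashing helper works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=REPO_ID, filename=filename, repo_type="dataset"))


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so a large manifest is never held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A typical call would be `file_sha256(fetch_root_file("manifest.json"))`, with the result compared by hand against the digest stored in `manifest_meta.json`.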
Provided by:
obswork



