obswork/arxiv-ocr-benchmark-corpus

Name: obswork/arxiv-ocr-benchmark-corpus
Creator: obswork
Published: 2026-04-19 19:02:10
License: not specified on the listing page (the dataset card front matter declares cc-by-4.0)

Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/obswork/arxiv-ocr-benchmark-corpus

Resource description:

---
license: cc-by-4.0
task_categories:
- image-to-text
pretty_name: ArXiv AI/ML OCR benchmark images
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
---

# arxiv-ai-ml-images

Rasterized page images for the arXiv AI/ML OCR benchmark corpus. Pages are rendered at 144 DPI, encoded as WebP (quality=85, method=6), and packed into parquet shards with the Hugging Face `Image` feature so `datasets.load_dataset` decodes them automatically.

Source PDFs: the curated page-bounded dataset `obswork/arxiv-ai-ml-100k-pages`.

## Stats

- **100,056** pages across **4,866** papers
- Categories:
  - **cs.AI**: 25,015 pages
  - **cs.CV**: 25,011 pages
  - **cs.LG**: 25,029 pages
  - **stat.ML**: 25,001 pages

## Usage

```python
from datasets import load_dataset

ds = load_dataset("obswork/arxiv-ocr-benchmark-corpus", split="train", streaming=True)
row = next(iter(ds))
print(row["arxiv_id"], row["page_no"], row["image"].size)
```

Each row has: `arxiv_id`, `primary_code`, `yymm`, `page_no`, `width`, `height`, `sha256`, `image`. Rows are sorted by `(primary_code, yymm, arxiv_id, page_no)` across shards.

Root files:

- `manifest.json`: flat page index with `relpath` back to the volume-native layout
- `manifest_meta.json`: counts and the manifest sha256
- `image-plan.json`: paper-level render plan
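
The row layout and sort order described above can be illustrated with a small, standard-library-only sketch. It is not taken from the dataset's tooling. The sample rows, the helper names (`points_to_pixels`, `is_sorted`, `pages_per_paper`), the string type of `yymm`, and the use of rounding when converting PDF points to pixels are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Sort key stated in the dataset card: (primary_code, yymm, arxiv_id, page_no).
SORT_KEY = itemgetter("primary_code", "yymm", "arxiv_id", "page_no")


def points_to_pixels(points, dpi=144):
    """Convert a PDF length in points (1/72 inch) to pixels at the given DPI.

    Rounding is an assumption; the actual renderer may truncate instead.
    """
    return round(points * dpi / 72)


def is_sorted(rows):
    """Return True if rows follow the documented sort order."""
    keys = [SORT_KEY(r) for r in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def pages_per_paper(rows):
    """Count pages per paper in one pass.

    Relies on the documented sort order, which keeps each paper's
    pages contiguous, so groupby needs no prior sorting.
    """
    return {
        arxiv_id: sum(1 for _ in group)
        for arxiv_id, group in groupby(rows, key=itemgetter("arxiv_id"))
    }


# Hypothetical rows (made-up IDs, image and sha256 fields omitted).
sample_rows = [
    {"primary_code": "cs.AI", "yymm": "2301", "arxiv_id": "2301.00001", "page_no": 1},
    {"primary_code": "cs.AI", "yymm": "2301", "arxiv_id": "2301.00001", "page_no": 2},
    {"primary_code": "cs.CV", "yymm": "2212", "arxiv_id": "2212.00042", "page_no": 1},
]

print(is_sorted(sample_rows))
print(pages_per_paper(sample_rows))
# A US Letter page is 612 x 792 points.
print(points_to_pixels(612), points_to_pixels(792))
```

With the real dataset, the same helpers could be applied to rows streamed from `load_dataset`, assuming the fields have comparable types within each column.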

Provider:

obswork
