
obswork/arxiv-ocr-benchmark-corpus

Source: Hugging Face · Updated 2026-04-19 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/obswork/arxiv-ocr-benchmark-corpus
Resource description:
---
license: cc-by-4.0
task_categories:
- image-to-text
pretty_name: ArXiv AI/ML OCR benchmark images
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
---

# arxiv-ai-ml-images

Rasterized page images for the arXiv AI/ML OCR benchmark corpus. Pages are rendered at 144 DPI, encoded as WebP (quality=85, method=6), and packed into parquet shards with the Hugging Face `Image` feature so `datasets.load_dataset` decodes them automatically.

Source PDFs: the curated page-bounded dataset `obswork/arxiv-ai-ml-100k-pages`.

## Stats

- **100,056** pages across **4,866** papers
- Categories:
  - **cs.AI**: 25,015 pages
  - **cs.CV**: 25,011 pages
  - **cs.LG**: 25,029 pages
  - **stat.ML**: 25,001 pages

## Usage

```python
from datasets import load_dataset

ds = load_dataset("obswork/arxiv-ocr-benchmark-corpus", split="train", streaming=True)
row = next(iter(ds))
print(row["arxiv_id"], row["page_no"], row["image"].size)
```

Each row has: `arxiv_id`, `primary_code`, `yymm`, `page_no`, `width`, `height`, `sha256`, `image`. Rows are sorted by `(primary_code, yymm, arxiv_id, page_no)` across shards.

Root files:

- `manifest.json`: flat page index with `relpath` back to the volume-native layout
- `manifest_meta.json`: counts + manifest sha256
- `image-plan.json`: paper-level render plan
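
Because rows are sorted by `(primary_code, yymm, arxiv_id, page_no)`, all pages of one paper should arrive consecutively, which makes it possible to group a streamed split by paper without loading it all into memory. The following is a minimal sketch of that idea, not part of the dataset's tooling: the helper name `pages_by_paper` is hypothetical, and the sample rows are made-up stand-ins carrying only the metadata fields (no `image`).

```python
from itertools import groupby
from operator import itemgetter


def pages_by_paper(rows):
    """Yield (arxiv_id, [page_no, ...]) for each run of consecutive rows
    that share an arxiv_id.

    Assumes the sort order stated in the card, so that each paper's
    rows are contiguous in the stream.
    """
    for arxiv_id, group in groupby(rows, key=itemgetter("arxiv_id")):
        yield arxiv_id, [row["page_no"] for row in group]


# Synthetic rows standing in for the streamed dataset (all values invented).
sample_rows = [
    {"arxiv_id": "0000.00001", "primary_code": "cs.AI", "page_no": 1},
    {"arxiv_id": "0000.00001", "primary_code": "cs.AI", "page_no": 2},
    {"arxiv_id": "0000.00002", "primary_code": "cs.AI", "page_no": 1},
]

grouped = list(pages_by_paper(sample_rows))
for arxiv_id, page_numbers in grouped:
    print(arxiv_id, page_numbers)
```

With the real data, the same helper could be fed the streaming `ds` object from the usage snippet in place of `sample_rows`. If the stream were ever not contiguous per paper, `groupby` would emit the same `arxiv_id` more than once, so a duplicate-key check is a cheap way to test the sort-order assumption.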
Provider:
obswork