rootsautomation/pubmed-ocr
Source: Hugging Face. Updated 2026-01-22. Indexed 2026-02-07.
Download link:
https://hf-mirror.com/datasets/rootsautomation/pubmed-ocr
Description:
PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page is rendered to an image and annotated with Google Cloud Vision OCR, and the annotations are released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The dataset is intended to support layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines on scientific documents. It contains 209.5K articles, ~1.5M pages, and ~1.3B words (OCR tokens). The data unit is 1 row = 1 PDF page, uniquely identified by {basename, page}. The primary language is English, and each row includes the source article's license information.
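Because each row is a single page keyed by {basename, page}, rebuilding an article means grouping rows by `basename` and ordering them by `page`. The following is a minimal sketch of that grouping, not the dataset's official tooling: the helper name `group_pages_by_article` and the toy rows are hypothetical, and it assumes only the two key fields named in the description (real rows carry further fields, such as the OCR JSON and license information, whose exact names are not given here).

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_pages_by_article(
    rows: Iterable[Dict[str, Any]],
) -> Dict[str, List[Dict[str, Any]]]:
    """Group page-level rows into articles, each sorted by page number.

    Hypothetical helper: relies only on the `basename` and `page` keys.
    Raises ValueError if a (basename, page) pair appears more than once,
    since the description states that pair uniquely identifies a row.
    """
    seen = set()
    articles: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        key = (row["basename"], row["page"])
        if key in seen:
            raise ValueError(f"duplicate page row: {key}")
        seen.add(key)
        articles[row["basename"]].append(row)
    # Sort each article's pages so downstream code sees reading order.
    return {
        name: sorted(pages, key=lambda r: r["page"])
        for name, pages in articles.items()
    }


# Toy rows standing in for real dataset rows (values are made up).
toy_rows = [
    {"basename": "example_article_A", "page": 2},
    {"basename": "example_article_B", "page": 1},
    {"basename": "example_article_A", "page": 1},
]

grouped = group_pages_by_article(toy_rows)
page_order = {name: [r["page"] for r in pages] for name, pages in grouped.items()}
print(page_order)
```

The same function accepts any iterable of dict-like rows, so it could in principle be fed rows streamed from the Hugging Face repository above. With ~1.5M pages, keeping every row in memory may be impractical, so grouping only the keys (or a subset of articles) is likely the safer starting point.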
Provided by:
rootsautomation



