caveman273/theseus_ocr_tiny
Source: Hugging Face · Updated: 2026-04-22 · Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/caveman273/theseus_ocr_tiny
Description:
The Theseus Finnish OCR Tiny dataset is a paragraph-level OCR dataset harvested from Theseus.fi, the Finnish repository of theses from universities of applied sciences. Each record pairs a paragraph crop taken from a thesis PDF with the corresponding text extracted by `pdfplumber`. Images are rendered at 300 DPI with 2 px of padding on each side. The dataset contains 16,618 training examples and is suitable for training OCR and document-understanding models. Fields include the source PDF filename, page number, paragraph crop image, and extracted text. The data was sourced from the Theseus OAI-PMH endpoint and the DSpace bitstream API.
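To make the 300 DPI and 2 px padding figures concrete, here is a minimal sketch of how a paragraph bounding box in PDF points could map to a padded pixel crop box. It is not the dataset author's code: the function name `crop_box_pixels`, its parameters, the rounding direction, and the clamping to the page edges are all assumptions made for illustration. The only facts relied on are that PDF coordinates are in points (1/72 inch) and that `pdfplumber` reports boxes as `(x0, top, x1, bottom)` measured from the top-left of the page.

```python
import math

PDF_POINTS_PER_INCH = 72  # PDF user-space unit: 1 pt = 1/72 inch


def crop_box_pixels(bbox_pt, page_size_pt, dpi=300, pad_px=2):
    """Map a bounding box in PDF points to a padded pixel crop box.

    Hypothetical helper: rounding outward and clamping to the page
    are assumptions, not documented behaviour of the dataset pipeline.
    """
    def to_px(value_pt):
        # Multiply before dividing to keep exact values exact.
        return value_pt * dpi / PDF_POINTS_PER_INCH

    x0, top, x1, bottom = bbox_pt
    page_w_px = math.ceil(to_px(page_size_pt[0]))
    page_h_px = math.ceil(to_px(page_size_pt[1]))

    # Round outward so no glyph pixels are cut, then add the padding.
    left = max(0, math.floor(to_px(x0)) - pad_px)
    upper = max(0, math.floor(to_px(top)) - pad_px)
    right = min(page_w_px, math.ceil(to_px(x1)) + pad_px)
    lower = min(page_h_px, math.ceil(to_px(bottom)) + pad_px)
    return left, upper, right, lower


# A 1 in x 0.5 in paragraph placed 1 in from the top-left of an A4 page.
print(crop_box_pixels((72, 72, 144, 108), (595, 842)))  # (298, 298, 602, 452)
```

The returned tuple follows the `(left, upper, right, lower)` convention used by common image libraries, so a 72 pt wide paragraph becomes a 304 px wide crop: 300 px of content plus 2 px of padding on each side.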
Provider:
caveman273



