hltcoe/OJ4OCRMT

Name: hltcoe/OJ4OCRMT
Creator: hltcoe
Published: 2025-09-30 18:33:52
License: not specified

Source: Hugging Face. Updated: 2025-09-30. Indexed: 2025-07-05.

Download link:

https://hf-mirror.com/datasets/hltcoe/OJ4OCRMT

Description:

OJ4OCRMT is a large multilingual dataset for evaluating OCR-MT (optical character recognition and machine translation). It contains the source PDF files, rendered images at three resolutions, and text files in two forms: raw extractions and sentence-segmented files.

The dataset is divided into two partitions, dev and test, each with more than 1,000 pages available as PDF, PNG, and text in 23 EU languages. The dev partition contains 1,656 pages from 2022: 1,412 regular pages, 193 pages with tables, and 51 pages with figures. The test partition contains 1,119 pages from 2023: 979 regular pages, 98 pages with tables, and 42 pages with figures. All 2,775 pages are available in all 23 languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, and Swedish.

Each page is identified by a document identifier and a page number. For every page the dataset provides the raw text, the normalized text, PNG images at the different resolutions, and the PDF file.
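The per-partition page counts in the description can be cross-checked with a short script. The sketch below only encodes the numbers and language names quoted above; the dictionary layout and the helper names (`category_sum`, `is_consistent`, `page_language_pairs`) are illustrative assumptions and are not part of the dataset or of any library. The page-language pair count is a derived figure, assuming every page exists in every listed language as the description states.

```python
# Page counts per partition, copied from the description above.
# The structure and key names here are hypothetical, chosen for illustration.
PARTITIONS = {
    "dev": {"year": 2022, "total": 1656, "regular": 1412, "tables": 193, "figures": 51},
    "test": {"year": 2023, "total": 1119, "regular": 979, "tables": 98, "figures": 42},
}

# The languages listed in the description.
LANGUAGES = [
    "Bulgarian", "Croatian", "Czech", "Danish", "Dutch", "English",
    "Estonian", "Finnish", "French", "German", "Greek", "Hungarian",
    "Italian", "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]


def category_sum(partition: str) -> int:
    """Add up the regular, table, and figure page counts of one partition."""
    counts = PARTITIONS[partition]
    return counts["regular"] + counts["tables"] + counts["figures"]


def is_consistent(partition: str) -> bool:
    """Return True if the category counts add up to the stated partition total."""
    return category_sum(partition) == PARTITIONS[partition]["total"]


def page_language_pairs(partition: str) -> int:
    """Derived figure: pages times languages, assuming full language coverage."""
    return PARTITIONS[partition]["total"] * len(LANGUAGES)


if __name__ == "__main__":
    print("languages listed:", len(LANGUAGES))
    for name in PARTITIONS:
        print(name, "consistent:", is_consistent(name),
              "page-language pairs:", page_language_pairs(name))
```

A check like this is useful after downloading the data: counting the files actually present per partition and language and comparing against these figures is a quick way to spot an incomplete download.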

Provider:

hltcoe
