AtharvImmverse/OCR-Synthetic-Multilingual-v1

Name: AtharvImmverse/OCR-Synthetic-Multilingual-v1
Creator: AtharvImmverse
Published: 2026-04-22 06:12:13
License: not specified

Source: Hugging Face. Updated 2026-04-22. Indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/AtharvImmverse/OCR-Synthetic-Multilingual-v1

Description:

OCR Synthetic Multilingual v1 is a large-scale synthetically generated OCR training dataset for multilingual text detection and recognition. The data was produced using a heavily modified and extended version of SynthDoG (Synthetic Document Generator), supporting English, Japanese, Korean, Russian, Simplified Chinese, and Traditional Chinese. The dataset is in HDF5 format, containing images, annotations, dimensions, labels, qualities, and sample IDs. Annotations include word, line, and paragraph-level bounding boxes along with reading-order graphs. The dataset was used to train Nemotron OCR v2 and includes detailed loading examples and sample counts per language.
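To illustrate what a reading-order graph adds on top of bounding boxes, here is a minimal sketch that turns such a graph into a linear reading sequence. This page does not say how the dataset encodes the graph, so the edge-list representation, the function name `linearize_reading_order`, and the sample data are all hypothetical assumptions for illustration, not the dataset's actual schema.

```python
from collections import defaultdict, deque


def linearize_reading_order(num_nodes, edges):
    """Return node indices in an order consistent with the reading-order edges.

    ASSUMPTION: `edges` is a hypothetical list of (before, after) index pairs.
    The real dataset may encode its reading-order graph differently.
    """
    successors = defaultdict(list)
    indegree = [0] * num_nodes
    for before, after in edges:
        successors[before].append(after)
        indegree[after] += 1

    # Kahn's algorithm: repeatedly emit a node with no unread predecessors.
    ready = deque(i for i in range(num_nodes) if indegree[i] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != num_nodes:
        raise ValueError("reading-order graph contains a cycle")
    return order


# Hypothetical sample: four text lines stored out of reading order.
lines = ["second line", "third line", "first line", "fourth line"]
edges = [(2, 0), (0, 1), (1, 3)]
ordered_text = [lines[i] for i in linearize_reading_order(len(lines), edges)]
print(ordered_text)
```

A topological sort is used here because a reading-order graph may branch, for example in multi-column layouts, so sorting boxes by coordinates alone would not necessarily recover the intended sequence.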

Provider:

AtharvImmverse
