YuanHo/OCR-Synthetic-Multilingual-v1

Name: YuanHo/OCR-Synthetic-Multilingual-v1
Creator: YuanHo
Published: 2026-04-22 03:28:43
License: Not specified

Source: Hugging Face. Updated 2026-04-22. Indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/YuanHo/OCR-Synthetic-Multilingual-v1

Description:

OCR-Synthetic-Multilingual-v1 is a large-scale, synthetically generated OCR training dataset for multilingual text detection and recognition. The data was produced with a heavily modified and extended version of SynthDoG (Synthetic Document Generator) and covers English, Japanese, Korean, Russian, and Chinese (Simplified and Traditional). The dataset is stored in HDF5 format and contains images, annotations, dimensions, labels, qualities, and sample IDs. The annotations describe word, line, and paragraph bounding boxes along with reading-order graphs. The dataset was used to train the Nemotron OCR v2 model, and its documentation lists sample counts and file distributions for each language.
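To illustrate how line-level boxes and a reading-order graph could fit together, here is a minimal Python sketch. It does not reflect the dataset's actual HDF5 schema, which this page does not specify: the `Box` and `Line` classes, the `(before, after)` edge encoding, and the `reading_order` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixels (hypothetical layout)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Line:
    """One annotated text line: its transcription and its box."""
    text: str
    box: Box


def reading_order(lines, edges):
    """Return line indices sorted by a reading-order graph.

    `edges` is a list of (before, after) index pairs. This encoding is
    an assumption, not the dataset's documented format.
    """
    # Start with every line as a node that has no predecessors.
    sorter = TopologicalSorter({i: set() for i in range(len(lines))})
    for before, after in edges:
        # `after` may only be read once `before` has been read.
        sorter.add(after, before)
    return list(sorter.static_order())


# Lines stored out of order, as a detector might emit them.
lines = [
    Line("world", Box(10, 40, 90, 60)),
    Line("Hello", Box(10, 10, 90, 30)),
    Line("again", Box(10, 70, 90, 90)),
]
edges = [(1, 0), (0, 2)]  # "Hello" -> "world" -> "again"

order = reading_order(lines, edges)
text = " ".join(lines[i].text for i in order)
```

A topological sort is used here because a reading-order graph only constrains which element comes before which. For a simple chain like the one above, the order is fully determined; for graphs with independent branches, several orders may be valid.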

Provider:

YuanHo
