Name: Nishant2414/OCR-Synthetic-Multilingual-v1
Creator: Nishant2414
Published: 2026-04-19 13:25:03
License: No description provided

Download link:

https://hf-mirror.com/datasets/Nishant2414/OCR-Synthetic-Multilingual-v1

Description:

---
license: cc-by-4.0
task_categories:
- object-detection
- image-to-text
language:
- en
- ja
- ko
- ru
- zh
tags:
- ocr
- text-detection
- text-recognition
- synthetic-data
- synthdog
- hdf5
- nvidia
- nemotron
pretty_name: OCR Synthetic Multilingual v1
size_categories:
- 10M<n<100M
---

# OCR-Synthetic-Multilingual-v1

## Overview

Large-scale synthetically generated OCR training dataset for multilingual text detection and recognition. The data was produced using a heavily modified and extended version of [SynthDoG](https://github.com/clovaai/donut/tree/master/synthdog) (Synthetic Document Generator), originally introduced in the [Donut](https://github.com/clovaai/donut) project by Kim et al.

This dataset was used to train [**Nemotron OCR v2**](https://huggingface.co/nvidia/nemotron-ocr-v2), a state-of-the-art multilingual OCR model that is part of the [NVIDIA NeMo Retriever](https://developer.nvidia.com/nemo-retriever/) collection.

## Languages

| Subfolder | Language              | Total Samples  | Train          | Test          | Validation    |
|-----------|-----------------------|----------------|----------------|---------------|---------------|
| `en`      | English               | 1,825,089      | 1,460,304 (63) | 183,629 (63)  | 181,156 (63)  |
| `ja`      | Japanese              | 1,889,137      | 1,502,712 (67) | 193,779 (67)  | 192,646 (67)  |
| `ko`      | Korean                | 2,269,540      | 1,814,994 (78) | 227,091 (78)  | 227,455 (78)  |
| `ru`      | Russian               | 1,724,733      | 1,380,404 (59) | 171,678 (59)  | 172,651 (59)  |
| `zh_hans` | Chinese (Simplified)  | 2,335,343      | 1,914,948 (83) | 210,143 (73)  | 210,252 (73)  |
| `zh_hant` | Chinese (Traditional) | 2,214,304      | 1,772,280 (77) | 221,867 (77)  | 220,157 (77)  |
| **Total** |                       | **12,258,146** | **9,845,642**  | **1,208,187** | **1,204,317** |

> Numbers in parentheses are the number of `.h5` files per split.

## Related Model

This dataset was created to train the detection, recognition, and relational components of [**Nemotron OCR v2**](https://huggingface.co/nvidia/nemotron-ocr-v2). See the model card for architecture details, evaluation results, and usage instructions.

## Directory Layout

```
OCR-Synthetic-Multilingual-v1/
├── en/
│   ├── train/
│   │   ├── train_000.h5
│   │   ├── train_001.h5
│   │   └── ...
│   ├── test/
│   │   └── ...
│   └── validation/
│       └── ...
├── ja/
│   └── ...
├── ko/
│   └── ...
├── ru/
│   └── ...
├── zh_hans/
│   └── ...
└── zh_hant/
    └── ...
```

## Format: HDF5

Each `.h5` file contains the following datasets (HDF5 terminology):

| Key           | Type                           | Description                                                |
|---------------|--------------------------------|------------------------------------------------------------|
| `images`      | object (variable-length bytes) | JPEG-encoded image bytes, one entry per sample             |
| `annotations` | object (variable-length str)   | JSON string per sample containing bounding-box annotations |
| `dimensions`  | int array `[H, W]`             | Original image dimensions                                  |
| `labels`      | object (string)                | Full-page text label                                       |
| `qualities`   | int                            | JPEG quality used during encoding (typically 100)          |
| `sample_ids`  | int                            | Unique sample identifier                                   |

## Annotation JSON Schema

Each entry in `annotations` is a JSON object:

```json
{
  "word_bboxes": [
    {
      "text": "example word or phrase",
      "bbox": [x, y, w, h],
      "quad": [[x0,y0], [x1,y1], [x2,y2], [x3,y3]]
    }
  ],
  "line_bboxes": [
    {
      "text": "full line of text",
      "bbox": [x, y, w, h],
      "quad": [[x0,y0], [x1,y1], [x2,y2], [x3,y3]],
      "para_idx": 0,
      "line_idx": 0,
      "word_indices": [0, 1, 2]
    }
  ],
  "para_bboxes": [...],
  "relation_graph": [
    [[0], [1], [2]],
    [[3], [4]]
  ]
}
```

### Bounding Box Levels

- **`word_bboxes`**: One entry per word/phrase rendered as a single unit. Each contains `text`, an axis-aligned `bbox [x, y, w, h]`, and a 4-point `quad`.
- **`line_bboxes`**: One entry per text line. Includes all `word_bboxes` fields plus `para_idx` (paragraph index), `line_idx` (line index within the paragraph), and `word_indices` (indices into `word_bboxes` that compose this line).
- **`para_bboxes`**: One entry per paragraph bounding box.
- **`relation_graph`**: Nested list encoding reading order: `relation_graph[para][sentence]` gives a list of word/line indices belonging to that sentence within the paragraph.

### Quad Vertex Convention

Quads are 4-point polygons stored as `[[x0,y0], [x1,y1], [x2,y2], [x3,y3]]` in clockwise order:

```
v0 -------- v1
|            |
v3 -------- v2
```

## Loading Example

```python
import h5py, io, json
from PIL import Image

with h5py.File("en/train/train_000.h5", "r") as f:
    img_bytes = f["images"][0]
    image = Image.open(io.BytesIO(img_bytes.tobytes())).convert("RGB")
    annotation = json.loads(f["annotations"][0])
    for line in annotation["line_bboxes"]:
        print(line["text"], line["quad"])
```

---

## Per-Language Details

### English (`en`)

| Property      | Value                        |
|---------------|------------------------------|
| Language      | English (`en`)               |
| Total Samples | 1,825,089                    |
| Train         | 1,460,304 samples (63 files) |
| Test          | 183,629 samples (63 files)   |
| Validation    | 181,156 samples (63 files)   |

### Japanese (`ja`)

| Property      | Value                        |
|---------------|------------------------------|
| Language      | Japanese (`ja`)              |
| Total Samples | 1,889,137                    |
| Train         | 1,502,712 samples (67 files) |
| Test          | 193,779 samples (67 files)   |
| Validation    | 192,646 samples (67 files)   |

### Korean (`ko`)

| Property      | Value                        |
|---------------|------------------------------|
| Language      | Korean (`ko`)                |
| Total Samples | 2,269,540                    |
| Train         | 1,814,994 samples (78 files) |
| Test          | 227,091 samples (78 files)   |
| Validation    | 227,455 samples (78 files)   |

### Russian (`ru`)

| Property      | Value                        |
|---------------|------------------------------|
| Language      | Russian (`ru`)               |
| Total Samples | 1,724,733                    |
| Train         | 1,380,404 samples (59 files) |
| Test          | 171,678 samples (59 files)   |
| Validation    | 172,651 samples (59 files)   |

### Chinese Simplified (`zh_hans`)

| Property      | Value                            |
|---------------|----------------------------------|
| Language      | Chinese (Simplified) (`zh_hans`) |
| Total Samples | 2,335,343                        |
| Train         | 1,914,948 samples (83 files)     |
| Test          | 210,143 samples (73 files)       |
| Validation    | 210,252 samples (73 files)       |

### Chinese Traditional (`zh_hant`)

| Property      | Value                             |
|---------------|-----------------------------------|
| Language      | Chinese (Traditional) (`zh_hant`) |
| Total Samples | 2,214,304                         |
| Train         | 1,772,280 samples (77 files)      |
| Test          | 221,867 samples (77 files)        |
| Validation    | 220,157 samples (77 files)        |

## Acknowledgements

The synthetic data generation pipeline is based on [SynthDoG](https://github.com/clovaai/donut/tree/master/synthdog) from the Donut project, with substantial modifications to support additional languages, custom rendering effects, structured bounding-box annotations (word/line/paragraph levels with reading-order graphs), and HDF5 output.

## Citation

If you use this dataset, please cite:

```bibtex
@misc{chesler2026ocr_synthetic_multilingual,
  title     = {{OCR-Synthetic-Multilingual-v1}},
  author    = {Chesler, Ryan},
  year      = {2026},
  publisher = {NVIDIA},
  url       = {https://huggingface.co/datasets/nvidia/OCR-Synthetic-Multilingual-v1},
  note      = {Synthetically generated multilingual OCR dataset built on a heavily modified SynthDoG pipeline}
}
```
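
## Annotation Geometry (Illustrative)

The `bbox`, `quad`, and `word_indices` fields described in the annotation schema can be exercised without opening an `.h5` file. The following is a minimal sketch, not part of the dataset tooling. It assumes that `bbox` is `[x, y, w, h]` with `(x, y)` as the top-left corner, that coordinates are in pixels with y increasing downward (the card does not state this explicitly), and that `quad` follows the clockwise v0..v3 order from the vertex diagram. All helper names and the sample annotation are hypothetical.

```python
def bbox_to_quad(bbox):
    """Convert [x, y, w, h] into four corners, clockwise from the top-left (v0)."""
    x, y, w, h = bbox
    return [[x, y], [x + w, y], [x + w, y + h], [x, y + h]]


def points_to_bbox(points):
    """Smallest axis-aligned [x, y, w, h] box enclosing a list of [x, y] points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


def is_clockwise(quad):
    """Shoelace test; a positive sum means clockwise ASSUMING y grows downward."""
    total = 0
    for (x0, y0), (x1, y1) in zip(quad, quad[1:] + quad[:1]):
        total += x0 * y1 - x1 * y0
    return total > 0


def words_for_line(annotation, line):
    """Resolve a line's `word_indices` into the matching `word_bboxes` entries."""
    return [annotation["word_bboxes"][i] for i in line["word_indices"]]


# Hypothetical annotation with two words on one line (illustration only).
words = [
    {"text": "hello", "bbox": [10, 20, 30, 12]},
    {"text": "world", "bbox": [45, 20, 32, 12]},
]
for w in words:
    w["quad"] = bbox_to_quad(w["bbox"])

annotation = {
    "word_bboxes": words,
    "line_bboxes": [
        {
            "text": "hello world",
            "bbox": [10, 20, 67, 12],
            "quad": bbox_to_quad([10, 20, 67, 12]),
            "para_idx": 0,
            "line_idx": 0,
            "word_indices": [0, 1],
        }
    ],
}

line = annotation["line_bboxes"][0]
line_words = words_for_line(annotation, line)
line_texts = [w["text"] for w in line_words]

# The line box should enclose every corner of its words' quads.
enclosing = points_to_bbox([p for w in line_words for p in w["quad"]])
```

In this hypothetical sample the enclosing box of the two word quads matches the line's `bbox`. Whether that holds exactly for real samples depends on the rendering effects applied, so treat it as a sanity check rather than a guarantee.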

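## Reading Order (Illustrative)

The `relation_graph` field is described as a nested list indexed as `relation_graph[para][sentence]`. A minimal sketch of flattening it into reading order follows. The card says the innermost values are "word/line indices" without specifying which list they index, so this sketch assumes `line_bboxes` by default and leaves the choice as a `level` parameter. The function name `reading_order` and the sample data are hypothetical.

```python
def reading_order(annotation, level="line_bboxes"):
    """Return texts grouped as paragraphs -> sentences, following relation_graph.

    ASSUMPTION: innermost integers index into annotation[level]; the dataset
    card does not confirm whether they refer to words or lines.
    """
    paragraphs = []
    for paragraph in annotation["relation_graph"]:
        sentences = []
        for sentence in paragraph:
            sentences.append([annotation[level][idx]["text"] for idx in sentence])
        paragraphs.append(sentences)
    return paragraphs


def flatten(paragraphs):
    """Collapse the paragraph/sentence nesting into one ordered list of texts."""
    return [text for para in paragraphs for sent in para for text in sent]


# Hypothetical sample mirroring the relation_graph shape shown in the schema.
sample = {
    "line_bboxes": [{"text": t} for t in ["A", "B", "C", "D", "E"]],
    "relation_graph": [[[0], [1], [2]], [[3], [4]]],
}

grouped = reading_order(sample)
ordered = flatten(grouped)
```

Keeping the nested form around (`grouped`) preserves paragraph and sentence boundaries, which may matter when training a component that predicts relations rather than plain text order.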
