
Nishant2414/OCR-Synthetic-Multilingual-v1

Hugging Face | Updated: 2026-04-19 | Indexed: 2026-04-26
Download link: https://hf-mirror.com/datasets/Nishant2414/OCR-Synthetic-Multilingual-v1
Resource description:
---
license: cc-by-4.0
task_categories:
- object-detection
- image-to-text
language:
- en
- ja
- ko
- ru
- zh
tags:
- ocr
- text-detection
- text-recognition
- synthetic-data
- synthdog
- hdf5
- nvidia
- nemotron
pretty_name: OCR Synthetic Multilingual v1
size_categories:
- 10M<n<100M
---

# OCR-Synthetic-Multilingual-v1

## Overview

Large-scale synthetically generated OCR training dataset for multilingual text detection and recognition. The data was produced using a heavily modified and extended version of [SynthDoG](https://github.com/clovaai/donut/tree/master/synthdog) (Synthetic Document Generator), originally introduced in the [Donut](https://github.com/clovaai/donut) project by Kim et al.

This dataset was used to train [**Nemotron OCR v2**](https://huggingface.co/nvidia/nemotron-ocr-v2), a state-of-the-art multilingual OCR model that is part of the [NVIDIA NeMo Retriever](https://developer.nvidia.com/nemo-retriever/) collection.

## Languages

| Subfolder | Language              | Total Samples  | Train          | Test          | Validation    |
|-----------|-----------------------|----------------|----------------|---------------|---------------|
| `en`      | English               | 1,825,089      | 1,460,304 (63) | 183,629 (63)  | 181,156 (63)  |
| `ja`      | Japanese              | 1,889,137      | 1,502,712 (67) | 193,779 (67)  | 192,646 (67)  |
| `ko`      | Korean                | 2,269,540      | 1,814,994 (78) | 227,091 (78)  | 227,455 (78)  |
| `ru`      | Russian               | 1,724,733      | 1,380,404 (59) | 171,678 (59)  | 172,651 (59)  |
| `zh_hans` | Chinese (Simplified)  | 2,335,343      | 1,914,948 (83) | 210,143 (73)  | 210,252 (73)  |
| `zh_hant` | Chinese (Traditional) | 2,214,304      | 1,772,280 (77) | 221,867 (77)  | 220,157 (77)  |
| **Total** |                       | **12,258,146** | **9,845,642**  | **1,208,187** | **1,204,317** |

> Numbers in parentheses are the number of `.h5` files per split.

## Related Model

This dataset was created to train the detection, recognition, and relational components of [**Nemotron OCR v2**](https://huggingface.co/nvidia/nemotron-ocr-v2). See the model card for architecture details, evaluation results, and usage instructions.

## Directory Layout

```
OCR-Synthetic-Multilingual-v1/
├── en/
│   ├── train/
│   │   ├── train_000.h5
│   │   ├── train_001.h5
│   │   └── ...
│   ├── test/
│   │   └── ...
│   └── validation/
│       └── ...
├── ja/
│   └── ...
├── ko/
│   └── ...
├── ru/
│   └── ...
├── zh_hans/
│   └── ...
└── zh_hant/
    └── ...
```

## Format: HDF5

Each `.h5` file contains the following datasets (HDF5 terminology):

| Key           | Type                           | Description                                                |
|---------------|--------------------------------|------------------------------------------------------------|
| `images`      | object (variable-length bytes) | JPEG-encoded image bytes, one entry per sample             |
| `annotations` | object (variable-length str)   | JSON string per sample containing bounding-box annotations |
| `dimensions`  | int array `[H, W]`             | Original image dimensions                                  |
| `labels`      | object (string)                | Full-page text label                                       |
| `qualities`   | int                            | JPEG quality used during encoding (typically 100)          |
| `sample_ids`  | int                            | Unique sample identifier                                   |

## Annotation JSON Schema

Each entry in `annotations` is a JSON object:

```json
{
  "word_bboxes": [
    {
      "text": "example word or phrase",
      "bbox": [x, y, w, h],
      "quad": [[x0,y0], [x1,y1], [x2,y2], [x3,y3]]
    }
  ],
  "line_bboxes": [
    {
      "text": "full line of text",
      "bbox": [x, y, w, h],
      "quad": [[x0,y0], [x1,y1], [x2,y2], [x3,y3]],
      "para_idx": 0,
      "line_idx": 0,
      "word_indices": [0, 1, 2]
    }
  ],
  "para_bboxes": [...],
  "relation_graph": [
    [[0], [1], [2]],
    [[3], [4]]
  ]
}
```

### Bounding Box Levels

- **`word_bboxes`**: One entry per word/phrase rendered as a single unit. Each contains `text`, an axis-aligned `bbox [x, y, w, h]`, and a 4-point `quad`.
- **`line_bboxes`**: One entry per text line. Includes all `word_bboxes` fields plus `para_idx` (paragraph index), `line_idx` (line index within the paragraph), and `word_indices` (indices into `word_bboxes` that compose this line).
- **`para_bboxes`**: One entry per paragraph bounding box.
- **`relation_graph`**: Nested list encoding reading order: `relation_graph[para][sentence]` gives a list of word/line indices belonging to that sentence within the paragraph.

### Quad Vertex Convention

Quads are 4-point polygons stored as `[[x0,y0], [x1,y1], [x2,y2], [x3,y3]]` in clockwise order:

```
v0 -------- v1
|            |
v3 -------- v2
```

## Loading Example

```python
import io
import json

import h5py
from PIL import Image

with h5py.File("en/train/train_000.h5", "r") as f:
    # JPEG-encoded bytes of the first sample, decoded into an RGB image
    img_bytes = f["images"][0]
    image = Image.open(io.BytesIO(img_bytes.tobytes())).convert("RGB")
    # Annotation JSON string of the same sample
    annotation = json.loads(f["annotations"][0])

# Print the text and quad of every line-level box
for line in annotation["line_bboxes"]:
    print(line["text"], line["quad"])
```

---

## Per-Language Details

### English (`en`)

| Property      | Value                        |
|---------------|------------------------------|
| Language      | English (`en`)               |
| Total Samples | 1,825,089                    |
| Train         | 1,460,304 samples (63 files) |
| Test          | 183,629 samples (63 files)   |
| Validation    | 181,156 samples (63 files)   |

### Japanese (`ja`)

| Property      | Value                        |
|---------------|------------------------------|
| Language      | Japanese (`ja`)              |
| Total Samples | 1,889,137                    |
| Train         | 1,502,712 samples (67 files) |
| Test          | 193,779 samples (67 files)   |
| Validation    | 192,646 samples (67 files)   |

### Korean (`ko`)

| Property      | Value                        |
|---------------|------------------------------|
| Language      | Korean (`ko`)                |
| Total Samples | 2,269,540                    |
| Train         | 1,814,994 samples (78 files) |
| Test          | 227,091 samples (78 files)   |
| Validation    | 227,455 samples (78 files)   |

### Russian (`ru`)

| Property      | Value                        |
|---------------|------------------------------|
| Language      | Russian (`ru`)               |
| Total Samples | 1,724,733                    |
| Train         | 1,380,404 samples (59 files) |
| Test          | 171,678 samples (59 files)   |
| Validation    | 172,651 samples (59 files)   |

### Chinese Simplified (`zh_hans`)

| Property      | Value                            |
|---------------|----------------------------------|
| Language      | Chinese (Simplified) (`zh_hans`) |
| Total Samples | 2,335,343                        |
| Train         | 1,914,948 samples (83 files)     |
| Test          | 210,143 samples (73 files)       |
| Validation    | 210,252 samples (73 files)       |

### Chinese Traditional (`zh_hant`)

| Property      | Value                             |
|---------------|-----------------------------------|
| Language      | Chinese (Traditional) (`zh_hant`) |
| Total Samples | 2,214,304                         |
| Train         | 1,772,280 samples (77 files)      |
| Test          | 221,867 samples (77 files)        |
| Validation    | 220,157 samples (77 files)        |

## Acknowledgements

The synthetic data generation pipeline is based on [SynthDoG](https://github.com/clovaai/donut/tree/master/synthdog) from the Donut project, with substantial modifications to support additional languages, custom rendering effects, structured bounding-box annotations (word/line/paragraph levels with reading-order graphs), and HDF5 output.

## Citation

If you use this dataset, please cite:

```bibtex
@misc{chesler2026ocr_synthetic_multilingual,
  title     = {{OCR-Synthetic-Multilingual-v1}},
  author    = {Chesler, Ryan},
  year      = {2026},
  publisher = {NVIDIA},
  url       = {https://huggingface.co/datasets/nvidia/OCR-Synthetic-Multilingual-v1},
  note      = {Synthetically generated multilingual OCR dataset built on a heavily modified SynthDoG pipeline}
}
```
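The annotation fields described in the schema section (`word_indices`, `para_idx`, `line_idx`, `quad`) can be navigated with a few small helpers. The sketch below is illustrative only: the helper names (`make_box`, `quad_to_bbox`, `lines_by_paragraph`, `words_of_line`) are hypothetical, the toy annotation is hand-written rather than taken from the dataset, and it assumes that `bbox [x, y, w, h]` uses the top-left corner as `(x, y)`, which the card does not state explicitly.

```python
import json


def make_box(text, x, y, w, h):
    """Hypothetical helper: build a toy entry with the documented fields.

    Assumes (x, y) is the top-left corner; the quad runs clockwise from v0.
    """
    quad = [[x, y], [x + w, y], [x + w, y + h], [x, y + h]]
    return {"text": text, "bbox": [x, y, w, h], "quad": quad}


def quad_to_bbox(quad):
    """Return the axis-aligned [x, y, w, h] box that encloses a 4-point quad."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


def lines_by_paragraph(annotation):
    """Group line texts by para_idx, sorted by line_idx inside each paragraph."""
    paragraphs = {}
    for line in annotation["line_bboxes"]:
        paragraphs.setdefault(line["para_idx"], []).append(line)
    return {
        para: [ln["text"] for ln in sorted(lines, key=lambda ln: ln["line_idx"])]
        for para, lines in sorted(paragraphs.items())
    }


def words_of_line(annotation, line):
    """Resolve a line's word_indices into the texts of the referenced words."""
    return [annotation["word_bboxes"][i]["text"] for i in line["word_indices"]]


# Toy annotation following the documented schema (not real dataset content).
# The lines are deliberately listed out of order to exercise the sorting.
raw = json.dumps({
    "word_bboxes": [
        make_box("hello", 10, 10, 40, 20),
        make_box("world", 60, 10, 50, 20),
        make_box("second", 10, 40, 55, 20),
        make_box("line", 75, 40, 35, 20),
        make_box("new", 10, 90, 30, 20),
        make_box("paragraph", 50, 90, 80, 20),
    ],
    "line_bboxes": [
        dict(make_box("second line", 10, 40, 100, 20),
             para_idx=0, line_idx=1, word_indices=[2, 3]),
        dict(make_box("hello world", 10, 10, 100, 20),
             para_idx=0, line_idx=0, word_indices=[0, 1]),
        dict(make_box("new paragraph", 10, 90, 120, 20),
             para_idx=1, line_idx=0, word_indices=[4, 5]),
    ],
})
annotation = json.loads(raw)

for para, texts in lines_by_paragraph(annotation).items():
    print(para, texts)
```

With a real sample, `annotation` would come from `json.loads(f["annotations"][i])` as in the loading example. `quad_to_bbox` can serve as a sanity check against the stored `bbox`, though whether the stored box is always the tight enclosure of the quad is an assumption to verify on actual data.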
Provider: Nishant2414