Nishant2414/OCR-Synthetic-Multilingual-v1
Source: Hugging Face | Updated 2026-04-19 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Nishant2414/OCR-Synthetic-Multilingual-v1
Resource description:
---
license: cc-by-4.0
task_categories:
- object-detection
- image-to-text
language:
- en
- ja
- ko
- ru
- zh
tags:
- ocr
- text-detection
- text-recognition
- synthetic-data
- synthdog
- hdf5
- nvidia
- nemotron
pretty_name: OCR Synthetic Multilingual v1
size_categories:
- 10M<n<100M
---
# OCR-Synthetic-Multilingual-v1
## Overview
This is a large-scale, synthetically generated OCR training dataset for multilingual text detection and recognition. The data was produced with a heavily modified and extended version of [SynthDoG](https://github.com/clovaai/donut/tree/master/synthdog) (Synthetic Document Generator), originally introduced in the [Donut](https://github.com/clovaai/donut) project by Kim et al.
This dataset was used to train [**Nemotron OCR v2**](https://huggingface.co/nvidia/nemotron-ocr-v2), a state-of-the-art multilingual OCR model that is part of the [NVIDIA NeMo Retriever](https://developer.nvidia.com/nemo-retriever/) collection.
## Languages
| Subfolder | Language | Total Samples | Train | Test | Validation |
|-------------|------------------------|---------------|----------------|----------------|----------------|
| `en` | English | 1,825,089 | 1,460,304 (63) | 183,629 (63) | 181,156 (63) |
| `ja` | Japanese | 1,889,137 | 1,502,712 (67) | 193,779 (67) | 192,646 (67) |
| `ko` | Korean | 2,269,540 | 1,814,994 (78) | 227,091 (78) | 227,455 (78) |
| `ru` | Russian | 1,724,733 | 1,380,404 (59) | 171,678 (59) | 172,651 (59) |
| `zh_hans` | Chinese (Simplified) | 2,335,343 | 1,914,948 (83) | 210,143 (73) | 210,252 (73) |
| `zh_hant` | Chinese (Traditional) | 2,214,304 | 1,772,280 (77) | 221,867 (77) | 220,157 (77) |
| **Total** | | **12,258,146** | **9,845,642** | **1,208,187** | **1,204,317** |
> Numbers in parentheses are the number of `.h5` files per split.
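As a quick sanity check on the table above, the per-language split counts can be summed and compared with the stated totals. The sketch below also derives each language's train fraction. The counts are copied from the table; the names `SPLIT_COUNTS` and `split_fractions` are illustrative and not part of the dataset.

```python
# Sample counts copied from the table: (train, test, validation).
SPLIT_COUNTS = {
    "en": (1_460_304, 183_629, 181_156),
    "ja": (1_502_712, 193_779, 192_646),
    "ko": (1_814_994, 227_091, 227_455),
    "ru": (1_380_404, 171_678, 172_651),
    "zh_hans": (1_914_948, 210_143, 210_252),
    "zh_hant": (1_772_280, 221_867, 220_157),
}


def split_fractions(counts):
    """Return {language: (total_samples, train_fraction)}."""
    result = {}
    for lang, (train, test, val) in counts.items():
        total = train + test + val
        result[lang] = (total, train / total)
    return result


if __name__ == "__main__":
    grand_total = 0
    for lang, (total, frac) in split_fractions(SPLIT_COUNTS).items():
        grand_total += total
        print(f"{lang:8s} total={total:>10,} train={frac:.1%}")
    print(f"all      total={grand_total:>10,}")
```

The per-language sums match the "Total Samples" column, and train accounts for roughly 80% of each language, with the remainder split about evenly between test and validation.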
## Related Model
This dataset was created to train the detection, recognition, and relational components of [**Nemotron OCR v2**](https://huggingface.co/nvidia/nemotron-ocr-v2). See the model card for architecture details, evaluation results, and usage instructions.
## Directory Layout
```
OCR-Synthetic-Multilingual-v1/
├── en/
│   ├── train/
│   │   ├── train_000.h5
│   │   ├── train_001.h5
│   │   └── ...
│   ├── test/
│   │   └── ...
│   └── validation/
│       └── ...
├── ja/
│   └── ...
├── ko/
│   └── ...
├── ru/
│   └── ...
├── zh_hans/
│   └── ...
└── zh_hant/
    └── ...
```
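Given this layout, the shards for one language and split can be collected with a directory glob. The following is a minimal sketch that assumes the dataset has been downloaded to a local directory. The `root` argument and the `list_shards` helper are hypothetical, and the code globs for `*.h5` instead of assuming file names, since the tree only spells out the `train_NNN.h5` pattern for the train split.

```python
from pathlib import Path

# Subfolder and split names as shown in the directory layout.
LANGUAGES = ("en", "ja", "ko", "ru", "zh_hans", "zh_hant")
SPLITS = ("train", "test", "validation")


def list_shards(root, language, split):
    """Return the sorted .h5 shard paths for one language and split."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language subfolder: {language!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return sorted((Path(root) / language / split).glob("*.h5"))


if __name__ == "__main__":
    # "OCR-Synthetic-Multilingual-v1" is an assumed local download path.
    for path in list_shards("OCR-Synthetic-Multilingual-v1", "en", "train")[:3]:
        print(path)
```

Sorting keeps shard order deterministic, which helps when work is divided across data-loading workers.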
## Format — HDF5
Each `.h5` file contains the following datasets (HDF5 terminology):
| Key | Type | Description |
|---------------|-------------------------------|-------------------------------------------------------|
| `images` | object (variable-length bytes)| JPEG-encoded image bytes, one entry per sample |
| `annotations` | object (variable-length str) | JSON string per sample containing bounding-box annotations |
| `dimensions` | int array `[H, W]` | Original image dimensions |
| `labels` | object (string) | Full-page text label |
| `qualities` | int | JPEG quality used during encoding (typically 100) |
| `sample_ids` | int | Unique sample identifier |
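Before training on a shard, it can be useful to confirm that all six datasets are present and row-aligned. The sketch below assumes every dataset holds exactly one entry per sample, which the table implies but does not state for every key. The `check_shard` helper is hypothetical. It accepts an open `h5py.File` or any mapping from key to sequence, so the check itself does not depend on h5py.

```python
EXPECTED_KEYS = (
    "images", "annotations", "dimensions",
    "labels", "qualities", "sample_ids",
)


def check_shard(store):
    """Verify keys and lengths, then return the number of samples.

    `store` is an open h5py.File or any mapping of key -> sequence.
    """
    missing = [key for key in EXPECTED_KEYS if key not in store]
    if missing:
        raise KeyError(f"missing datasets: {missing}")
    # len() on an h5py dataset returns the size of its first axis.
    lengths = {key: len(store[key]) for key in EXPECTED_KEYS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"datasets differ in length: {lengths}")
    return lengths["images"]


# Usage with a real shard (requires h5py and a local copy of the file):
#   import h5py
#   with h5py.File("en/train/train_000.h5", "r") as f:
#       print(check_shard(f))
```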
## Annotation JSON Schema
Each entry in `annotations` is a JSON object:
```json
{
  "word_bboxes": [
    {
      "text": "example word or phrase",
      "bbox": [x, y, w, h],
      "quad": [[x0,y0], [x1,y1], [x2,y2], [x3,y3]]
    }
  ],
  "line_bboxes": [
    {
      "text": "full line of text",
      "bbox": [x, y, w, h],
      "quad": [[x0,y0], [x1,y1], [x2,y2], [x3,y3]],
      "para_idx": 0,
      "line_idx": 0,
      "word_indices": [0, 1, 2]
    }
  ],
  "para_bboxes": [...],
  "relation_graph": [
    [[0], [1], [2]],
    [[3], [4]]
  ]
}
```
### Bounding Box Levels
- **`word_bboxes`** — One entry per word/phrase rendered as a single unit. Each contains `text`, an axis-aligned `bbox [x, y, w, h]`, and a 4-point `quad`.
- **`line_bboxes`** — One entry per text line. Includes all `word_bboxes` fields plus `para_idx` (paragraph index), `line_idx` (line index within the paragraph), and `word_indices` (indices into `word_bboxes` that compose this line).
- **`para_bboxes`** — One entry per paragraph bounding box.
- **`relation_graph`** — Nested list encoding reading order: `relation_graph[para][sentence]` gives a list of word/line indices belonging to that sentence within the paragraph.
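The links between these levels can be followed with plain list indexing. The sketch below resolves a line's `word_indices` into word entries and walks `relation_graph` in order. The card describes the graph indices as "word/line indices", so the `level` parameter that selects which list they index is an assumption to verify against real data. `TOY` is a hand-written annotation, not a dataset sample, and both helper functions are illustrative.

```python
# Hand-written toy annotation (quads omitted for brevity).
TOY = {
    "word_bboxes": [
        {"text": "Hello", "bbox": [0, 0, 50, 10]},
        {"text": "world", "bbox": [55, 0, 50, 10]},
        {"text": "Bye", "bbox": [0, 40, 30, 10]},
    ],
    "line_bboxes": [
        {"text": "Hello world", "bbox": [0, 0, 105, 10],
         "para_idx": 0, "line_idx": 0, "word_indices": [0, 1]},
        {"text": "Bye", "bbox": [0, 40, 30, 10],
         "para_idx": 1, "line_idx": 0, "word_indices": [2]},
    ],
    "para_bboxes": [],
    "relation_graph": [[[0]], [[1]]],
}


def words_of_line(annotation, line):
    """Return the word entries that line["word_indices"] points at."""
    words = annotation["word_bboxes"]
    return [words[i] for i in line["word_indices"]]


def reading_order(annotation, level="line_bboxes"):
    """Yield (para, sentence, entry) in relation_graph order.

    Assumption: the graph's indices refer to entries of `level`.
    """
    entries = annotation[level]
    for p, paragraph in enumerate(annotation["relation_graph"]):
        for s, sentence in enumerate(paragraph):
            for idx in sentence:
                yield p, s, entries[idx]


if __name__ == "__main__":
    for p, s, entry in reading_order(TOY):
        words = [w["text"] for w in words_of_line(TOY, entry)]
        print(p, s, entry["text"], words)
```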
### Quad Vertex Convention
Quads are 4-point polygons stored as `[[x0,y0], [x1,y1], [x2,y2], [x3,y3]]` in clockwise order:
```
v0 -------- v1
|            |
v3 -------- v2
```
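The vertex convention can be checked numerically with the shoelace formula. The sketch below assumes the usual image coordinate system, with the origin at the top-left and the y axis pointing down. Under that assumption a positive signed area corresponds to the clockwise order drawn above. The helper names are illustrative, and the enclosing box computed by `quad_to_bbox` is not guaranteed to equal the stored `bbox` field.

```python
def quad_to_bbox(quad):
    """Axis-aligned [x, y, w, h] box enclosing a 4-point quad."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


def signed_area(quad):
    """Shoelace formula over the quad's vertices.

    With the y axis pointing down (image coordinates), a positive
    result means the vertices run clockwise as seen on screen.
    """
    points = [tuple(point) for point in quad]
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return total / 2.0


def is_clockwise(quad):
    """True if the quad follows the v0 -> v1 -> v2 -> v3 order above."""
    return signed_area(quad) > 0


if __name__ == "__main__":
    # v0 top-left, v1 top-right, v2 bottom-right, v3 bottom-left.
    quad = [[0, 0], [10, 0], [10, 5], [0, 5]]
    print(quad_to_bbox(quad), signed_area(quad), is_clockwise(quad))
```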
## Loading Example
```python
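import io
import json

import h5py
from PIL import Image

# Open one shard read-only and decode its first sample.
with h5py.File("en/train/train_000.h5", "r") as f:
    img_bytes = f["images"][0]  # JPEG-encoded bytes for sample 0
    image = Image.open(io.BytesIO(img_bytes.tobytes())).convert("RGB")
    annotation = json.loads(f["annotations"][0])  # per-sample JSON string

# Print each text line together with its 4-point quad.
for line in annotation["line_bboxes"]:
    print(line["text"], line["quad"])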
```
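To go beyond a single sample, the same reads can be wrapped in a generator that walks a whole shard. This is a sketch under stated assumptions, not the pipeline used to train the model. `iter_samples` and its helpers are hypothetical, strings are assumed to be UTF-8 encoded, and every dataset is assumed to be row-aligned. Image decoding is left to the caller so the generator needs only the standard library.

```python
import json


def _as_bytes(value):
    """Accept bytes, or a NumPy-style object exposing tobytes()."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    return value.tobytes()


def _as_text(value):
    """Decode bytes as UTF-8 (assumed encoding); pass str through."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode("utf-8")
    return str(value)


def iter_samples(store, start=0, stop=None):
    """Yield one dict per sample from an open h5py.File or a mapping."""
    count = len(store["images"])
    stop = count if stop is None else min(stop, count)
    for i in range(start, stop):
        # `dimensions` is documented as [H, W].
        height, width = (int(v) for v in store["dimensions"][i])
        yield {
            "sample_id": int(store["sample_ids"][i]),
            "jpeg": _as_bytes(store["images"][i]),
            "annotation": json.loads(_as_text(store["annotations"][i])),
            "label": _as_text(store["labels"][i]),
            "height": height,
            "width": width,
        }


# Usage with a real shard (requires h5py and Pillow):
#   import io, h5py
#   from PIL import Image
#   with h5py.File("en/train/train_000.h5", "r") as f:
#       for sample in iter_samples(f, stop=4):
#           image = Image.open(io.BytesIO(sample["jpeg"])).convert("RGB")
#           print(sample["sample_id"], image.size)
```

Reading one index at a time keeps memory use low but issues many small HDF5 reads. For throughput, slicing each dataset in batches is likely a better fit.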
---
## Per-Language Details
### English (`en`)
| Property | Value |
|----------------|--------------------------------------|
| Language | English (`en`) |
| Total Samples | 1,825,089 |
| Train | 1,460,304 samples (63 files) |
| Test | 183,629 samples (63 files) |
| Validation | 181,156 samples (63 files) |
### Japanese (`ja`)
| Property | Value |
|----------------|--------------------------------------|
| Language | Japanese (`ja`) |
| Total Samples | 1,889,137 |
| Train | 1,502,712 samples (67 files) |
| Test | 193,779 samples (67 files) |
| Validation | 192,646 samples (67 files) |
### Korean (`ko`)
| Property | Value |
|----------------|--------------------------------------|
| Language | Korean (`ko`) |
| Total Samples | 2,269,540 |
| Train | 1,814,994 samples (78 files) |
| Test | 227,091 samples (78 files) |
| Validation | 227,455 samples (78 files) |
### Russian (`ru`)
| Property | Value |
|----------------|--------------------------------------|
| Language | Russian (`ru`) |
| Total Samples | 1,724,733 |
| Train | 1,380,404 samples (59 files) |
| Test | 171,678 samples (59 files) |
| Validation | 172,651 samples (59 files) |
### Chinese Simplified (`zh_hans`)
| Property | Value |
|----------------|--------------------------------------|
| Language | Chinese (Simplified) (`zh_hans`) |
| Total Samples | 2,335,343 |
| Train | 1,914,948 samples (83 files) |
| Test | 210,143 samples (73 files) |
| Validation | 210,252 samples (73 files) |
### Chinese Traditional (`zh_hant`)
| Property | Value |
|----------------|--------------------------------------|
| Language | Chinese (Traditional) (`zh_hant`) |
| Total Samples | 2,214,304 |
| Train | 1,772,280 samples (77 files) |
| Test | 221,867 samples (77 files) |
| Validation | 220,157 samples (77 files) |
## Acknowledgements
The synthetic data generation pipeline is based on [SynthDoG](https://github.com/clovaai/donut/tree/master/synthdog) from the Donut project, with substantial modifications to support additional languages, custom rendering effects, structured bounding-box annotations (word/line/paragraph levels with reading-order graphs), and HDF5 output.
## Citation
If you use this dataset, please cite:
```bibtex
@misc{chesler2026ocr_synthetic_multilingual,
title = {{OCR-Synthetic-Multilingual-v1}},
author = {Chesler, Ryan},
year = {2026},
publisher = {NVIDIA},
url = {https://huggingface.co/datasets/nvidia/OCR-Synthetic-Multilingual-v1},
note = {Synthetically generated multilingual OCR dataset built on a heavily modified SynthDoG pipeline}
}
```
Provider:
Nishant2414



