Arturito1/OCR-Synthetic-Multilingual-v1
Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Arturito1/OCR-Synthetic-Multilingual-v1
Resource description:
---
license: cc-by-4.0
task_categories:
- object-detection
- image-to-text
language:
- en
- ja
- ko
- ru
- zh
tags:
- ocr
- text-detection
- text-recognition
- synthetic-data
- synthdog
- hdf5
- nvidia
- nemotron
pretty_name: OCR Synthetic Multilingual v1
size_categories:
- 10M<n<100M
---
# OCR-Synthetic-Multilingual-v1
## Overview
Large-scale synthetically generated OCR training dataset for multilingual text detection and recognition. The data was produced using a heavily modified and extended version of [SynthDoG](https://github.com/clovaai/donut/tree/master/synthdog) (Synthetic Document Generator), originally introduced in the [Donut](https://github.com/clovaai/donut) project by Kim et al.
This dataset was used to train [**Nemotron OCR v2**](https://huggingface.co/nvidia/nemotron-ocr-v2), a state-of-the-art multilingual OCR model that is part of the [NVIDIA NeMo Retriever](https://developer.nvidia.com/nemo-retriever/) collection.
## Languages
| Subfolder | Language | Total Samples | Train | Test | Validation |
|-------------|------------------------|---------------|----------------|----------------|----------------|
| `en` | English | 1,825,089 | 1,460,304 (63) | 183,629 (63) | 181,156 (63) |
| `ja` | Japanese | 1,889,137 | 1,502,712 (67) | 193,779 (67) | 192,646 (67) |
| `ko` | Korean | 2,269,540 | 1,814,994 (78) | 227,091 (78) | 227,455 (78) |
| `ru` | Russian | 1,724,733 | 1,380,404 (59) | 171,678 (59) | 172,651 (59) |
| `zh_hans` | Chinese (Simplified) | 2,335,343 | 1,914,948 (83) | 210,143 (73) | 210,252 (73) |
| `zh_hant` | Chinese (Traditional) | 2,214,304 | 1,772,280 (77) | 221,867 (77) | 220,157 (77) |
| **Total** | | **12,258,146** | **9,845,642** | **1,208,187** | **1,204,317** |
> Numbers in parentheses are the number of `.h5` files per split.
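
The split counts above can be sanity-checked with a few lines of Python. This is only a sketch: the figures are copied from the table, and the names `SPLITS`, `language_total`, and `train_fraction` are illustrative helpers, not part of the dataset.

```python
# Sample counts copied from the table above: (train, test, validation).
SPLITS = {
    "en": (1_460_304, 183_629, 181_156),
    "ja": (1_502_712, 193_779, 192_646),
    "ko": (1_814_994, 227_091, 227_455),
    "ru": (1_380_404, 171_678, 172_651),
    "zh_hans": (1_914_948, 210_143, 210_252),
    "zh_hant": (1_772_280, 221_867, 220_157),
}


def language_total(lang):
    """Sum the three splits for one language subfolder."""
    return sum(SPLITS[lang])


def train_fraction(lang):
    """Share of a language's samples that fall in the train split."""
    train, _test, _validation = SPLITS[lang]
    return train / language_total(lang)


grand_total = sum(language_total(lang) for lang in SPLITS)
print(f"all languages: {grand_total:,} samples")
for lang in SPLITS:
    print(f"{lang}: {language_total(lang):,} samples, {train_fraction(lang):.1%} train")
```

Each language comes out at roughly an 80/10/10 train/test/validation split, with `zh_hans` slightly more train-heavy than the rest.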
## Related Model
This dataset was created to train the detection, recognition, and relational components of [**Nemotron OCR v2**](https://huggingface.co/nvidia/nemotron-ocr-v2). See the model card for architecture details, evaluation results, and usage instructions.
## Directory Layout
```
OCR-Synthetic-Multilingual-v1/
├── en/
│   ├── train/
│   │   ├── train_000.h5
│   │   ├── train_001.h5
│   │   └── ...
│   ├── test/
│   │   └── ...
│   └── validation/
│       └── ...
├── ja/
│   └── ...
├── ko/
│   └── ...
├── ru/
│   └── ...
├── zh_hans/
│   └── ...
└── zh_hant/
    └── ...
```
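
Given this layout, the shard files for one language and split can be enumerated with the standard library. The sketch below assumes the dataset has been downloaded to a local folder; the helper name `list_shards` and the root path in the usage lines are hypothetical.

```python
from pathlib import Path


def list_shards(root, lang, split):
    """Return the sorted .h5 shard paths under <root>/<lang>/<split>/."""
    split_dir = Path(root) / lang / split
    return sorted(split_dir.glob("*.h5"))


if __name__ == "__main__":
    # "OCR-Synthetic-Multilingual-v1" is an assumed local download location.
    shards = list_shards("OCR-Synthetic-Multilingual-v1", "en", "train")
    print(len(shards), "shards found")
```

Comparing the number of paths returned with the file counts listed under Languages is a quick way to confirm that a download is complete.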
## Format — HDF5
Each `.h5` file contains the following datasets (HDF5 terminology):
| Key | Type | Description |
|---------------|-------------------------------|-------------------------------------------------------|
| `images` | object (variable-length bytes)| JPEG-encoded image bytes, one entry per sample |
| `annotations` | object (variable-length str) | JSON string per sample containing bounding-box annotations |
| `dimensions` | int array `[H, W]` | Original image dimensions |
| `labels` | object (string) | Full-page text label |
| `qualities` | int | JPEG quality used during encoding (typically 100) |
| `sample_ids` | int | Unique sample identifier |
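
Before training on a shard it can help to confirm that all six keys are present and hold the same number of entries. The sketch below assumes every dataset is indexed by sample along its first axis, which the table implies but does not state for every key. `check_shard` is a hypothetical helper; it accepts any mapping from key to a sized sequence, which includes an open `h5py.File`.

```python
EXPECTED_KEYS = (
    "images", "annotations", "dimensions",
    "labels", "qualities", "sample_ids",
)


def check_shard(shard):
    """Return the sample count after verifying keys and alignment.

    `shard` is any mapping from key to a sized sequence, for example an
    open h5py.File or a plain dict of lists.
    """
    missing = [key for key in EXPECTED_KEYS if key not in shard]
    if missing:
        raise KeyError(f"missing datasets: {missing}")
    lengths = {key: len(shard[key]) for key in EXPECTED_KEYS}
    if len(set(lengths.values())) != 1:
        # Assumption: one entry per sample in every dataset.
        raise ValueError(f"datasets are not aligned: {lengths}")
    return lengths["images"]


# Usage with h5py (third-party), assuming a local copy of the shard:
#   import h5py
#   with h5py.File("en/train/train_000.h5", "r") as f:
#       print(check_shard(f))
```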
## Annotation JSON Schema
Each entry in `annotations` is a JSON object:
```json
{
  "word_bboxes": [
    {
      "text": "example word or phrase",
      "bbox": [x, y, w, h],
      "quad": [[x0,y0], [x1,y1], [x2,y2], [x3,y3]]
    }
  ],
  "line_bboxes": [
    {
      "text": "full line of text",
      "bbox": [x, y, w, h],
      "quad": [[x0,y0], [x1,y1], [x2,y2], [x3,y3]],
      "para_idx": 0,
      "line_idx": 0,
      "word_indices": [0, 1, 2]
    }
  ],
  "para_bboxes": [...],
  "relation_graph": [
    [[0], [1], [2]],
    [[3], [4]]
  ]
}
```
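
Many detection toolkits expect corner-format boxes rather than `[x, y, w, h]`. The sketch below parses one annotation string and converts each line box to `[x0, y0, x1, y1]`. It assumes `(x, y)` is the top-left corner of the box, which the schema does not state explicitly; the helper names are illustrative.

```python
import json


def xywh_to_xyxy(bbox):
    """Convert [x, y, w, h] to [x0, y0, x1, y1] (assumes x, y is top-left)."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def line_boxes_xyxy(annotation_json):
    """Parse one annotation string; return (text, corner box) per line."""
    annotation = json.loads(annotation_json)
    return [
        (line["text"], xywh_to_xyxy(line["bbox"]))
        for line in annotation["line_bboxes"]
    ]


# A made-up annotation that follows the schema above.
sample = json.dumps({
    "word_bboxes": [],
    "line_bboxes": [{"text": "hello world", "bbox": [10, 20, 100, 30]}],
})
print(line_boxes_xyxy(sample))
```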
### Bounding Box Levels
- **`word_bboxes`** — One entry per word/phrase rendered as a single unit. Each contains `text`, an axis-aligned `bbox [x, y, w, h]`, and a 4-point `quad`.
- **`line_bboxes`** — One entry per text line. Includes all `word_bboxes` fields plus `para_idx` (paragraph index), `line_idx` (line index within the paragraph), and `word_indices` (indices into `word_bboxes` that compose this line).
- **`para_bboxes`** — One entry per paragraph bounding box.
- **`relation_graph`** — Nested list encoding reading order: `relation_graph[para][sentence]` gives a list of word/line indices belonging to that sentence within the paragraph.
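
The index fields described above can be resolved with plain list lookups. The sketch below is illustrative only: `words_of_line` follows `word_indices` into `word_bboxes`, and `reading_order_text` walks `relation_graph[para][sentence]`. Because the description leaves open whether graph indices point at words or lines, the caller passes in whichever list applies. The `joiner` parameter is an added convenience (an empty string is usually more suitable for Chinese and Japanese text).

```python
def words_of_line(annotation, line):
    """Return the word entries referenced by a line's `word_indices`."""
    words = annotation["word_bboxes"]
    return [words[i] for i in line["word_indices"]]


def reading_order_text(relation_graph, entries, joiner=" "):
    """Join entry texts in relation_graph[para][sentence] order.

    `entries` is the list the graph indices refer to; choosing between
    word_bboxes and line_bboxes is an assumption left to the caller.
    Sentences are separated by newlines, paragraphs by blank lines.
    """
    paragraphs = []
    for para in relation_graph:
        sentences = [
            joiner.join(entries[i]["text"] for i in sentence)
            for sentence in para
        ]
        paragraphs.append("\n".join(sentences))
    return "\n\n".join(paragraphs)


# A made-up annotation that follows the schema.
annotation = {
    "word_bboxes": [{"text": "Hello"}, {"text": "big"}, {"text": "world"}],
    "line_bboxes": [{"text": "Hello world", "word_indices": [0, 2]}],
}
line = annotation["line_bboxes"][0]
print([w["text"] for w in words_of_line(annotation, line)])
```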
### Quad Vertex Convention
Quads are 4-point polygons stored as `[[x0,y0], [x1,y1], [x2,y2], [x3,y3]]` in clockwise order:
```
v0 -------- v1
|            |
v3 -------- v2
```
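
A quad can be reduced to an axis-aligned box, and its winding order checked, with a few lines of arithmetic. The sketch below assumes image coordinates with the origin at the top-left and `y` increasing downward, as the diagram suggests; under that assumption a positive shoelace sum corresponds to clockwise order on screen. Both helper names are illustrative.

```python
def quad_to_bbox(quad):
    """Return the axis-aligned [x, y, w, h] box enclosing a 4-point quad."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    x, y = min(xs), min(ys)
    return [x, y, max(xs) - x, max(ys) - y]


def is_clockwise(quad):
    """True if the vertices run clockwise on screen (x right, y down)."""
    twice_area = 0
    # Shoelace formula over consecutive vertex pairs, wrapping around.
    for (x0, y0), (x1, y1) in zip(quad, quad[1:] + quad[:1]):
        twice_area += x0 * y1 - x1 * y0
    return twice_area > 0


quad = [[0, 0], [10, 0], [10, 5], [0, 5]]  # v0, v1, v2, v3 as in the diagram
print(quad_to_bbox(quad), is_clockwise(quad))
```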
## Loading Example
```python
import h5py, io, json
from PIL import Image

with h5py.File("en/train/train_000.h5", "r") as f:
    # Decode the first sample's JPEG bytes into an RGB image.
    img_bytes = f["images"][0]
    image = Image.open(io.BytesIO(img_bytes.tobytes())).convert("RGB")

    # Parse the matching annotation and print each line's text and quad.
    annotation = json.loads(f["annotations"][0])
    for line in annotation["line_bboxes"]:
        print(line["text"], line["quad"])
```
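
To go beyond the first sample, the same reads can be wrapped in a generator. This is a minimal sketch, not an official loader: `iter_samples` and `to_bytes` are hypothetical helpers, the shard is treated as a mapping of aligned sequences, and labels returned as `bytes` are assumed to be UTF-8.

```python
import json


def to_bytes(raw):
    """Accept a bytes-like object or a NumPy uint8 array; return bytes."""
    return raw.tobytes() if hasattr(raw, "tobytes") else bytes(raw)


def iter_samples(shard, limit=None):
    """Yield (jpeg_bytes, annotation_dict, label) for each sample.

    `shard` is any mapping with "images", "annotations", and "labels"
    entries of equal length, for example an open h5py.File.
    """
    count = len(shard["images"])
    if limit is not None:
        count = min(count, limit)
    for i in range(count):
        label = shard["labels"][i]
        if isinstance(label, bytes):
            label = label.decode("utf-8")  # assumed encoding
        annotation = json.loads(shard["annotations"][i])
        yield to_bytes(shard["images"][i]), annotation, label


# Usage with h5py (third-party), assuming a local copy of the shard:
#   import h5py
#   with h5py.File("en/train/train_000.h5", "r") as f:
#       for jpeg, annotation, label in iter_samples(f, limit=3):
#           print(len(jpeg), len(annotation["line_bboxes"]), label[:40])
```

Yielding raw JPEG bytes rather than decoded images keeps the generator free of an imaging dependency and leaves decoding to the caller.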
---
## Per-Language Details
### English (`en`)
| Property | Value |
|----------------|--------------------------------------|
| Language | English (`en`) |
| Total Samples | 1,825,089 |
| Train | 1,460,304 samples (63 files) |
| Test | 183,629 samples (63 files) |
| Validation | 181,156 samples (63 files) |
### Japanese (`ja`)
| Property | Value |
|----------------|--------------------------------------|
| Language | Japanese (`ja`) |
| Total Samples | 1,889,137 |
| Train | 1,502,712 samples (67 files) |
| Test | 193,779 samples (67 files) |
| Validation | 192,646 samples (67 files) |
### Korean (`ko`)
| Property | Value |
|----------------|--------------------------------------|
| Language | Korean (`ko`) |
| Total Samples | 2,269,540 |
| Train | 1,814,994 samples (78 files) |
| Test | 227,091 samples (78 files) |
| Validation | 227,455 samples (78 files) |
### Russian (`ru`)
| Property | Value |
|----------------|--------------------------------------|
| Language | Russian (`ru`) |
| Total Samples | 1,724,733 |
| Train | 1,380,404 samples (59 files) |
| Test | 171,678 samples (59 files) |
| Validation | 172,651 samples (59 files) |
### Chinese Simplified (`zh_hans`)
| Property | Value |
|----------------|--------------------------------------|
| Language | Chinese (Simplified) (`zh_hans`) |
| Total Samples | 2,335,343 |
| Train | 1,914,948 samples (83 files) |
| Test | 210,143 samples (73 files) |
| Validation | 210,252 samples (73 files) |
### Chinese Traditional (`zh_hant`)
| Property | Value |
|----------------|--------------------------------------|
| Language | Chinese (Traditional) (`zh_hant`) |
| Total Samples | 2,214,304 |
| Train | 1,772,280 samples (77 files) |
| Test | 221,867 samples (77 files) |
| Validation | 220,157 samples (77 files) |
## Acknowledgements
The synthetic data generation pipeline is based on [SynthDoG](https://github.com/clovaai/donut/tree/master/synthdog) from the Donut project, with substantial modifications to support additional languages, custom rendering effects, structured bounding-box annotations (word/line/paragraph levels with reading-order graphs), and HDF5 output.
## Citation
If you use this dataset, please cite:
```bibtex
@misc{chesler2026ocr_synthetic_multilingual,
title = {{OCR-Synthetic-Multilingual-v1}},
author = {Chesler, Ryan},
year = {2026},
publisher = {NVIDIA},
url = {https://huggingface.co/datasets/nvidia/OCR-Synthetic-Multilingual-v1},
note = {Synthetically generated multilingual OCR dataset built on a heavily modified SynthDoG pipeline}
}
```
Provided by:
Arturito1



