Name: Arturito1/OCR-Synthetic-Multilingual-v1
Creator: Arturito1
Published: 2026-04-19 21:18:00
License: No description available

Download link:

https://hf-mirror.com/datasets/Arturito1/OCR-Synthetic-Multilingual-v1

Resource description:

---
license: cc-by-4.0
task_categories:
- object-detection
- image-to-text
language:
- en
- ja
- ko
- ru
- zh
tags:
- ocr
- text-detection
- text-recognition
- synthetic-data
- synthdog
- hdf5
- nvidia
- nemotron
pretty_name: OCR Synthetic Multilingual v1
size_categories:
- 10M<n<100M
---

# OCR-Synthetic-Multilingual-v1

## Overview

Large-scale synthetically generated OCR training dataset for multilingual text detection and recognition. The data was produced using a heavily modified and extended version of [SynthDoG](https://github.com/clovaai/donut/tree/master/synthdog) (Synthetic Document Generator), originally introduced in the [Donut](https://github.com/clovaai/donut) project by Kim et al.

This dataset was used to train [**Nemotron OCR v2**](https://huggingface.co/nvidia/nemotron-ocr-v2), a state-of-the-art multilingual OCR model that is part of the [NVIDIA NeMo Retriever](https://developer.nvidia.com/nemo-retriever/) collection.

## Languages

| Subfolder | Language              | Total Samples  | Train          | Test          | Validation    |
|-----------|-----------------------|----------------|----------------|---------------|---------------|
| `en`      | English               | 1,825,089      | 1,460,304 (63) | 183,629 (63)  | 181,156 (63)  |
| `ja`      | Japanese              | 1,889,137      | 1,502,712 (67) | 193,779 (67)  | 192,646 (67)  |
| `ko`      | Korean                | 2,269,540      | 1,814,994 (78) | 227,091 (78)  | 227,455 (78)  |
| `ru`      | Russian               | 1,724,733      | 1,380,404 (59) | 171,678 (59)  | 172,651 (59)  |
| `zh_hans` | Chinese (Simplified)  | 2,335,343      | 1,914,948 (83) | 210,143 (73)  | 210,252 (73)  |
| `zh_hant` | Chinese (Traditional) | 2,214,304      | 1,772,280 (77) | 221,867 (77)  | 220,157 (77)  |
| **Total** |                       | **12,258,146** | **9,845,642**  | **1,208,187** | **1,204,317** |

> Numbers in parentheses are the number of `.h5` files per split.

## Related Model

This dataset was created to train the detection, recognition, and relational components of [**Nemotron OCR v2**](https://huggingface.co/nvidia/nemotron-ocr-v2). See the model card for architecture details, evaluation results, and usage instructions.

## Directory Layout

```
OCR-Synthetic-Multilingual-v1/
├── en/
│   ├── train/
│   │   ├── train_000.h5
│   │   ├── train_001.h5
│   │   └── ...
│   ├── test/
│   │   └── ...
│   └── validation/
│       └── ...
├── ja/
│   └── ...
├── ko/
│   └── ...
├── ru/
│   └── ...
├── zh_hans/
│   └── ...
└── zh_hant/
    └── ...
```

## Format: HDF5

Each `.h5` file contains the following datasets (HDF5 terminology):

| Key           | Type                           | Description                                                |
|---------------|--------------------------------|------------------------------------------------------------|
| `images`      | object (variable-length bytes) | JPEG-encoded image bytes, one entry per sample             |
| `annotations` | object (variable-length str)   | JSON string per sample containing bounding-box annotations |
| `dimensions`  | int array `[H, W]`             | Original image dimensions                                  |
| `labels`      | object (string)                | Full-page text label                                       |
| `qualities`   | int                            | JPEG quality used during encoding (typically 100)          |
| `sample_ids`  | int                            | Unique sample identifier                                   |

## Annotation JSON Schema

Each entry in `annotations` is a JSON object:

```json
{
  "word_bboxes": [
    {
      "text": "example word or phrase",
      "bbox": [x, y, w, h],
      "quad": [[x0,y0], [x1,y1], [x2,y2], [x3,y3]]
    }
  ],
  "line_bboxes": [
    {
      "text": "full line of text",
      "bbox": [x, y, w, h],
      "quad": [[x0,y0], [x1,y1], [x2,y2], [x3,y3]],
      "para_idx": 0,
      "line_idx": 0,
      "word_indices": [0, 1, 2]
    }
  ],
  "para_bboxes": [...],
  "relation_graph": [
    [[0], [1], [2]],
    [[3], [4]]
  ]
}
```

### Bounding Box Levels

- **`word_bboxes`**: One entry per word/phrase rendered as a single unit. Each contains `text`, an axis-aligned `bbox [x, y, w, h]`, and a 4-point `quad`.
- **`line_bboxes`**: One entry per text line. Includes all `word_bboxes` fields plus `para_idx` (paragraph index), `line_idx` (line index within the paragraph), and `word_indices` (indices into `word_bboxes` that compose this line).
- **`para_bboxes`**: One entry per paragraph bounding box.
- **`relation_graph`**: Nested list encoding reading order: `relation_graph[para][sentence]` gives a list of word/line indices belonging to that sentence within the paragraph.

### Quad Vertex Convention

Quads are 4-point polygons stored as `[[x0,y0], [x1,y1], [x2,y2], [x3,y3]]` in clockwise order:

```
v0 -------- v1
|            |
v3 -------- v2
```

## Loading Example

```python
import io
import json

import h5py
from PIL import Image

with h5py.File("en/train/train_000.h5", "r") as f:
    # Each entry in "images" holds the JPEG-encoded bytes of one sample.
    img_bytes = f["images"][0]
    image = Image.open(io.BytesIO(img_bytes.tobytes())).convert("RGB")

    # Each entry in "annotations" is a JSON string following the schema above.
    annotation = json.loads(f["annotations"][0])
    for line in annotation["line_bboxes"]:
        print(line["text"], line["quad"])
```

---

## Per-Language Details

### English (`en`)

| Property      | Value                        |
|---------------|------------------------------|
| Language      | English (`en`)               |
| Total Samples | 1,825,089                    |
| Train         | 1,460,304 samples (63 files) |
| Test          | 183,629 samples (63 files)   |
| Validation    | 181,156 samples (63 files)   |

### Japanese (`ja`)

| Property      | Value                        |
|---------------|------------------------------|
| Language      | Japanese (`ja`)              |
| Total Samples | 1,889,137                    |
| Train         | 1,502,712 samples (67 files) |
| Test          | 193,779 samples (67 files)   |
| Validation    | 192,646 samples (67 files)   |

### Korean (`ko`)

| Property      | Value                        |
|---------------|------------------------------|
| Language      | Korean (`ko`)                |
| Total Samples | 2,269,540                    |
| Train         | 1,814,994 samples (78 files) |
| Test          | 227,091 samples (78 files)   |
| Validation    | 227,455 samples (78 files)   |

### Russian (`ru`)

| Property      | Value                        |
|---------------|------------------------------|
| Language      | Russian (`ru`)               |
| Total Samples | 1,724,733                    |
| Train         | 1,380,404 samples (59 files) |
| Test          | 171,678 samples (59 files)   |
| Validation    | 172,651 samples (59 files)   |

### Chinese Simplified (`zh_hans`)

| Property      | Value                            |
|---------------|----------------------------------|
| Language      | Chinese (Simplified) (`zh_hans`) |
| Total Samples | 2,335,343                        |
| Train         | 1,914,948 samples (83 files)     |
| Test          | 210,143 samples (73 files)       |
| Validation    | 210,252 samples (73 files)       |

### Chinese Traditional (`zh_hant`)

| Property      | Value                             |
|---------------|-----------------------------------|
| Language      | Chinese (Traditional) (`zh_hant`) |
| Total Samples | 2,214,304                         |
| Train         | 1,772,280 samples (77 files)      |
| Test          | 221,867 samples (77 files)        |
| Validation    | 220,157 samples (77 files)        |

## Acknowledgements

The synthetic data generation pipeline is based on [SynthDoG](https://github.com/clovaai/donut/tree/master/synthdog) from the Donut project, with substantial modifications to support additional languages, custom rendering effects, structured bounding-box annotations (word/line/paragraph levels with reading-order graphs), and HDF5 output.

## Citation

If you use this dataset, please cite:

```bibtex
@misc{chesler2026ocr_synthetic_multilingual,
  title     = {{OCR-Synthetic-Multilingual-v1}},
  author    = {Chesler, Ryan},
  year      = {2026},
  publisher = {NVIDIA},
  url       = {https://huggingface.co/datasets/nvidia/OCR-Synthetic-Multilingual-v1},
  note      = {Synthetically generated multilingual OCR dataset built on a heavily modified SynthDoG pipeline}
}
```
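As a supplement to the Annotation JSON Schema section above, the following is a minimal sketch of how one parsed annotation might be traversed. It is not part of the dataset's tooling: the helper names (`bbox_to_quad`, `words_in_line`, `flatten_relation_graph`) and the toy annotation are hypothetical and written by hand for illustration. The card describes `relation_graph` entries as "word/line indices" without saying which list they index, so the sketch takes the target list as a parameter and defaults to `line_bboxes` as an assumption.

```python
import json


def bbox_to_quad(bbox):
    """Convert an axis-aligned [x, y, w, h] box to a clockwise 4-point quad."""
    x, y, w, h = bbox
    # v0 top-left, v1 top-right, v2 bottom-right, v3 bottom-left
    return [[x, y], [x + w, y], [x + w, y + h], [x, y + h]]


def words_in_line(annotation, line):
    """Resolve a line's `word_indices` to the text of the referenced words."""
    words = annotation["word_bboxes"]
    return [words[i]["text"] for i in line["word_indices"]]


def flatten_relation_graph(annotation, level="line_bboxes"):
    """Yield (paragraph index, sentence index, text) in reading order.

    Assumption: indices in `relation_graph` point into `annotation[level]`.
    """
    entries = annotation[level]
    for p, paragraph in enumerate(annotation["relation_graph"]):
        for s, sentence in enumerate(paragraph):
            for idx in sentence:
                yield p, s, entries[idx]["text"]


# Hand-written toy annotation in the documented shape (not real dataset content).
toy = json.loads("""
{
  "word_bboxes": [
    {"text": "hello", "bbox": [10, 20, 50, 12],
     "quad": [[10, 20], [60, 20], [60, 32], [10, 32]]},
    {"text": "world", "bbox": [65, 20, 52, 12],
     "quad": [[65, 20], [117, 20], [117, 32], [65, 32]]}
  ],
  "line_bboxes": [
    {"text": "hello world", "bbox": [10, 20, 107, 12],
     "quad": [[10, 20], [117, 20], [117, 32], [10, 32]],
     "para_idx": 0, "line_idx": 0, "word_indices": [0, 1]}
  ],
  "para_bboxes": [],
  "relation_graph": [[[0]]]
}
""")

line = toy["line_bboxes"][0]
print(words_in_line(toy, line))
print(bbox_to_quad(line["bbox"]))
print(list(flatten_relation_graph(toy)))
```

With a real file, the same helpers could be applied to the result of `json.loads(f["annotations"][i])` from the loading example. If the graph turns out to index words rather than lines, passing `level="word_bboxes"` would be the corresponding call.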

Application scenarios: