OCR-Synthetic-Multilingual-v1

ModelScope community. Updated 2026-04-28, indexed 2026-05-03.
Download link:
https://modelscope.cn/datasets/nv-community/OCR-Synthetic-Multilingual-v1
# OCR-Synthetic-Multilingual-v1

## Dataset Description

Large-scale synthetically generated OCR training dataset for multilingual text detection and recognition. The data was produced using a heavily modified and extended version of [SynthDoG](https://github.com/clovaai/donut/tree/master/synthdog) (Synthetic Document Generator), originally introduced in the [Donut](https://github.com/clovaai/donut) project by Kim et al.

This dataset was used to train [**Nemotron OCR v2**](https://huggingface.co/nvidia/nemotron-ocr-v2), a state-of-the-art multilingual OCR model that is part of the [NVIDIA NeMo Retriever](https://developer.nvidia.com/nemo-retriever/) collection.

This dataset is ready for commercial/non-commercial use.

## Dataset Owner(s):

NVIDIA Corporation

## Dataset Creation Date:

April 15, 2026

## License/Terms of Use:

Dataset Governing Terms: Use of the dataset is governed by the Creative Commons Attribution 4.0 International License (CC BY 4.0).

## Intended Usage:

This dataset is intended for machine learning researchers, AI engineers, and developers working on information retrieval with OCR.

## Dataset Characterization

**Data Collection Method**
* [Hybrid: Human, Automated, Synthetic]

**Labeling Method**
* [Not Applicable]

## Dataset Format

```
OCR-Synthetic-Multilingual-v1/
├── en/
│   ├── train/
│   │   ├── train_000.h5
│   │   ├── train_001.h5
│   │   └── ...
│   ├── test/
│   │   └── ...
│   └── validation/
│       └── ...
├── ja/
│   └── ...
├── ko/
│   └── ...
├── ru/
│   └── ...
├── zh_hans/
│   └── ...
└── zh_hant/
    └── ...
```

## Format: HDF5

Each `.h5` file contains the following datasets (HDF5 terminology):

| Key           | Type                           | Description                                                |
|---------------|--------------------------------|------------------------------------------------------------|
| `images`      | object (variable-length bytes) | JPEG-encoded image bytes, one entry per sample             |
| `annotations` | object (variable-length str)   | JSON string per sample containing bounding-box annotations |
| `dimensions`  | int array `[H, W]`             | Original image dimensions                                  |
| `labels`      | object (string)                | Full-page text label                                       |
| `qualities`   | int                            | JPEG quality used during encoding (typically 100)          |
| `sample_ids`  | int                            | Unique sample identifier                                   |

## Annotation JSON Schema

Each entry in `annotations` is a JSON object:

```json
{
  "word_bboxes": [
    {
      "text": "example word or phrase",
      "bbox": [x, y, w, h],
      "quad": [[x0,y0], [x1,y1], [x2,y2], [x3,y3]]
    }
  ],
  "line_bboxes": [
    {
      "text": "full line of text",
      "bbox": [x, y, w, h],
      "quad": [[x0,y0], [x1,y1], [x2,y2], [x3,y3]],
      "para_idx": 0,
      "line_idx": 0,
      "word_indices": [0, 1, 2]
    }
  ],
  "para_bboxes": [...],
  "relation_graph": [
    [[0], [1], [2]],
    [[3], [4]]
  ]
}
```

### Bounding Box Levels

- **`word_bboxes`**: One entry per word/phrase rendered as a single unit. Each contains `text`, an axis-aligned `bbox [x, y, w, h]`, and a 4-point `quad`.
- **`line_bboxes`**: One entry per text line. Includes all `word_bboxes` fields plus `para_idx` (paragraph index), `line_idx` (line index within the paragraph), and `word_indices` (indices into `word_bboxes` that compose this line).
- **`para_bboxes`**: One entry per paragraph bounding box.
- **`relation_graph`**: Nested list encoding reading order: `relation_graph[para][sentence]` gives a list of word/line indices belonging to that sentence within the paragraph.

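
The bounding-box levels above can be navigated with a few small helpers. The following is a minimal sketch, not part of the dataset's tooling: the function names (`bbox_to_corners`, `quad_bounds`, `lines_by_paragraph`) and the `toy` annotation are hypothetical, and it assumes that `(x, y)` in `bbox` is the top-left corner in pixel coordinates, which the card does not state explicitly.

```python
from collections import defaultdict


def bbox_to_corners(bbox):
    """Convert [x, y, w, h] to (x_min, y_min, x_max, y_max).

    Assumption: (x, y) is the top-left corner in pixel coordinates.
    """
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def quad_bounds(quad):
    """Axis-aligned bounds (x_min, y_min, x_max, y_max) enclosing a 4-point quad."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    return (min(xs), min(ys), max(xs), max(ys))


def lines_by_paragraph(annotation):
    """Group lines by para_idx, order them by line_idx, and resolve word_indices."""
    words = annotation["word_bboxes"]
    grouped = defaultdict(list)
    for line in annotation["line_bboxes"]:
        grouped[line["para_idx"]].append(line)

    result = {}
    for para_idx in sorted(grouped):
        ordered = sorted(grouped[para_idx], key=lambda line: line["line_idx"])
        result[para_idx] = [
            {
                "text": line["text"],
                "corners": bbox_to_corners(line["bbox"]),
                # word_indices point into the word_bboxes list.
                "words": [words[i]["text"] for i in line["word_indices"]],
            }
            for line in ordered
        ]
    return result


# Hypothetical toy annotation, made up for illustration only.
toy = {
    "word_bboxes": [
        {"text": "Hello", "bbox": [10, 10, 50, 20],
         "quad": [[10, 10], [60, 10], [60, 30], [10, 30]]},
        {"text": "world", "bbox": [70, 10, 50, 20],
         "quad": [[70, 10], [120, 10], [120, 30], [70, 30]]},
        {"text": "Bye", "bbox": [10, 40, 30, 20],
         "quad": [[10, 40], [40, 40], [40, 60], [10, 60]]},
    ],
    "line_bboxes": [
        # Deliberately listed out of order to show the sort by line_idx.
        {"text": "Bye", "bbox": [10, 40, 30, 20],
         "quad": [[10, 40], [40, 40], [40, 60], [10, 60]],
         "para_idx": 0, "line_idx": 1, "word_indices": [2]},
        {"text": "Hello world", "bbox": [10, 10, 110, 20],
         "quad": [[10, 10], [120, 10], [120, 30], [10, 30]],
         "para_idx": 0, "line_idx": 0, "word_indices": [0, 1]},
    ],
}

for para_idx, lines in lines_by_paragraph(toy).items():
    for line in lines:
        print(para_idx, line["text"], line["corners"], line["words"])
```

With a real sample, the dictionary returned by `json.loads` on an `annotations` entry would take the place of `toy`.
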
### Quad Vertex Convention

Quads are 4-point polygons stored as `[[x0,y0], [x1,y1], [x2,y2], [x3,y3]]` in clockwise order:

```
v0 -------- v1
|            |
v3 -------- v2
```

## Loading Example

```python
import io
import json

import h5py
from PIL import Image

with h5py.File("en/train/train_000.h5", "r") as f:
    # Each entry in "images" holds the JPEG-encoded bytes of one sample.
    img_bytes = f["images"][0]
    image = Image.open(io.BytesIO(img_bytes.tobytes())).convert("RGB")

    # Each entry in "annotations" is a JSON string with the bounding-box annotations.
    annotation = json.loads(f["annotations"][0])
    for line in annotation["line_bboxes"]:
        print(line["text"], line["quad"])
```

---

## Dataset Quantification

### Languages

| Subfolder   | Language              | Total Samples  | Train          | Test           | Validation     |
|-------------|-----------------------|----------------|----------------|----------------|----------------|
| `en`        | English               | 1,825,089      | 1,460,304 (63) | 183,629 (63)   | 181,156 (63)   |
| `ja`        | Japanese              | 1,889,137      | 1,502,712 (67) | 193,779 (67)   | 192,646 (67)   |
| `ko`        | Korean                | 2,269,540      | 1,814,994 (78) | 227,091 (78)   | 227,455 (78)   |
| `ru`        | Russian               | 1,724,733      | 1,380,404 (59) | 171,678 (59)   | 172,651 (59)   |
| `zh_hans`   | Chinese (Simplified)  | 2,335,343      | 1,914,948 (83) | 210,143 (73)   | 210,252 (73)   |
| `zh_hant`   | Chinese (Traditional) | 2,214,304      | 1,772,280 (77) | 221,867 (77)   | 220,157 (77)   |
| **Total**   |                       | **12,258,146** | **9,845,642**  | **1,208,187**  | **1,204,317**  |

> Numbers in parentheses are the number of `.h5` files per split.

## Related Model

This dataset was created to train the detection, recognition, and relational components of [**Nemotron OCR v2**](https://huggingface.co/nvidia/nemotron-ocr-v2). See the model card for architecture details, evaluation results, and usage instructions.

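
The split counts in the Languages table can be cross-checked with a few lines of arithmetic. This is only an illustrative sketch: the numbers are copied from the table, while the variable names are made up here.

```python
# Per-language split counts copied from the Languages table: (train, test, validation).
splits = {
    "en": (1_460_304, 183_629, 181_156),
    "ja": (1_502_712, 193_779, 192_646),
    "ko": (1_814_994, 227_091, 227_455),
    "ru": (1_380_404, 171_678, 172_651),
    "zh_hans": (1_914_948, 210_143, 210_252),
    "zh_hant": (1_772_280, 221_867, 220_157),
}

# "Total Samples" column from the same table.
totals = {
    "en": 1_825_089,
    "ja": 1_889_137,
    "ko": 2_269_540,
    "ru": 1_724_733,
    "zh_hans": 2_335_343,
    "zh_hant": 2_214_304,
}

# Each row: train + test + validation should equal the row total.
for lang, counts in splits.items():
    assert sum(counts) == totals[lang], lang

# Each column: summing over languages should give the "Total" row.
train, test, validation = (sum(column) for column in zip(*splits.values()))
grand_total = sum(totals.values())

print(f"train={train:,} test={test:,} validation={validation:,} total={grand_total:,}")
print(f"train share: {train / grand_total:.1%}")
```

The row and column sums agree with the table, and the splits come out at roughly 80/10/10 for train, test, and validation.
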
## Per-Language Details

### English (`en`)

| Property       | Value                                |
|----------------|--------------------------------------|
| Language       | English (`en`)                       |
| Total Samples  | 1,825,089                            |
| Train          | 1,460,304 samples (63 files)         |
| Test           | 183,629 samples (63 files)           |
| Validation     | 181,156 samples (63 files)           |

### Japanese (`ja`)

| Property       | Value                                |
|----------------|--------------------------------------|
| Language       | Japanese (`ja`)                      |
| Total Samples  | 1,889,137                            |
| Train          | 1,502,712 samples (67 files)         |
| Test           | 193,779 samples (67 files)           |
| Validation     | 192,646 samples (67 files)           |

### Korean (`ko`)

| Property       | Value                                |
|----------------|--------------------------------------|
| Language       | Korean (`ko`)                        |
| Total Samples  | 2,269,540                            |
| Train          | 1,814,994 samples (78 files)         |
| Test           | 227,091 samples (78 files)           |
| Validation     | 227,455 samples (78 files)           |

### Russian (`ru`)

| Property       | Value                                |
|----------------|--------------------------------------|
| Language       | Russian (`ru`)                       |
| Total Samples  | 1,724,733                            |
| Train          | 1,380,404 samples (59 files)         |
| Test           | 171,678 samples (59 files)           |
| Validation     | 172,651 samples (59 files)           |

### Chinese Simplified (`zh_hans`)

| Property       | Value                                |
|----------------|--------------------------------------|
| Language       | Chinese (Simplified) (`zh_hans`)     |
| Total Samples  | 2,335,343                            |
| Train          | 1,914,948 samples (83 files)         |
| Test           | 210,143 samples (73 files)           |
| Validation     | 210,252 samples (73 files)           |

### Chinese Traditional (`zh_hant`)

| Property       | Value                                |
|----------------|--------------------------------------|
| Language       | Chinese (Traditional) (`zh_hant`)    |
| Total Samples  | 2,214,304                            |
| Train          | 1,772,280 samples (77 files)         |
| Test           | 221,867 samples (77 files)           |
| Validation     | 220,157 samples (77 files)           |

Total Data Storage: 5.45 TB

## Reference(s):

- [SynthDoG](https://github.com/clovaai/donut/tree/master/synthdog) (Synthetic Document Generator), from the [Donut](https://github.com/clovaai/donut) project by Kim et al.

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal developer teams to ensure this dataset meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

## Acknowledgements

The synthetic data generation pipeline is based on [SynthDoG](https://github.com/clovaai/donut/tree/master/synthdog) from the Donut project, with substantial modifications to support additional languages, custom rendering effects, structured bounding-box annotations (word/line/paragraph levels with reading-order graphs), and HDF5 output.

## Citation

If you use this dataset, please cite:

```bibtex
@misc{chesler2026ocr_synthetic_multilingual,
  title     = {{OCR-Synthetic-Multilingual-v1}},
  author    = {Chesler, Ryan},
  year      = {2026},
  publisher = {NVIDIA},
  url       = {https://huggingface.co/datasets/nvidia/OCR-Synthetic-Multilingual-v1},
  note      = {Synthetically generated multilingual OCR dataset built on a heavily modified SynthDoG pipeline}
}
```

Provider: maas
Created: 2026-04-18