Infinity-Doc-55K
Source: ModelScope community. Updated 2026-01-02; indexed 2025-11-08.
Download link: https://modelscope.cn/datasets/infly/Infinity-Doc-55K
# Infinity-Doc-55K
<a href="https://www.arxiv.org/pdf/2506.03197"><img src="assets/logo.png" height="16" width="16" style="display: inline"><b> Paper </b></a> |
<a href="https://github.com/infly-ai/INF-MLLM"><img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" height="16" width="16" style="display: inline"><b> Github </b></a> |
<a href="https://huggingface.co/spaces/infly/Infinity-Parser-Demo">💬<b> Web Demo </b></a>
# Overview
Infinity-Doc-55K is a high-quality, diverse full-text parsing dataset comprising 55K real-world and synthetic scanned documents. It features rich layout variations and comprehensive structural annotations, enabling robust training of document parsing models. It covers a broad spectrum of document types: financial reports, medical reports, academic papers, books, magazines, web pages, and synthetic documents.

# Data Construction Pipeline
To build a comprehensive dataset for document parsing, we combine a real-world data pipeline with a synthetic one. The real-world pipeline collects diverse scanned documents from practical domains (such as financial reports, medical records, and academic papers) and uses a multi-expert strategy with cross-validation to generate reliable pseudo-ground-truth annotations for structural elements such as text, tables, and formulas. The synthetic pipeline programmatically creates a wide range of documents by injecting content from sources such as Wikipedia into predefined HTML layouts, rendering them into scanned formats, and extracting precise ground-truth annotations directly from the original HTML. Together, the two pipelines yield a rich, diverse, and cost-effective dataset with accurate, well-aligned supervision. This addresses the imprecise or inconsistent labeling found in other datasets and supports robust training of end-to-end document parsing models.
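The synthetic pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `LAYOUT` template, the `GroundTruthExtractor` class, and the `build_sample` helper are hypothetical names, and the rendering step (HTML to a scanned-style image) is left out. The sketch only shows the idea of filling a predefined HTML layout and deriving the Markdown/HTML ground truth from that same HTML, so the label is aligned with the page by construction.

```python
from html import escape
from html.parser import HTMLParser
from string import Template

# Hypothetical layout; the real templates are not described in this card.
LAYOUT = Template(
    "<html><body><h1>$title</h1><p>$body</p><table>$rows</table></body></html>"
)


class GroundTruthExtractor(HTMLParser):
    """Turns headings and paragraphs into Markdown and keeps tables as HTML."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks, in reading order
        self._buffer = []       # fragments of the block being built
        self._in_table = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._in_table = True
            self._buffer = []
        if self._in_table:
            self._buffer.append(f"<{tag}>")
        elif tag in ("h1", "p"):
            self._buffer = []

    def handle_data(self, data):
        # Table cells are re-emitted as HTML, so their text is escaped again.
        self._buffer.append(escape(data) if self._in_table else data)

    def handle_endtag(self, tag):
        if self._in_table:
            self._buffer.append(f"</{tag}>")
            if tag == "table":
                self.blocks.append("".join(self._buffer))
                self._in_table = False
                self._buffer = []
        elif tag == "h1":
            self.blocks.append("# " + "".join(self._buffer).strip())
            self._buffer = []
        elif tag == "p":
            self.blocks.append("".join(self._buffer).strip())
            self._buffer = []


def build_sample(title, body, table_rows):
    """Fill the layout with content and derive the ground truth from the HTML."""
    rows_html = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in table_rows
    )
    page = LAYOUT.substitute(title=escape(title), body=escape(body), rows=rows_html)
    extractor = GroundTruthExtractor()
    extractor.feed(page)
    extractor.close()
    return page, "\n\n".join(extractor.blocks)


page, gt = build_sample("Revenue", "Q1 & Q2 results", [["Q1", "10"], ["Q2", "12"]])
print(gt)
```

In a full pipeline, `page` would then be rendered to an image and paired with `gt`. Because both come from the same source, no manual labeling or OCR is needed for the synthetic portion.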

# Data Statistics
| Document Type | Number of Samples | Data Source |
| :---: | :---: | :---: |
| Synthetic Documents | 6.5k | CC3M + Web + Wiki |
| Financial Reports | 16.1k | Web |
| Medical Reports | 5k | Web |
| Academic Papers | 8.9k | Web |
| Books | 10.5k | Web |
| Magazines | 3k | Web |
| Web Pages | 5k | Web |
| All | 55k | |
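As a quick sanity check on the table, the per-type counts can be summed and converted to shares of the whole, which is useful when choosing sampling weights for a balanced training mix. This is a small sketch: the counts are copied from the table (in thousands), while the `COUNTS_K` and `shares` names are illustrative only.

```python
# Sample counts in thousands, copied from the statistics table.
COUNTS_K = {
    "Synthetic Documents": 6.5,
    "Financial Reports": 16.1,
    "Medical Reports": 5.0,
    "Academic Papers": 8.9,
    "Books": 10.5,
    "Magazines": 3.0,
    "Web Pages": 5.0,
}


def shares(counts):
    """Return the total and each type's share of it, in percent."""
    total = sum(counts.values())
    return total, {name: round(100 * n / total, 1) for name, n in counts.items()}


total_k, pct = shares(COUNTS_K)
for name, value in sorted(pct.items(), key=lambda item: -item[1]):
    print(f"{name:<20} {value:5.1f}%")
print(f"Total: {total_k:.1f}k")
```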
# Data Structure
- id: The MD5 hash of the image, which serves as its unique identifier.
- image: The document image.
- gt: The content of the document, formatted in Markdown/HTML.
- attributes: Metadata describing the document type and task category.
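The fields above suggest a simple integrity check for each record. The sketch below rests on assumptions the card does not confirm: that `id` is the hexadecimal MD5 digest of the raw image file bytes, and that `attributes` is a dictionary. The `DocRecord` class, the `validate` function, and the `doc_type` key are hypothetical names used only for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class DocRecord:
    id: str            # assumed: hex MD5 digest of the raw image bytes
    image: bytes       # the document image, here as raw file bytes
    gt: str            # document content in Markdown/HTML
    attributes: dict   # assumed: dictionary of metadata


def md5_of(image_bytes: bytes) -> str:
    """Hex MD5 digest of the image bytes."""
    return hashlib.md5(image_bytes).hexdigest()


def validate(record: DocRecord) -> list:
    """Return a list of problems found in the record (empty if none)."""
    problems = []
    if record.id != md5_of(record.image):
        problems.append("id does not match the MD5 of the image bytes")
    if not record.gt.strip():
        problems.append("gt is empty")
    if not isinstance(record.attributes, dict):
        problems.append("attributes is not a dictionary")
    return problems


image = b"placeholder bytes standing in for an image file"
good = DocRecord(md5_of(image), image, "# Title\n\nBody text.", {"doc_type": "books"})
bad = DocRecord("0" * 32, image, "   ", {"doc_type": "books"})
print(validate(good))
print(validate(bad))
```

If the published hashes are computed differently (for example over decoded pixels rather than file bytes), the first check would need to be adapted.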
# Citation
```
@misc{wang2025infinityparserlayoutaware,
title={Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing},
author={Baode Wang and Biao Wu and Weizhen Li and Meng Fang and Yanjie Liang and Zuming Huang and Haozhe Wang and Jun Huang and Ling Chen and Wei Chu and Yuan Qi},
year={2025},
eprint={2506.03197},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.03197},
}
```
# License
This dataset is licensed under CC BY-NC-SA 4.0 (`cc-by-nc-sa-4.0`).
Provider: maas
Created: 2025-10-31



