Infinity-Doc-400K

Source: ModelScope community | Updated: 2026-04-30 | Indexed: 2025-11-08
Download link: https://modelscope.cn/datasets/infly/Infinity-Doc-400K
# Infinity-Doc-400K

<div align="left">
💻 <a href="https://github.com/infly-ai/INF-MLLM">GitHub</a> | 🤗 <a href="https://huggingface.co/infly/Infinity-Parser-7B">Model</a> | 📄 <a href="https://arxiv.org/pdf/2506.03197">Paper</a> | 🚀 <a href="https://huggingface.co/spaces/infly/Infinity-Parser-Demo">Demo</a>
</div>

# Overview

Infinity-Doc-400K is an extended version of Infinity-Doc-55K, comprising 400K real-world and synthetic scanned documents. The dataset features rich layout variations and comprehensive structural annotations, enabling robust training of document parsing models. It covers a broad spectrum of document types: financial reports, medical reports, academic papers, books, magazines, web pages, and synthetic documents.

![Image](assets/dataset_illustration.png)

# Data Construction Pipeline

To construct a comprehensive dataset for document parsing, we integrate a real-world and a synthetic data generation pipeline.

The real-world pipeline collects diverse scanned documents from practical domains (such as financial reports, medical records, and academic papers). It employs a multi-expert strategy with cross-validation to generate reliable pseudo-ground-truth annotations for structural elements such as text, tables, and formulas.

The synthetic pipeline programmatically creates a wide array of documents by injecting content from sources such as Wikipedia into predefined HTML layouts, rendering them into scanned formats, and extracting precise ground-truth annotations directly from the original HTML.

This dual approach yields a rich, diverse, and cost-effective dataset with accurate and well-aligned supervision. It overcomes the imprecise or inconsistent labeling common in other datasets and enables robust training of end-to-end document parsing models.
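The synthetic pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `LAYOUT` template, the `MarkdownExtractor` class, and the `synthesize` function are hypothetical names, only `h1` and `p` elements are handled, and the rendering step that turns the HTML into a scanned image is omitted. The sketch only shows why ground truth taken from the source HTML is aligned with the page by construction.

```python
from html import escape
from html.parser import HTMLParser
from string import Template

# Hypothetical layout; the templates used by the real pipeline are not described here.
LAYOUT = Template("<html><body><h1>$title</h1><p>$body</p></body></html>")


class MarkdownExtractor(HTMLParser):
    """Collects the text of h1 and p elements and emits Markdown blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished Markdown blocks, in document order
        self._tag = None   # element currently being read ("h1" or "p"), else None
        self._text = []    # text fragments of the current element

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = "".join(self._text).strip()
            prefix = "# " if tag == "h1" else ""
            self.blocks.append(prefix + text)
            self._tag = None


def synthesize(title, body):
    """Return (page_html, markdown_ground_truth) for one synthetic page."""
    # Escape the injected content so it cannot break the layout markup.
    page = LAYOUT.substitute(title=escape(title), body=escape(body))
    parser = MarkdownExtractor()
    parser.feed(page)
    parser.close()
    return page, "\n\n".join(parser.blocks)


page, gt = synthesize("Sample title", "Sample paragraph text.")
print(gt)
```

In a full pipeline, `page` would additionally be rendered to an image and paired with `gt`; that step depends on a rendering tool and is left out of this sketch.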
![Image](assets/data_construction_pipeline.png)

# Data Statistics

| Document Type | Number of Samples | BBox | Data Source |
| :---: | :---: | :---: | :---: |
| Academic Papers | 70,057 | ✅ | Web |
| Books | 10,526 | | Web |
| Financial Reports | 59,645 | ✅ | Web |
| Magazines | 174,589 | ✅ | Web |
| Medical Reports | 5,000 | | Web |
| Synthetic Documents | 61,965 | ✅ | CC3M + Web + Wiki |
| Web Pages | 4,999 | | Web |
| All | 386,781 | | |

# Data Structure

- id: The MD5 hash of the image, which serves as its unique identifier.
- image: The document image.
- gt: The content of the document, formatted in Markdown/HTML.
- bbox: The bounding box and category of elements in the document.
- attributes: Metadata describing the document type and task category.

# Citation

```
@misc{wang2025infinityparserlayoutaware,
  title={Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing},
  author={Baode Wang and Biao Wu and Weizhen Li and Meng Fang and Yanjie Liang and Zuming Huang and Haozhe Wang and Jun Huang and Ling Chen and Wei Chu and Yuan Qi},
  year={2025},
  eprint={2506.03197},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.03197},
}
```

# License

This dataset is licensed under CC-BY-NC-SA-4.0.
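The fields listed under Data Structure can be modelled as a small record type. The following is a minimal sketch under stated assumptions, not the dataset's official loader: `DocRecord` and `id_matches_image` are hypothetical names, the image is assumed to be available as its raw encoded bytes, and the MD5 is assumed to be computed over those bytes and stored as a hexadecimal string. The exact types of `bbox` and `attributes` are not specified in this document, so they are kept generic.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DocRecord:
    """One sample, mirroring the documented fields (types are assumptions)."""
    id: str                                          # MD5 hash of the image
    image: bytes                                     # document image, assumed raw encoded bytes
    gt: str                                          # document content in Markdown/HTML
    bbox: list = field(default_factory=list)         # element boxes and categories
    attributes: dict = field(default_factory=dict)   # document type, task category


def id_matches_image(record: DocRecord) -> bool:
    """Check that the record id equals the MD5 hex digest of the image bytes.

    Assumes the hash was taken over the same bytes that are stored in `image`.
    """
    return hashlib.md5(record.image).hexdigest() == record.id.lower()


# Usage with placeholder bytes standing in for a real image file.
sample = DocRecord(
    id=hashlib.md5(b"placeholder image bytes").hexdigest(),
    image=b"placeholder image bytes",
    gt="# Title\n\nBody text.",
)
print(id_matches_image(sample))
```

If the stored hash was computed over decoded pixels rather than file bytes, the check would need to hash that representation instead.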

Provider: maas
Created: 2025-10-31