BigDocs-7.5M

Name: BigDocs-7.5M
Creator: ServiceNow
Published: 2024-12-06 05:41:20
License: 暂无描述

arXiv2024-12-06 更新2024-12-10 收录

下载链接：

https://bigdocs.github.io

下载链接

链接失效反馈

官方服务：

资源简介：

BigDocs-7.5M是由ServiceNow等机构创建的一个大规模、开放访问的多模态文档数据集，包含750万条图像-文本对，涵盖30个任务。数据集通过高效的数据筛选和处理流程确保高质量和许可宽松。创建过程强调透明度和可追溯性，通过过滤规则、可追溯的元数据和内容分析确保数据质量。该数据集主要应用于文档理解和代码生成任务，旨在解决现有数据集访问受限和许可限制的问题，促进多模态AI技术的发展。

BigDocs-7.5M is a large-scale, openly accessible multimodal document dataset created by ServiceNow and other institutions. It contains 7.5 million image-text pairs covering 30 tasks. The dataset ensures high data quality and permissive licensing through efficient data filtering and processing workflows. Its creation process emphasizes transparency and traceability, with data quality guaranteed via filtering rules, traceable metadata and content analysis. This dataset is primarily applied to document understanding and code generation tasks, aiming to address the issues of restricted access and licensing limitations of existing datasets, and advance the development of multimodal AI technologies.

提供机构：

ServiceNow

创建时间：

2024-12-06

搜集汇总

数据集介绍

构建方式

BigDocs-7.5M is meticulously curated from 133 publicly available datasets, with a rigorous filtering process ensuring only those with permissive licenses are included. The dataset comprises 7.5 million image-text pairs, sourced from 16 fully accessible datasets and augmented with crawled data from permissive websites and academic papers. The curation process involves OCR extraction, HTML/SVG/MD curation, content filtering, VQA format conversion, and text-image alignment, ensuring a high-quality, license-permissive dataset suitable for multimodal document understanding tasks.

特点

BigDocs-7.5M stands out for its comprehensive coverage of 30 tasks across document information extraction, understanding, and creation/manipulation. The dataset emphasizes accountability, responsibility, and transparency through detailed metadata and careful content analysis. It prioritizes datasets with permissive licenses, ensuring full public accessibility and commercial use. Additionally, the BigDocs Toolkit offers modular tools for data preprocessing, filtering, and consolidation, enhancing dataset traceability and usability.

使用方法

BigDocs-7.5M is designed for continual pretraining and downstream finetuning of multimodal models on document-related tasks. The dataset supports three core areas: document information extraction, document understanding, and document creation/manipulation. Researchers and practitioners can utilize the BigDocs Toolkit for efficient data curation, filtering, and preparation for model training. The dataset and toolkit are openly available, fostering community collaboration and responsible AI development in document intelligence.

背景与挑战

背景概述

BigDocs-7.5M, introduced by a collaborative effort involving ServiceNow, Mila, École de Technologie Supérieure, and other institutions, is a pioneering dataset designed to address the limitations of multimodal AI models in document and code-related tasks. Created in response to the growing demand for sophisticated document understanding technologies, BigDocs-7.5M comprises 7.5 million multimodal documents across 30 tasks. The dataset emphasizes accountability, responsibility, and transparency through a meticulous data curation process that includes filtering rules, traceable metadata, and careful content analysis. By providing a high-quality, open-access resource, BigDocs-7.5M aims to bridge the gap in training data and restrictive licensing, fostering advancements in multimodal capabilities and document reasoning for both academic and open-source communities.

当前挑战

The creation of BigDocs-7.5M presents several significant challenges. Firstly, the dataset addresses the critical need for high-quality, permissive-licensed training data in multimodal AI, which has been a limiting factor in the commercial application of these technologies. Secondly, the curation process involves rigorous filtering and quality control to ensure the dataset's integrity and usability. This includes managing data contamination, ensuring licensing compliance, and developing a unified metadata framework for enhanced traceability. Additionally, the dataset introduces novel downstream tasks that require generating long-structured code outputs from images, which poses challenges in model training and evaluation. The success of BigDocs-7.5M hinges on overcoming these challenges to provide a robust resource for advancing document intelligence and multimodal AI capabilities.

常用场景

经典使用场景

BigDocs-7.5M 数据集的经典使用场景主要集中在多模态模型的训练上，特别是在文档和代码任务中。该数据集通过提供高质量、开放访问的文档数据，支持模型在处理收据、理解工作流程、从文档中提取数据以及生成报告等任务中的表现。此外，BigDocs-7.5M 还支持代码生成任务，这些任务需要生成结构化的长输出，如 HTML 和 LaTeX。

解决学术问题

BigDocs-7.5M 数据集解决了多模态 AI 在文档理解任务中的常见学术研究问题，如光学字符识别（OCR）、文档分类、问答系统以及文档信息提取等。通过提供大规模、高质量的开放数据集，BigDocs-7.5M 促进了学术界和开源社区在多模态能力提升和文档推理方面的研究，推动了相关领域的发展。

衍生相关工作

BigDocs-7.5M 数据集的发布催生了一系列相关的经典工作，包括在多模态模型训练、文档理解任务中的新方法和技术的研究。例如，BigDocs-Bench 基准测试套件的引入，为评估多模态模型在生成长结构化代码输出方面的能力提供了新的标准。此外，BigDocs 工具包的开发也为数据预处理、过滤和整合提供了模块化工具，进一步推动了多模态数据集的研究和应用。

以上内容由遇见数据集搜集并总结生成

5,000+

优质数据集

54 个

任务类型

进入经典数据集