
RVL-CDIP MP, RVL-CDIP-N MP

Source: arXiv. Updated: 2023-10-31. Indexed: 2024-06-21.
Download links:
https://huggingface.co/datasets/bdpc/rvl_cdip_mp, https://huggingface.co/datasets/bdpc/rvl_cdip_n_mp
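
As a minimal sketch of how these links might be used, the snippet below converts each URL into a Hugging Face repository ID and opens it with the `datasets` library. The helper names `repo_id_from_url` and `open_dataset` are hypothetical, and the default split name `"test"` is an assumption that should be checked against each dataset card; the repositories may also require extra arguments (such as a configuration name) that are not shown here.

```python
from urllib.parse import urlparse

# The two download links listed above.
DATASET_URLS = [
    "https://huggingface.co/datasets/bdpc/rvl_cdip_mp",
    "https://huggingface.co/datasets/bdpc/rvl_cdip_n_mp",
]


def repo_id_from_url(url: str) -> str:
    """Turn a Hugging Face dataset URL into an 'owner/name' repository ID."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def open_dataset(url: str, split: str = "test"):
    """Stream one split of the dataset (hypothetical helper).

    The split name is an assumption; check the dataset card for the
    splits that actually exist.
    """
    # Imported lazily so the URL helper works without the library installed.
    from datasets import load_dataset  # pip install datasets

    # streaming=True avoids downloading the full dataset up front.
    return load_dataset(repo_id_from_url(url), split=split, streaming=True)
```

Calling `open_dataset(DATASET_URLS[0])` would then return an iterable over samples, assuming the split exists and network access is available.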
Description:
RVL-CDIP MP and RVL-CDIP-N MP are two large-scale multi-page document classification datasets created by a research team at KU Leuven. They address a limitation of existing document classification datasets, which provide only single-page images, by supplying complete multi-page documents that better reflect real-world use. The datasets cover a range of document types, such as letters, forms, and emails, and total approximately 400,000 samples. To build them, the team used OCR-IDL metadata to match and retrieve the original source documents. Intended applications include automated document processing, information extraction, and classification, with the aim of improving the efficiency and accuracy of document processing.
Provider:
KU Leuven
Created:
2023-08-25