aharley/rvl_cdip

Name: aharley/rvl_cdip
Creator: aharley
Published: 2023-05-02 09:06:16
License: No description available

Hugging Face · Updated 2023-05-02 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/aharley/rvl_cdip

Resource description:

---
annotations_creators:
- found
language_creators:
- found
language:
- en
license:
- other
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- extended|iit_cdip
task_categories:
- image-classification
task_ids:
- multi-class-image-classification
paperswithcode_id: rvl-cdip
pretty_name: RVL-CDIP
viewer: false
dataset_info:
  features:
  - name: image
    dtype: image
  - name: label
    dtype:
      class_label:
        names:
          '0': letter
          '1': form
          '2': email
          '3': handwritten
          '4': advertisement
          '5': scientific report
          '6': scientific publication
          '7': specification
          '8': file folder
          '9': news article
          '10': budget
          '11': invoice
          '12': presentation
          '13': questionnaire
          '14': resume
          '15': memo
  splits:
  - name: train
    num_bytes: 38816373360
    num_examples: 320000
  - name: test
    num_bytes: 4863300853
    num_examples: 40000
  - name: validation
    num_bytes: 4868685208
    num_examples: 40000
  download_size: 38779484559
  dataset_size: 48548359421
---

# Dataset Card for RVL-CDIP

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** [The RVL-CDIP Dataset](https://www.cs.cmu.edu/~aharley/rvl-cdip/)
- **Repository:**
- **Paper:** [Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval](https://arxiv.org/abs/1502.07058)
- **Leaderboard:** [RVL-CDIP leaderboard](https://paperswithcode.com/dataset/rvl-cdip)
- **Point of Contact:** [Adam W. Harley](mailto:aharley@cmu.edu)

### Dataset Summary

The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. The images are sized so their largest dimension does not exceed 1000 pixels.

### Supported Tasks and Leaderboards

- `image-classification`: The goal of this task is to classify a given document into one of 16 classes representing document types (letter, form, etc.). The leaderboard for this task is available [here](https://paperswithcode.com/sota/document-image-classification-on-rvl-cdip).

### Languages

All the classes and documents use English as their primary language.

## Dataset Structure

### Data Instances

A sample from the training set is provided below:

```
{
    'image': <PIL.TiffImagePlugin.TiffImageFile image mode=L size=754x1000 at 0x7F9A5E92CA90>,
    'label': 15
}
```

### Data Fields

- `image`: A `PIL.Image.Image` object containing a document.
- `label`: an `int` classification label.

<details>
  <summary>Class Label Mappings</summary>

```json
{
  "0": "letter",
  "1": "form",
  "2": "email",
  "3": "handwritten",
  "4": "advertisement",
  "5": "scientific report",
  "6": "scientific publication",
  "7": "specification",
  "8": "file folder",
  "9": "news article",
  "10": "budget",
  "11": "invoice",
  "12": "presentation",
  "13": "questionnaire",
  "14": "resume",
  "15": "memo"
}
```

</details>

### Data Splits

|   |train|test|validation|
|----------|----:|----:|---------:|
|# of examples|320000|40000|40000|

The dataset was split in proportions similar to those of ImageNet.
- 320000 images were used for training,
- 40000 images for validation, and
- 40000 images for testing.

## Dataset Creation

### Curation Rationale

From the paper:
> This work makes available a new labelled subset of the IIT-CDIP collection, containing 400,000 document images across 16 categories, useful for training new CNNs for document analysis.

### Source Data

#### Initial Data Collection and Normalization

The same as in the IIT-CDIP collection.

#### Who are the source language producers?

The same as in the IIT-CDIP collection.

### Annotations

#### Annotation process

The same as in the IIT-CDIP collection.

#### Who are the annotators?

The same as in the IIT-CDIP collection.

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

The dataset was curated by the authors: Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis.

### Licensing Information

RVL-CDIP is a subset of IIT-CDIP, which came from the [Legacy Tobacco Document Library](https://www.industrydocuments.ucsf.edu/tobacco/), for which license information can be found [here](https://www.industrydocuments.ucsf.edu/help/copyright/).

### Citation Information

```bibtex
@inproceedings{harley2015icdar,
    title = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},
    author = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis},
    booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})},
    year = {2015}
}
```

### Contributions

Thanks to [@dnaveenr](https://github.com/dnaveenr) for adding this dataset.
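
The split sizes and byte counts in the card's metadata can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers copied from the card; the helper names `total` and `split_fractions` are illustrative and not part of any library.

```python
# Values copied from the dataset card's YAML metadata.
SPLITS = {
    "train": {"num_examples": 320_000, "num_bytes": 38_816_373_360},
    "test": {"num_examples": 40_000, "num_bytes": 4_863_300_853},
    "validation": {"num_examples": 40_000, "num_bytes": 4_868_685_208},
}
NUM_CLASSES = 16
IMAGES_PER_CLASS = 25_000
DATASET_SIZE_BYTES = 48_548_359_421  # `dataset_size` in the card


def total(field):
    """Sum one metadata field (hypothetical helper) over all splits."""
    return sum(split[field] for split in SPLITS.values())


def split_fractions():
    """Return each split's share of the total number of examples."""
    n = total("num_examples")
    return {name: s["num_examples"] / n for name, s in SPLITS.items()}


if __name__ == "__main__":
    print("examples:", total("num_examples"))
    print("bytes:", total("num_bytes"))
    print("fractions:", split_fractions())
```

Under these numbers the three splits add up to the 400,000 images stated in the summary (16 classes of 25,000 images), the per-split byte counts add up to `dataset_size`, and the split is 80% train, 10% validation, 10% test.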

Provider:

aharley

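Original Information Summary

Dataset Overview

Dataset Name

Name: RVL-CDIP
Alias: Ryerson Vision Lab Complex Document Information Processing

Basic Dataset Information

Language: English
License: other
Multilinguality: monolingual
Size category: 100K<n<1M
Task category: image classification
Task ID: multi-class image classification
Papers with Code ID: rvl-cdip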

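Dataset Contents

Number of images: 400,000
Number of classes: 16
Training set size: 320,000 images
Test set size: 40,000 images
Validation set size: 40,000 images
Image size: largest dimension does not exceed 1000 pixels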

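Dataset Structure

Data instances: each instance contains an image and a label
Data fields:
- image: the document image, of type PIL.Image.Image
- label: the classification label, an integer corresponding to one of the 16 classes
Data splits:
- Training set: 320,000 images
- Test set: 40,000 images
- Validation set: 40,000 images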

Dataset Source

Source dataset: extended from IIT-CDIP

Considerations for Using the Dataset

License information: see the Legacy Tobacco Document Library
Citation information (BibTeX): @inproceedings{harley2015icdar, title = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval}, author = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis}, booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})}, year = {2015} }

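Curated Summary

Dataset Introduction

Construction

RVL-CDIP was built to meet the need for large-scale labelled document images in document image analysis. It is a subset of the IIT-CDIP collection and contains 400,000 grayscale images spread evenly over 16 document classes, with 25,000 samples per class. The images are sized so that their largest dimension does not exceed 1000 pixels. The split follows proportions similar to those of ImageNet: 320,000 images for training and 40,000 each for validation and testing.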
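
The size constraint above (largest dimension at most 1000 pixels) can be illustrated with a small helper. The source does not say how the original images were resized, so the sketch below is an assumption: it preserves the aspect ratio, never upscales, and uses a hypothetical function name, `fit_within`.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 1000) -> Tuple[int, int]:
    """Return a target size whose longest side is at most `max_side`.

    Assumption: the aspect ratio is preserved and images that already
    fit are left unchanged. The dataset's actual preprocessing is not
    documented in the card.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to whole pixels and keep each side at least 1 pixel.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a hypothetical 1508x2000 scan would map to 754x1000 under this rule, which matches the `size=754x1000` shown in the card's sample instance; a target size like this could then be passed to an image library's resize call.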

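Usage

The dataset is intended mainly for image classification. It can be loaded directly from the Hugging Face platform, and its predefined splits support training, validation, and testing. A typical workflow reads the images with a standard image library, converts class indices to readable names through the label mapping, and then builds a classifier. Model parameters are fitted on the training set, hyperparameters are tuned on the validation set, and final performance is measured on the test set. The public leaderboard can be used to compare results across methods.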
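
The workflow described above can be sketched as follows. The class names come from the dataset card; `label_to_name`, `name_to_label`, and `load_split` are illustrative helpers, not part of the dataset or of the `datasets` library. The loader assumes the Hugging Face `datasets` package is installed and that `load_dataset("aharley/rvl_cdip", split=...)` works with the installed version; depending on that version, extra arguments may be required, and the download is roughly 38.8 GB according to the card.

```python
# Class names in index order, as listed in the dataset card.
LABEL_NAMES = [
    "letter", "form", "email", "handwritten",
    "advertisement", "scientific report", "scientific publication",
    "specification", "file folder", "news article", "budget",
    "invoice", "presentation", "questionnaire", "resume", "memo",
]


def label_to_name(label: int) -> str:
    """Convert an integer class index (0-15) to its readable name."""
    if not 0 <= label < len(LABEL_NAMES):
        raise ValueError(f"label must be in [0, {len(LABEL_NAMES) - 1}]")
    return LABEL_NAMES[label]


def name_to_label(name: str) -> int:
    """Convert a class name back to its integer index."""
    return LABEL_NAMES.index(name)


def load_split(split: str = "train"):
    """Load one predefined split ("train", "validation", or "test").

    Assumption: the `datasets` package is installed. It is imported
    lazily so the label helpers above work without it.
    """
    from datasets import load_dataset

    return load_dataset("aharley/rvl_cdip", split=split)
```

A possible use, not executed here because of the download size, would be `sample = load_split("train")[0]` followed by `label_to_name(sample["label"])` to obtain a readable class name for the sample's image.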

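Background and Challenges

Background Overview

Recognizing and classifying document images efficiently and accurately is central to automated information processing. RVL-CDIP was created by the Ryerson Vision Lab in 2015; the main researchers are Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis. The dataset targets document image classification and retrieval, and its 400,000 labelled grayscale images in 16 classes provide a basis for evaluating deep learning models on document analysis tasks. Built on the IIT-CDIP collection, it has supported the use of convolutional neural networks for document understanding and has become an important benchmark for later work on document image recognition.

Current Challenges

The document image classification task that RVL-CDIP addresses poses several challenges. At the problem level, document images are diverse and complex: handwriting mixed with print, differing layouts, and uneven image quality all make it harder for a model to separate fine-grained classes. On the construction side, selecting and labelling 16 representative classes from the much larger IIT-CDIP collection requires class balance and consistent data, and image sizes have to be normalized to suit the input requirements of deep learning models. In addition, the data comes from a historical tobacco document library, so it may carry domain-specific biases that limit how well models generalize.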

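Common Scenarios

Classic Use Cases

As a benchmark in document image analysis, RVL-CDIP is used most often for document image classification. Its grayscale images cover 16 classes of common document types such as letters, forms, and emails, giving researchers a standardized evaluation platform. Deep convolutional neural networks trained on it learn visual features of documents and classify them automatically.

Academic Problems Addressed

The dataset addresses multi-class classification in document image analysis by supplying large-scale labelled data. It helps where traditional methods fall short on complex documents and has encouraged the use of deep learning for document recognition. Standardized evaluation lets researchers compare models directly, which supports work in information retrieval and digital archiving.

Practical Applications

In practice, RVL-CDIP can support automated classification in enterprise document management systems, for example sorting scanned documents into invoices, reports, or resumes, which reduces manual handling. It could also support document digitization in legal and medical settings by helping organizations archive and retrieve key information.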

Recent Research on the Dataset