jordyvl/rvl_cdip_easyocr

Name: jordyvl/rvl_cdip_easyocr
Creator: jordyvl
Published: 2023-10-20 18:43:34
License: No description available

Source: Hugging Face. Updated: 2023-10-20. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/jordyvl/rvl_cdip_easyocr

Resource overview:

---
annotations_creators:
- found
language_creators:
- found
language:
- en
license:
- other
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- extended|iit_cdip
task_categories:
- image-classification
task_ids:
- multi-class-image-classification
paperswithcode_id: rvl-cdip
pretty_name: RVL-CDIP-EasyOCR
dataset_info:
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: label
    dtype:
      class_label:
        names:
          '0': letter
          '1': form
          '2': email
          '3': handwritten
          '4': advertisement
          '5': scientific report
          '6': scientific publication
          '7': specification
          '8': file folder
          '9': news article
          '10': budget
          '11': invoice
          '12': presentation
          '13': questionnaire
          '14': resume
          '15': memo
  - name: words
    sequence: string
  - name: boxes
    sequence:
      sequence: int32
---

# Dataset Card for RVL-CDIP

## Extension

The data loader supports loading EasyOCR files together with the images. The OCR data is not included under `../data`, but is available on request by email at <firstname@contract.fit>.

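The declared features pair each document with a `words` sequence and a `boxes` sequence. A minimal sketch of a sanity check on one record follows. It assumes that words and boxes align one-to-one and that each box holds four integers; neither is stated in the card. The record values and the helper name `check_ocr_alignment` are hypothetical, and the `image` field is left out to keep the example self-contained.

```python
from typing import Any, Dict, List

# Hypothetical record shaped like the declared features (id, label, words, boxes).
# The values and the 4-integer box layout are illustrative assumptions.
example = {
    "id": "doc-0001",
    "label": 15,
    "words": ["INTER-OFFICE", "MEMO"],
    "boxes": [[10, 20, 180, 45], [190, 20, 260, 45]],
}


def check_ocr_alignment(record: Dict[str, Any], box_len: int = 4) -> int:
    """Return the number of OCR tokens after checking words/boxes alignment."""
    words: List[str] = record["words"]
    boxes: List[List[int]] = record["boxes"]
    # Assumed invariant: one box per recognized word.
    if len(words) != len(boxes):
        raise ValueError(f"{len(words)} words but {len(boxes)} boxes")
    # Assumed invariant: every box is a fixed-length list of integers.
    for i, box in enumerate(boxes):
        if len(box) != box_len or not all(isinstance(v, int) for v in box):
            raise ValueError(f"box {i} is not {box_len} integers: {box!r}")
    return len(words)


n_tokens = check_ocr_alignment(example)
print(n_tokens)
```

If the real OCR files store boxes in a different layout (for example, four corner points), `box_len` would need to change accordingly.
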
## Table of Contents

- [Dataset Card for RVL-CDIP](#dataset-card-for-rvl-cdip)
  - [Extension](#extension)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [The RVL-CDIP Dataset](https://www.cs.cmu.edu/~aharley/rvl-cdip/)
- **Repository:**
- **Paper:** [Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval](https://arxiv.org/abs/1502.07058)
- **Leaderboard:** [RVL-CDIP leaderboard](https://paperswithcode.com/dataset/rvl-cdip)
- **Point of Contact:** [Adam W. Harley](mailto:aharley@cmu.edu)

### Dataset Summary

The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. The images are sized so their largest dimension does not exceed 1000 pixels.

### Supported Tasks and Leaderboards

- `image-classification`: The goal of this task is to classify a given document into one of 16 classes representing document types (letter, form, etc.). The leaderboard for this task is available [here](https://paperswithcode.com/sota/document-image-classification-on-rvl-cdip).

### Languages

All the classes and documents use English as their primary language.

## Dataset Structure

### Data Instances

A sample from the training set is provided below:

```
{
    'image': <PIL.TiffImagePlugin.TiffImageFile image mode=L size=754x1000 at 0x7F9A5E92CA90>,
    'label': 15
}
```

### Data Fields

- `image`: A `PIL.Image.Image` object containing a document.
- `label`: an `int` classification label.

<details>
<summary>Class Label Mappings</summary>

```json
{
  "0": "letter",
  "1": "form",
  "2": "email",
  "3": "handwritten",
  "4": "advertisement",
  "5": "scientific report",
  "6": "scientific publication",
  "7": "specification",
  "8": "file folder",
  "9": "news article",
  "10": "budget",
  "11": "invoice",
  "12": "presentation",
  "13": "questionnaire",
  "14": "resume",
  "15": "memo"
}
```

</details>

### Data Splits

|               | train  | test  | validation |
|---------------|-------:|------:|-----------:|
| # of examples | 320000 | 40000 | 40000      |

The dataset was split in proportions similar to those of ImageNet.

- 320000 images were used for training,
- 40000 images for validation, and
- 40000 images for testing.

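The integer `label` field only becomes readable through the class label mapping given under Data Fields. A minimal sketch of both lookup directions follows; the names `CLASS_NAMES`, `id2label`, and `label2id` are illustrative choices here and are not taken from the data loader.

```python
# Class names in label-id order, copied from the card's class label mapping.
CLASS_NAMES = [
    "letter", "form", "email", "handwritten", "advertisement",
    "scientific report", "scientific publication", "specification",
    "file folder", "news article", "budget", "invoice",
    "presentation", "questionnaire", "resume", "memo",
]

# Integer id -> class name, and the inverse lookup.
id2label = dict(enumerate(CLASS_NAMES))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[15])         # memo
print(label2id["invoice"])  # 11
```

With this mapping, the sample shown under Data Instances (`'label': 15`) reads as a memo.
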
## Dataset Creation

### Curation Rationale

From the paper:

> This work makes available a new labelled subset of the IIT-CDIP collection, containing 400,000 document images across 16 categories, useful for training new CNNs for document analysis.

### Source Data

#### Initial Data Collection and Normalization

The same as in the IIT-CDIP collection.

#### Who are the source language producers?

The same as in the IIT-CDIP collection.

### Annotations

#### Annotation process

The same as in the IIT-CDIP collection.

#### Who are the annotators?

The same as in the IIT-CDIP collection.

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

The dataset was curated by the authors: Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis.

### Licensing Information

RVL-CDIP is a subset of IIT-CDIP, which came from the [Legacy Tobacco Document Library](https://www.industrydocuments.ucsf.edu/tobacco/), for which license information can be found [here](https://www.industrydocuments.ucsf.edu/help/copyright/).

### Citation Information

```bibtex
@inproceedings{harley2015icdar,
    title = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},
    author = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis},
    booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})},
    year = {2015}
}
```

### Contributions

Thanks to [@dnaveenr](https://github.com/dnaveenr) for adding this dataset.

Provider:

jordyvl

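Summary of Original Information

Dataset Overview

Dataset Name

Name: RVL-CDIP-EasyOCR
Alias: RVL-CDIP

Basic Information

Language: English (en)
Multilinguality: monolingual
License: other
Size category: 100K<n<1M
Source dataset: extended from iit_cdip
Task category: image classification
Task ID: multi-class image classification
paperswithcode ID: rvl-cdip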

Dataset Features

id: string
image: image
label: class label, with the following classes:
- 0: letter
- 1: form
- 2: email
- 3: handwritten
- 4: advertisement
- 5: scientific report
- 6: scientific publication
- 7: specification
- 8: file folder
- 9: news article
- 10: budget
- 11: invoice
- 12: presentation
- 13: questionnaire
- 14: resume
- 15: memo
words: sequence of strings
boxes: sequence of integer sequences

Dataset Structure

Data instances: contain the image and label fields
Data fields:
- image: a PIL.Image.Image object
- label: an integer classification label

Data Splits

Training set: 320,000 images
Validation set: 40,000 images
Test set: 40,000 images
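
As a quick arithmetic check on the split sizes, the sketch below derives the total, each split's share, and the images per class. The figure of 16 classes comes from the dataset card; the variable names are illustrative.

```python
from fractions import Fraction

# Split sizes as listed above; 16 classes per the dataset card.
SPLIT_SIZES = {"train": 320_000, "validation": 40_000, "test": 40_000}
NUM_CLASSES = 16

total = sum(SPLIT_SIZES.values())
# Exact share of each split, kept as a fraction to avoid rounding.
shares = {name: Fraction(size, total) for name, size in SPLIT_SIZES.items()}
# Images per class across the whole dataset.
per_class = total // NUM_CLASSES

print(total)            # 400000
print(shares["train"])  # 4/5
print(per_class)        # 25000
```

The result is an 80/10/10 split, and 25,000 images per class agrees with the figure in the dataset summary.
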

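Dataset Use

Task: image classification
Goal: classify each document image into one of 16 classes

Dataset Source

Source: extended from the IIT-CDIP dataset
Initial data collection: same as the IIT-CDIP dataset
Annotators: same as the IIT-CDIP dataset

License Information

License: see the copyright information of the Legacy Tobacco Document Library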

Citation Information

@inproceedings{harley2015icdar,
    title = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},
    author = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis},
    booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})},
    year = {2015}
}
