jordyvl/rvl_cdip_easyocr
Hugging Face | Updated 2023-10-20 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/jordyvl/rvl_cdip_easyocr
Resource description:
---
annotations_creators:
- found
language_creators:
- found
language:
- en
license:
- other
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- extended|iit_cdip
task_categories:
- image-classification
task_ids:
- multi-class-image-classification
paperswithcode_id: rvl-cdip
pretty_name: RVL-CDIP-EasyOCR
dataset_info:
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: label
    dtype:
      class_label:
        names:
          '0': letter
          '1': form
          '2': email
          '3': handwritten
          '4': advertisement
          '5': scientific report
          '6': scientific publication
          '7': specification
          '8': file folder
          '9': news article
          '10': budget
          '11': invoice
          '12': presentation
          '13': questionnaire
          '14': resume
          '15': memo
  - name: words
    sequence: string
  - name: boxes
    sequence:
      sequence: int32
---
# Dataset Card for RVL-CDIP
## Extension
The data loader supports loading EasyOCR files together with the images.
The OCR data is not included under '../data', but it is available on request by email at <firstname@contract.fit>.
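A minimal sketch of how a record with OCR output might be consumed, assuming the `words` and `boxes` features declared in the metadata are parallel lists with one box per word. The helper names `load_split` and `pair_words_and_boxes` and the sample record are hypothetical. The four-integer box layout is also an assumption, since the card does not document it. The `load_split` wrapper calls `datasets.load_dataset` with the repository id from this page; it is not executed here, and the card does not state whether the loader needs extra arguments or locally available OCR files.

```python
from typing import Any, Dict, List, Tuple


def load_split(split: str = "train"):
    """Stream one split from the Hub (needs the `datasets` package and network access)."""
    from datasets import load_dataset  # imported lazily so the helper below works without it

    # Repository id taken from this page; streaming avoids downloading everything up front.
    return load_dataset("jordyvl/rvl_cdip_easyocr", split=split, streaming=True)


def pair_words_and_boxes(example: Dict[str, Any]) -> List[Tuple[str, List[int]]]:
    """Zip OCR words with their boxes, assuming the two lists are parallel."""
    words, boxes = example["words"], example["boxes"]
    if len(words) != len(boxes):
        raise ValueError(
            f"expected one box per word, got {len(words)} words and {len(boxes)} boxes"
        )
    return list(zip(words, boxes))


# Hypothetical record shaped like the declared features (values are made up).
sample = {
    "id": "example-0",
    "label": 15,
    "words": ["INTER-OFFICE", "MEMO"],
    "boxes": [[10, 12, 180, 40], [190, 12, 260, 40]],
}
pairs = pair_words_and_boxes(sample)
```

Raising on a length mismatch, instead of silently truncating as `zip` would, makes incomplete OCR records visible early.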
## Table of Contents
- [Dataset Card for RVL-CDIP](#dataset-card-for-rvl-cdip)
  - [Extension](#extension)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [The RVL-CDIP Dataset](https://www.cs.cmu.edu/~aharley/rvl-cdip/)
- **Repository:**
- **Paper:** [Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval](https://arxiv.org/abs/1502.07058)
- **Leaderboard:** [RVL-CDIP leaderboard](https://paperswithcode.com/dataset/rvl-cdip)
- **Point of Contact:** [Adam W. Harley](mailto:aharley@cmu.edu)
### Dataset Summary
The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. The images are sized so their largest dimension does not exceed 1000 pixels.
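The size constraint can be illustrated with a small sketch. This is not the authors' preprocessing code, which the card does not describe; it only shows one way to compute a target size whose largest dimension does not exceed 1000 pixels while preserving aspect ratio. The function name `fit_within` and the 1508x2000 input size are hypothetical.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 1000) -> Tuple[int, int]:
    """Scale (width, height) down, keeping aspect ratio, so neither side exceeds max_side."""
    largest = max(width, height)
    if largest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / largest
    # Round to whole pixels and keep each side at least 1 pixel.
    return max(1, round(width * scale)), max(1, round(height * scale))


# A hypothetical 1508x2000 scan is halved to 754x1000.
new_size = fit_within(1508, 2000)
```

With an imaging library such as Pillow, the resulting tuple could be passed to `Image.resize`.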
### Supported Tasks and Leaderboards
- `image-classification`: The goal of this task is to classify a given document into one of 16 classes representing document types (letter, form, etc.). The leaderboard for this task is available [here](https://paperswithcode.com/sota/document-image-classification-on-rvl-cdip).
### Languages
All the classes and documents use English as their primary language.
## Dataset Structure
### Data Instances
A sample from the training set is provided below:
```
{
  'image': <PIL.TiffImagePlugin.TiffImageFile image mode=L size=754x1000 at 0x7F9A5E92CA90>,
  'label': 15
}
```
### Data Fields
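- `id`: a `string` identifier, declared in the dataset metadata.
- `image`: a `PIL.Image.Image` object containing a document.
- `label`: an `int` classification label.
- `words`: a sequence of `string` values, declared in the dataset metadata (presumably the words recognized by EasyOCR).
- `boxes`: a sequence of `int32` sequences, declared in the dataset metadata (presumably one bounding box per word).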
<details>
<summary>Class Label Mappings</summary>
```json
{
  "0": "letter",
  "1": "form",
  "2": "email",
  "3": "handwritten",
  "4": "advertisement",
  "5": "scientific report",
  "6": "scientific publication",
  "7": "specification",
  "8": "file folder",
  "9": "news article",
  "10": "budget",
  "11": "invoice",
  "12": "presentation",
  "13": "questionnaire",
  "14": "resume",
  "15": "memo"
}
```
</details>
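As a small illustration of the class label mapping, the sketch below converts between integer labels and class names. The names `CLASS_NAMES`, `id_to_name`, and `name_to_id` are hypothetical helpers and not part of the dataset loader.

```python
from typing import List

# Class names in label-id order, copied from the mapping above.
CLASS_NAMES: List[str] = [
    "letter", "form", "email", "handwritten", "advertisement",
    "scientific report", "scientific publication", "specification",
    "file folder", "news article", "budget", "invoice",
    "presentation", "questionnaire", "resume", "memo",
]


def id_to_name(label: int) -> str:
    """Return the class name for an integer label in the range 0..15."""
    if not 0 <= label < len(CLASS_NAMES):
        raise ValueError(f"label must be between 0 and {len(CLASS_NAMES) - 1}, got {label}")
    return CLASS_NAMES[label]


def name_to_id(name: str) -> int:
    """Return the integer label for a class name (raises ValueError if unknown)."""
    return CLASS_NAMES.index(name)


# The training sample shown earlier has label 15.
sample_class = id_to_name(15)
```

When the dataset is loaded through the `datasets` library, the `ClassLabel` feature should offer the same lookups through its `int2str` and `str2int` methods.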
### Data Splits
| |train|test|validation|
|----------|----:|----:|---------:|
|# of examples|320000|40000|40000|

The dataset was split in proportions similar to those of ImageNet.
- 320000 images were used for training,
- 40000 images for validation, and
- 40000 images for testing.
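The split sizes can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in this card; the variable names are illustrative.

```python
# Split sizes as stated in the card.
SPLITS = {"train": 320_000, "validation": 40_000, "test": 40_000}
NUM_CLASSES = 16

total = sum(SPLITS.values())  # 400,000 images overall
fractions = {name: count / total for name, count in SPLITS.items()}
per_class = total // NUM_CLASSES  # 25,000 images per class across the whole dataset
```

This gives an 80/10/10 train/validation/test split. The per-class figure applies to the dataset as a whole; the card does not state how classes are distributed within each split.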
## Dataset Creation
### Curation Rationale
From the paper:
> This work makes available a new labelled subset of the IIT-CDIP collection, containing 400,000 document images across 16 categories, useful for training new CNNs for document analysis.
### Source Data
#### Initial Data Collection and Normalization
The same as in the IIT-CDIP collection.
#### Who are the source language producers?
The same as in the IIT-CDIP collection.
### Annotations
#### Annotation process
The same as in the IIT-CDIP collection.
#### Who are the annotators?
The same as in the IIT-CDIP collection.
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
The dataset was curated by the authors of the paper: Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis.
### Licensing Information
RVL-CDIP is a subset of IIT-CDIP, which came from the [Legacy Tobacco Documents Library](https://www.industrydocuments.ucsf.edu/tobacco/). License information for that library can be found [here](https://www.industrydocuments.ucsf.edu/help/copyright/).
### Citation Information
```bibtex
@inproceedings{harley2015icdar,
  title = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},
  author = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis},
  booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})},
  year = {2015}
}
```
### Contributions
Thanks to [@dnaveenr](https://github.com/dnaveenr) for adding this dataset.
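Provided by:
jordyvl
Summary of original information
Dataset overview
Dataset name
- Name: RVL-CDIP-EasyOCR
- Alias: RVL-CDIP
Basic dataset information
- Language: English (en)
- Multilinguality: monolingual
- License: other
- Size category: 100K<n<1M
- Source dataset: extended from iit_cdip
- Task category: image classification
- Task ID: multi-class image classification
- paperswithcode ID: rvl-cdip
Dataset features
- id: string
- image: image
- label: class label, with the following classes:
  - 0: letter
  - 1: form
  - 2: email
  - 3: handwritten
  - 4: advertisement
  - 5: scientific report
  - 6: scientific publication
  - 7: specification
  - 8: file folder
  - 9: news article
  - 10: budget
  - 11: invoice
  - 12: presentation
  - 13: questionnaire
  - 14: resume
  - 15: memo
- words: sequence of strings
- boxes: sequence of integer sequences
Dataset structure
- Data instances: contain the image and label fields
- Data fields:
  - image: PIL.Image.Image object
  - label: integer classification label
Data splits
- Training set: 320,000 images
- Validation set: 40,000 images
- Test set: 40,000 images
Dataset usage
- Task: image classification
- Goal: classify each document image into one of 16 classes
Dataset source
- Source: extended from the IIT-CDIP dataset
- Initial data collection: same as the IIT-CDIP dataset
- Annotators: same as the IIT-CDIP dataset
License information
- License: see the copyright information of the Legacy Tobacco Document Library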



