Mozilla/docornot
Hugging Face · updated 2024-05-05 · indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/Mozilla/docornot
Description:
---
license: other
dataset_info:
  features:
  - name: image
    dtype: image
  - name: is_document
    dtype:
      class_label:
        names:
          '0': 'no'
          '1': 'yes'
  splits:
  - name: train
    num_bytes: 3747106867.2
    num_examples: 12800
  - name: test
    num_bytes: 468388358.4
    num_examples: 1600
  - name: validation
    num_bytes: 468388358.4
    num_examples: 1600
  download_size: 4682888903
  dataset_size: 4683883584.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
---
The `DocOrNot` dataset is evenly balanced: 50% of its images are pictures and 50% are documents.
It was built from 8k images drawn from each of these sources:
- RVL CDIP (Small) - https://www.kaggle.com/datasets/uditamin/rvl-cdip-small - license: https://www.industrydocuments.ucsf.edu/help/copyright/
- Flickr8k - https://www.kaggle.com/datasets/adityajn105/flickr8k - license: https://creativecommons.org/publicdomain/zero/1.0/
It can be used to train a model that classifies an image as either a picture or a document.
Source code used to generate this dataset: https://github.com/mozilla/docornot
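As a rough illustration of how the dataset might be consumed, the sketch below loads one split with the Hugging Face `datasets` library and turns the `is_document` label into a readable string. The repository id, the `image` and `is_document` feature names, and the label names come from the card above; the helper names `load_docornot` and `describe` are hypothetical and exist only for this example.

```python
# Label names copied from the card's `class_label` block.
LABEL_NAMES = {0: "no", 1: "yes"}


def load_docornot(split="train"):
    """Load one split of Mozilla/docornot.

    Assumes `pip install datasets` and network access; the import is kept
    local so the helper below can be used without the library installed.
    """
    from datasets import load_dataset

    return load_dataset("Mozilla/docornot", split=split)


def describe(example):
    """Map one row (a dict with an `is_document` key) to 'document' or 'picture'."""
    return "document" if LABEL_NAMES[example["is_document"]] == "yes" else "picture"
```

With network access, `ds = load_docornot("validation")` followed by `describe(ds[0])` should give the class of the first validation image, and `ds[0]["image"]` should hold the decoded image itself.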
Provider:
Mozilla
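Summary of original information
Dataset overview
Dataset information
- Features:
  - image: image data
  - is_document: class label indicating whether the image is a document
    - Label names: 0: no, 1: yes
- Splits:
  - train: 3747106867.2 bytes, 12800 examples
  - test: 468388358.4 bytes, 1600 examples
  - validation: 468388358.4 bytes, 1600 examples
- Data size:
  - Download size: 4682888903
  - Dataset size: 4683883584.0
Configuration
- Default configuration:
  - Data file paths:
    - train: data/train-*
    - test: data/test-*
    - validation: data/validation-*
Dataset description
- The DocOrNot dataset consists of 50% ordinary pictures and 50% document images.
- It was built from 8k images from each of the following sources:
  - RVL CDIP (Small)
  - Flickr8k
- The dataset can be used to train a model that classifies an image as an ordinary picture or a document.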
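The split sizes can be cross-checked against the "8k images from each source" statement with a few lines of arithmetic. This is only a consistency sketch: the counts are copied from the card's `num_examples` fields, and it assumes "8k" means exactly 8000 images per source.

```python
# Split sizes copied from the dataset card's `num_examples` fields.
SPLITS = {"train": 12800, "test": 1600, "validation": 1600}

# Assumption: "8k images" from each of the two sources means exactly 8000.
IMAGES_PER_SOURCE = 8000

total = sum(SPLITS.values())
fractions = {name: count / total for name, count in SPLITS.items()}

# The three splits together should account for both sources.
assert total == 2 * IMAGES_PER_SOURCE

print(total)      # 16000
print(fractions)  # {'train': 0.8, 'test': 0.1, 'validation': 0.1}
```

Under that assumption the splits add up to all 16000 images, in an 80/10/10 train/test/validation ratio.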



