lenagibee/GenDocVQA
Source: Hugging Face. Updated 2024-05-31; indexed 2024-06-15.
Download link: https://hf-mirror.com/datasets/lenagibee/GenDocVQA
---
dataset_info:
  features:
  - name: unique_id
    dtype: int64
  - name: image_path
    dtype: string
  - name: ocr
    sequence:
    - name: text
      dtype: string
    - name: bbox
      sequence: int64
    - name: block_id
      dtype: int64
    - name: text_id
      dtype: int64
    - name: par_id
      dtype: int64
    - name: line_id
      dtype: int64
    - name: word_id
      dtype: int64
  - name: question
    dtype: string
  - name: answer
    sequence: string
  splits:
  - name: train
    num_bytes: 11642104684
    num_examples: 260814
  - name: validation
    num_bytes: 1324439173
    num_examples: 28473
  download_size: 13295093966
  dataset_size: 12966543857
license: other
task_categories:
- visual-question-answering
language:
- en
tags:
- documents
- vqa
- generative
- document understanding
size_categories:
- 100K<n<1M
---
# GenDocVQA
This dataset provides a broad set of documents with questions about their contents.
The questions are non-extractive: a model that solves the task must be generative and
produce the answers itself rather than copy them from the document text.
## Dataset Details
## Uses
### Direct Use
To load the dataset, use the following code:
```python
import datasets
ds = datasets.load_dataset('lenagibee/GenDocVQA')
```
`ds` is a `DatasetDict` with two splits, `train` and `validation`.
To open an image, use the following example:
```python
from PIL import Image
im = Image.open(ds['train'][0]['image_path'])
```
Dataset generator:
https://huggingface.co/datasets/lenagibee/GenDocVQA/resolve/main/GenDocVQA.py?download=true
## Dataset Structure
All the necessary data is stored in the following archives:
* Images: https://huggingface.co/datasets/lenagibee/GenDocVQA/resolve/main/archives/gendocvqa2024_imgs.tar.gz?download=true
* OCR: https://huggingface.co/datasets/lenagibee/GenDocVQA/resolve/main/archives/gendocvqa2024_ocr.tar.gz?download=true
* Annotations: https://huggingface.co/datasets/lenagibee/GenDocVQA/resolve/main/archives/gendocvqa2024_annotations.tar.gz?download=true
Data parsing is already implemented in the attached dataset generator.
Image processing is left to the user.
The train split contains 260814 questions and the dev (validation) split contains 28473.
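For working with the archives directly rather than through the dataset generator, the sketch below unpacks one downloaded `.tar.gz` file using only the Python standard library. The helper name `extract_archive` and the destination folder are illustrative assumptions, not part of the dataset's own tooling.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the extracted file paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Use the safer "data" extraction filter where the running Python supports it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
        names = [member.name for member in tar.getmembers() if member.isfile()]
    return [dest / name for name in names]
```

Assuming the OCR archive has already been saved locally as `gendocvqa2024_ocr.tar.gz`, a call such as `extract_archive("gendocvqa2024_ocr.tar.gz", "data/ocr")` would return the list of extracted files.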
### Features of dataset
The features of the dataset are the following:
```python
import datasets

features = datasets.Features(
    {
        "unique_id": datasets.Value("int64"),
        "image_path": datasets.Value("string"),
        "ocr": datasets.Sequence(
            feature={
                'text': datasets.Value("string"),
                'bbox': datasets.Sequence(datasets.Value("int64")),
                'block_id': datasets.Value("int64"),
                'text_id': datasets.Value("int64"),
                'par_id': datasets.Value("int64"),
                'line_id': datasets.Value("int64"),
                'word_id': datasets.Value("int64")
            }
        ),
        "question": datasets.Value("string"),
        "answer": datasets.Sequence(datasets.Value("string")),
    }
)
```
#### Features description
* `unique_id` - integer, the id of the question
* `image_path` - string, the path to the image for the question (includes the download path)
* `ocr` - dictionary of lists, where the i-th element of each list describes the i-th word:
  * `text` - string, the word itself
  * `bbox` - list of 4 integers, the bounding box of the word
  * `block_id` - integer, the index of the block in which the word is located
  * `text_id` - integer, the index of the set of paragraphs in which the word is located
  * `par_id` - integer, the index of the paragraph in which the word is located
  * `line_id` - integer, the index of the line in which the word is located
  * `word_id` - integer, the index of the word
* `question` - string, the question text
* `answer` - list of strings, the answers to the question; can be empty (non-answerable)
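Because `ocr` stores one list per field, it is often convenient to regroup it per word or per line. The sketch below does both; the helper names are hypothetical, and it assumes that `word_id` orders the words within a line, which the description above does not state explicitly.

```python
from collections import defaultdict


def ocr_to_words(ocr):
    """Turn the column-wise `ocr` dict (one list per field) into a list of per-word dicts."""
    keys = list(ocr.keys())
    return [dict(zip(keys, values)) for values in zip(*(ocr[k] for k in keys))]


def ocr_to_lines(ocr):
    """Join words that share (block_id, text_id, par_id, line_id) into one string per line."""
    lines = defaultdict(list)
    for word in ocr_to_words(ocr):
        key = (word["block_id"], word["text_id"], word["par_id"], word["line_id"])
        lines[key].append((word["word_id"], word["text"]))
    # Assumption: sorting by word_id restores the reading order inside a line.
    return [" ".join(text for _, text in sorted(words)) for _, words in sorted(lines.items())]
```

With a loaded split, `ocr_to_lines(ds['train'][0]['ocr'])` would give a list of text lines that can be passed to a text-only model as document context.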
### Images
Inside the archive, the images are divided into dev and train folders.
They are regular image files in PNG and JPG formats,
so any image library can be used to process them.
### OCR
Like the images, the OCR files are divided into dev and train folders.
They are stored as JSON files.
#### OCR JSON Description
Each file is a list of elements, where each element describes a single word extracted
by the ABBYY FineReader OCR and contains the following fields, in this order:
1. `block_id` - integer, an index of the block, where the word is located
2. `text_id` - integer, an index of the set of paragraphs, where the word is located
3. `par_id` - integer, an index of the paragraph, where the word is located
4. `line_id` - integer, an index of the line, where the word is located
5. `word_id` - integer, an index of the word
6. `bbox` - list of 4 integers, a bounding box of the word
7. `text` - string, a word itself
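The field order above can be turned into the column-wise layout of the `ocr` feature with a few lines of Python. This is a minimal sketch, not the dataset generator's actual code: the card does not say whether each element is a positional JSON array or an object keyed by field name, so the hypothetical `parse_ocr` helper accepts both.

```python
import json

# Field order as listed in the OCR JSON description.
OCR_FIELDS = ["block_id", "text_id", "par_id", "line_id", "word_id", "bbox", "text"]


def parse_ocr(raw):
    """Convert raw OCR entries into a dict with one list per field."""
    columns = {name: [] for name in OCR_FIELDS}
    for entry in raw:
        # Assumption: an entry is either a positional array in OCR_FIELDS order
        # or an object keyed by field name.
        record = entry if isinstance(entry, dict) else dict(zip(OCR_FIELDS, entry))
        for name in OCR_FIELDS:
            columns[name].append(record[name])
    return columns


def load_ocr(path):
    """Read one OCR JSON file from disk and parse it."""
    with open(path, encoding="utf-8") as f:
        return parse_ocr(json.load(f))
```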
### Annotations
The archive contains the dev (validation) and train splits.
Question lists are represented by CSV files with the following columns:
1. `unique_id` - an id of the question
2. `split`
3. `question`
4. `answer`
5. `image_filename` - a filename of the related image
6. `ocr_filename` - a filename of the json file, containing the related OCR data
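The columns above can be read with the standard `csv` module. The sketch below assumes that the CSV files are comma-delimited with a header row using these exact column names; the function name, the folder arguments, and the sample row are made up for illustration.

```python
import csv
import io
from pathlib import Path


def read_annotations(csv_file, images_dir, ocr_dir):
    """Yield one dict per question, with file names resolved against local folders."""
    for row in csv.DictReader(csv_file):
        yield {
            "unique_id": int(row["unique_id"]),
            "split": row["split"],
            "question": row["question"],
            # Left as the raw string: the card does not specify how multiple
            # answers are encoded inside this column.
            "answer": row["answer"],
            "image_path": Path(images_dir) / row["image_filename"],
            "ocr_path": Path(ocr_dir) / row["ocr_filename"],
        }


# Made-up sample row for illustration only; not taken from the dataset.
sample = io.StringIO(
    "unique_id,split,question,answer,image_filename,ocr_filename\n"
    "7,train,What is the total?,42,a.png,a.json\n"
)
rows = list(read_annotations(sample, "imgs/train", "ocr/train"))
print(rows[0]["question"], rows[0]["image_path"])
```

For a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open(path, newline="", encoding="utf-8")`.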
## Dataset Creation
### Source Data
The data for this dataset was collected from the following datasets:
1. SlideVQA - Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. "A Dataset for Document Visual Question Answering on Multiple Images". In Proc. of AAAI. 2023.
2. PDFVQA - Yihao Ding, Siwen Luo, Hyunsuk Chung, and Soyeon Caren Han. "PDFVQA: A New Dataset for Real-World VQA on PDF Documents". 2023.
3. InfographicsVQA - Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V Jawahar. "InfographicVQA". 2021.
4. TAT-DQA - Fengbin Zhu, Wenqiang Lei, Fuli Feng, Chao Wang, Haozhou Zhang, and Tat-Seng Chua. "Towards complex document understanding by discrete reasoning". 2022.
5. DUDE - Jordy Van Landeghem, Rubén Tito, Łukasz Borchmann, Michał Pietruszka, Paweł Józiak, Rafał Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Ackaert, Ernest Valveny, Matthew Blaschko, Sien Moens, and Tomasz Stanisławek. "Document Understanding Dataset and Evaluation (DUDE)". 2023.
### Data Processing
The questions from each source dataset were filtered by question type,
keeping only non-extractive questions that relate to a single page. The remaining
questions were then paraphrased.
### Source Data Licenses
The dataset adheres to the licenses of its constituents.
1. SlideVQA: https://github.com/nttmdlab-nlp/SlideVQA/blob/main/LICENSE
2. PDFVQA: https://github.com/adlnlp/pdfvqa (Unknown)
3. InfographicsVQA: https://www.docvqa.org/datasets/infographicvqa (Unknown)
4. TAT-DQA: https://nextplusplus.github.io/TAT-DQA/ (CC BY 4.0)
5. DUDE: https://github.com/duchallenge-team/dude/blob/main/LICENSE (GPL 3.0)
## Dataset Card Contact
Feel free to reach out on the community page of this dataset or via
the Telegram chat of the challenge:
https://t.me/gendocvqa2024
Provider: lenagibee



