lenagibee/GenDocVQA

Name: lenagibee/GenDocVQA
Creator: lenagibee
Published: 2024-05-31 18:09:22
License: no description provided

Source: Hugging Face. Updated 2024-05-31; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/lenagibee/GenDocVQA

Resource description:

---
dataset_info:
  features:
  - name: unique_id
    dtype: int64
  - name: image_path
    dtype: string
  - name: ocr
    sequence:
    - name: text
      dtype: string
    - name: bbox
      sequence: int64
    - name: block_id
      dtype: int64
    - name: text_id
      dtype: int64
    - name: par_id
      dtype: int64
    - name: line_id
      dtype: int64
    - name: word_id
      dtype: int64
  - name: question
    dtype: string
  - name: answer
    sequence: string
  splits:
  - name: train
    num_bytes: 11642104684
    num_examples: 260814
  - name: validation
    num_bytes: 1324439173
    num_examples: 28473
  download_size: 13295093966
  dataset_size: 12966543857
license: other
task_categories:
- visual-question-answering
language:
- en
tags:
- documents
- vqa
- generative
- document understanding
size_categories:
- 100K<n<1M
---

# GenDocVQA

This dataset provides a broad set of documents with questions about their contents. The questions are non-extractive: the answer cannot simply be copied from the document, so a model solving the task must be generative and compute the answer itself.

## Dataset Details

## Uses

### Direct Use

To load the dataset, use the following code:

```python
import datasets

ds = datasets.load_dataset('lenagibee/GenDocVQA')
```

`ds` is a `DatasetDict` with two splits, `train` and `validation`.

To open an image, use the following example:

```python
from PIL import Image

# image_path points to the downloaded image file for this question
im = Image.open(ds['train'][0]['image_path'])
```

Dataset generator: https://huggingface.co/datasets/lenagibee/GenDocVQA/resolve/main/GenDocVQA.py?download=true

## Dataset Structure

All the necessary data is stored in the following archives:

* Images: https://huggingface.co/datasets/lenagibee/GenDocVQA/resolve/main/archives/gendocvqa2024_imgs.tar.gz?download=true
* OCR: https://huggingface.co/datasets/lenagibee/GenDocVQA/resolve/main/archives/gendocvqa2024_ocr.tar.gz?download=true
* Annotations: https://huggingface.co/datasets/lenagibee/GenDocVQA/resolve/main/archives/gendocvqa2024_annotations.tar.gz?download=true

Data parsing is already implemented in the attached dataset generator. Image processing is left to the user.

The train split contains 260814 questions and the dev (validation) split contains 28473.

### Features of dataset

The features of the dataset are the following:

```python
import datasets

features = datasets.Features(
    {
        "unique_id": datasets.Value("int64"),
        "image_path": datasets.Value("string"),
        "ocr": datasets.Sequence(
            feature={
                'text': datasets.Value("string"),
                'bbox': datasets.Sequence(datasets.Value("int64")),
                'block_id': datasets.Value("int64"),
                'text_id': datasets.Value("int64"),
                'par_id': datasets.Value("int64"),
                'line_id': datasets.Value("int64"),
                'word_id': datasets.Value("int64")
            }
        ),
        "question": datasets.Value("string"),
        "answer": datasets.Sequence(datasets.Value("string")),
    }
)
```

#### Features description

* `unique_id` - integer, the id of a question
* `image_path` - string, the path to the image for a question (includes the download path)
* `ocr` - dictionary of lists, where each list position holds the information for a single word
  * `text` - string, the word itself
  * `bbox` - list of 4 integers, the bounding box of the word
  * `block_id` - integer, the index of the block in which the word is located
  * `text_id` - integer, the index of the set of paragraphs in which the word is located
  * `par_id` - integer, the index of the paragraph in which the word is located
  * `line_id` - integer, the index of the line in which the word is located
  * `word_id` - integer, the index of the word
* `question` - string, the question
* `answer` - list of strings, the answers to the question; can be empty (non-answerable)

### Images

Images are divided into dev and train folders inside the archive. They are regular PNG and JPG files, so any image library can process them.

### OCR

Like the images, the OCR files are divided into dev and train folders. They are stored as JSON files.

#### OCR JSON Description

Each file is a list of elements. Each element describes a single word extracted by ABBYY FineReader OCR and contains the following fields, in this order:

1. `block_id` - integer, the index of the block in which the word is located
2. `text_id` - integer, the index of the set of paragraphs in which the word is located
3. `par_id` - integer, the index of the paragraph in which the word is located
4. `line_id` - integer, the index of the line in which the word is located
5. `word_id` - integer, the index of the word
6. `bbox` - list of 4 integers, the bounding box of the word
7. `text` - string, the word itself

### Annotations

The archive contains the dev (validation) and train splits. Question lists are represented by CSV files with the following columns:

1. `unique_id` - the id of the question
2. `split`
3. `question`
4. `answer`
5. `image_filename` - the filename of the related image
6. `ocr_filename` - the filename of the JSON file containing the related OCR data

## Dataset Creation

### Source Data

The data for this dataset was collected from the following datasets:

1. SlideVQA - Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. "A Dataset for Document Visual Question Answering on Multiple Images". In Proc. of AAAI. 2023.
2. PDFVQA - Yihao Ding, Siwen Luo, Hyunsuk Chung, and Soyeon Caren Han. "PDFVQA: A New Dataset for Real-World VQA on PDF Documents". 2023.
3. InfographicsVQA - Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar. "InfographicVQA". 2021.
4. TAT-DQA - Fengbin Zhu, Wenqiang Lei, Fuli Feng, Chao Wang, Haozhou Zhang, and Tat-Seng Chua. "Towards Complex Document Understanding by Discrete Reasoning". 2022.
5. DUDE - Jordy Van Landeghem, Rubén Tito, Łukasz Borchmann, Michał Pietruszka, Paweł Józiak, Rafał Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Ackaert, Ernest Valveny, Matthew Blaschko, Sien Moens, and Tomasz Stanisławek. "Document Understanding Dataset and Evaluation (DUDE)". 2023.

### Data Processing

The questions from each source dataset were filtered by question type, leaving only non-extractive questions that relate to a single page. The remaining questions were then paraphrased.

### Source Data Licenses

The dataset adheres to the licenses of its constituents.

1. SlideVQA: https://github.com/nttmdlab-nlp/SlideVQA/blob/main/LICENSE
2. PDFVQA: https://github.com/adlnlp/pdfvqa (Unknown)
3. InfographicsVQA: https://www.docvqa.org/datasets/infographicvqa (Unknown)
4. TAT-DQA: https://nextplusplus.github.io/TAT-DQA/ (CC BY 4.0)
5. DUDE: https://github.com/duchallenge-team/dude/blob/main/LICENSE (GPL 3.0)

## Dataset Card Contact

Feel free to reach out on the community page of this dataset or via the Telegram chat of the challenge: https://t.me/gendocvqa2024
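
As a rough illustration of the annotation CSV columns listed under Annotations, the sketch below reads rows and resolves the matching image and OCR file paths. Several things here are assumptions rather than documented facts: that the CSV has a header row with exactly those column names, that the `split` value matches the folder name inside the archives, and that the archives unpack to `<root>/<split>/<filename>`. The sample rows, the root names `imgs` and `ocr`, and the helper `read_annotations` are hypothetical. The serialization of the `answer` column is not described in the card, so it is kept as a raw string.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical sample rows in the documented column order; real files may differ.
SAMPLE_CSV = """unique_id,split,question,answer,image_filename,ocr_filename
1,train,What is the total?,42,doc_0001.png,doc_0001.json
2,train,Who signed the form?,,doc_0002.jpg,doc_0002.json
"""


def read_annotations(handle, images_root, ocr_root):
    """Yield one dict per question with assumed image and OCR paths."""
    for row in csv.DictReader(handle):
        split = row["split"]  # assumed to match the folder name in the archives
        yield {
            "unique_id": int(row["unique_id"]),
            "question": row["question"],
            # The answer encoding is undocumented, so keep the raw cell value.
            "answer_raw": row["answer"],
            "image_path": str(PurePosixPath(images_root) / split / row["image_filename"]),
            "ocr_path": str(PurePosixPath(ocr_root) / split / row["ocr_filename"]),
        }


records = list(read_annotations(io.StringIO(SAMPLE_CSV), "imgs", "ocr"))
for record in records:
    print(record["unique_id"], record["image_path"], record["ocr_path"])
```

In practice the attached dataset generator already performs this parsing, so a sketch like this is mainly useful for inspecting the archives without the `datasets` library.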

Provided by:

lenagibee

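Summary of original information

GenDocVQA dataset overview

Dataset details

Features

unique_id: integer, the unique identifier of a question.
image_path: string, the path to the image related to the question.
ocr: dictionary of lists, where each list position holds the information for a single word.
- text: string, the word itself.
- bbox: list of 4 integers, the bounding box of the word.
- block_id: integer, the index of the block in which the word is located.
- text_id: integer, the index of the set of paragraphs in which the word is located.
- par_id: integer, the index of the paragraph in which the word is located.
- line_id: integer, the index of the line in which the word is located.
- word_id: integer, the index of the word.
question: string, the question.
answer: list of strings, the answers to the question; may be empty (non-answerable).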
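
Because `ocr` is stored as parallel lists, one entry per word, it can help to regroup the words into text lines. The following is a minimal sketch of one way to do that. It assumes that the tuple (`block_id`, `text_id`, `par_id`, `line_id`) identifies a line and that `word_id` gives the reading order within it; the dataset card does not state this explicitly. The function `ocr_to_lines` and the sample values are hypothetical.

```python
from collections import defaultdict


def ocr_to_lines(ocr):
    """Group a word-level `ocr` dict of parallel lists into text lines."""
    groups = defaultdict(list)
    for i, word in enumerate(ocr["text"]):
        # Assumed line key: block, paragraph set, paragraph, line.
        key = (
            ocr["block_id"][i],
            ocr["text_id"][i],
            ocr["par_id"][i],
            ocr["line_id"][i],
        )
        groups[key].append((ocr["word_id"][i], word))

    lines = []
    for key in sorted(groups):
        # Order the words of each line by word_id (assumed reading order).
        words = [word for _, word in sorted(groups[key])]
        lines.append(" ".join(words))
    return lines


# Hypothetical sample: four words spread over two lines of one paragraph.
sample_ocr = {
    "text": ["Total", "revenue", "2023", "42"],
    "bbox": [[0, 0, 10, 5], [12, 0, 30, 5], [0, 8, 10, 13], [12, 8, 20, 13]],
    "block_id": [0, 0, 0, 0],
    "text_id": [0, 0, 0, 0],
    "par_id": [0, 0, 0, 0],
    "line_id": [0, 0, 1, 1],
    "word_id": [0, 1, 0, 1],
}

print(ocr_to_lines(sample_ocr))  # ['Total revenue', '2023 42']
```

Using the full tuple as the key keeps the grouping correct whether the indices restart inside each parent element or are numbered globally.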

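Data splits

train: 260814 examples, 11642104684 bytes in total.
validation: 28473 examples, 1324439173 bytes in total.

Dataset size

Download size: 13295093966 bytes
Dataset size: 12966543857 bytes

License

The dataset is released under a license listed as "other".

Task categories

Visual Question Answering (VQA)

Language

English (en)

Size category

100K<n<1M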
