AmazonScience/WikiDT

Name: AmazonScience/WikiDT
Creator: AmazonScience
Published: 2023-04-17 18:27:35
License: CC BY-SA 3.0

Source: Hugging Face. Updated 2023-04-17; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/AmazonScience/WikiDT

Resource description:

---
license: cc-by-sa-3.0
task_categories:
- table-question-answering
- question-answering
language:
- en
tags:
- documents
- tables
- VQA
pretty_name: WikiDT
size_categories:
- 100K<n<1M
---

# WikiDT: Wikipedia Table Document dataset for table extraction and visual question answering

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

WikiDT contains multi-level annotations and labels for image-based question answering. Because each question is answered from a table in the image, WikiDT also provides table annotations to help diagnose models and decompose the problem, so it can be used directly as a table recognition dataset as well.

The dataset contains 16,887 Wikipedia screenshots, which are segmented into 54,032 subpages because the full screenshots can be very long. In total, there are 159,905 tables in the dataset.

The number of question-answer samples is 70,652. Each QA sample is a triplet of <question, answer, full-page screenshot filename>, and is additionally annotated with retrieval labels (which subpage, and which table). 53,698 QA samples also have an SQL annotation.

For each subpage, OCR and table extraction annotations are available from two sources. The ground-truth table annotation was recorded while rendering the screenshots. To make the dataset realistic, we also requested OCR and table extraction from [Amazon Textract](https://aws.amazon.com/textract/) for each subpage (results obtained between Feb. 28 and Mar. 6, 2023).

### Languages

English

## Dataset Structure

Once downloaded, WikiDT has the following parts. The downloaded files are around 77GB. Please ensure you have at least 160GB of free space, since individual files are extracted from the tar archives.

```
.
├── WikiTableExtraction
│   ├── detection.partaa
│   ├── detection.partab
│   ├── detection.partac
│   ├── detection.partad
│   ├── detection.partae
│   ├── detection.partaf
│   ├── detection.partag
│   ├── structure.partaa
│   ├── structure.partab
│   ├── structure.partac
│   ├── structure.partad
│   └── structure.partae
├── images.partaa
├── images.partab
├── images.partac
├── images.partad
├── images.partae
├── images.partaf
├── images.partag
├── images.partah
├── images.partai
├── ocr.tar
├── samples
│   ├── test.json
│   ├── train.json
│   └── val.json
└── tsv.tar
```

Please concatenate the part files and extract them into the respective folders. For example, run

```
cd WikiTableExtraction/
cat detection.parta* | tar x
```

to extract the `detection` folder.

Once you have extracted all the tar files, the WikiDT dataset has the following file structure.

```sh
+--WikiDT-dataset
|  +--WikiTableExtraction
|  |  +--detection
|  |  |  +--images              # sub page images
|  |  |  +--train               # xml table bbox annotation
|  |  |  +--test                # xml table bbox annotation
|  |  |  +--val                 # xml table bbox annotation
|  |  |  images_filelist.txt    # index of 54,032 images
|  |  |  test_filelist.txt      # index of 5,410 test samples
|  |  |  train_filelist.txt     # index of 43,248 train samples
|  |  |  val_filelist.txt       # index of 5,347 val samples
|  |  +--structure
|  |  |  +--images              # images cropped to table region
|  |  |  +--train               # xml table bbox annotation
|  |  |  +--test                # xml table bbox annotation
|  |  |  +--val                 # xml table bbox annotation
|  |  |  images_filelist.txt    # index of 159,898 images
|  |  |  test_filelist.txt      # index of 15,989 test samples
|  |  |  train_filelist.txt     # index of 129,980 train samples
|  |  |  val_filelist.txt       # index of 15,991 val samples
|  +--samples                   # in total 70,652 TableVQA samples from the three json files
|  |  +--train.json
|  |  +--test.json
|  |  +--val.json
|  +--images                    # full page image
|  +--ocr                       # text and bbox for the table content
|  |  +--textract               # detected by Amazon Textract API
|  |  +--web                    # extracted from HTML information
|  +--tsv                       # extracted table in tsv format
|  |  +--textract               # detected by Amazon Textract API
|  |  +--web                    # extracted from HTML information
```

### Table VQA annotation example

Here is an example of a Table VQA annotation from `WikiDT-dataset/samples/[train|test|val].json`.

```
{'all_ocr_files_textract': ['ocr/textract/16301437_page_seg_0.json',
                            'ocr/textract/16301437_page_seg_1.json'],
 'all_ocr_files_web': ['ocr/web/16301437_page_seg_0.json',
                       'ocr/web/16301437_page_seg_1.json'],
 'all_table_files_textract': ['tsv/textract/16301437_page_0.tsv',
                              'tsv/textract/16301437_page_1.tsv'],
 'all_table_files_web': ['tsv/web/16301437_1.tsv',
                         'tsv/web/16301437_0.tsv'],
 'answer': [['don johnson buckeye st. classic']],
 'image': '16301437_page.png',
 'ocr_retrieval_file_textract': 'ocr/textract/16301437_page_seg_0.json',
 'ocr_retrieval_file_web': 'ocr/web/16301437_page_seg_0.json',
 'question': 'Name the Event which has a Score of 209-197?',
 'sample_id': '14190',
 'sql_str': "SELECT `event` FROM cur_table WHERE `score` = '209-197' ",
 'sub_page': ['16301437_page_seg_0.png', '16301437_page_seg_1.png'],
 'sub_page_retrieved': '16301437_page_seg_0.png',
 'subset': 'TFC',
 'table_id': '2-16301437-1',
 'table_retrieval_file_textract': 'tsv/textract/16301437_page_0.tsv',
 'table_retrieval_file_web': 'tsv/web/16301437_1.tsv'}
```

### Table Detection annotation example

Here is an example of an xml table bbox annotation from `WikiDT-dataset/WikiTableExtraction/structure/[train|test|val]/`.

```xml
<annotation>
  <folder />
  <filename>204_147_page_crop_5.png</filename>
  <source>WikiDT Dataset</source>
  <size>
    <width>788</width>
    <height>540.0</height>
    <depth>3</depth>
  </size>
  <object>
    <name>table</name>
    <rowspan />
    <colspan />
    <bndbox>
      <xmin>10</xmin>
      <ymin>10</ymin>
      <xmax>778</xmax>
      <ymax>530</ymax>
    </bndbox>
  </object>
  <object>
    <name>header row</name>
    <rowspan />
    <colspan />
    <bndbox>
      <xmin>10</xmin>
      <ymin>10</ymin>
      <xmax>778</xmax>
      <ymax>33</ymax>
    </bndbox>
  </object>
  <object>
    <name>header cell</name>
    <rowspan />
    <colspan>10</colspan>
    <bndbox>
      <xmin>12</xmin>
      <ymin>35</ymin>
      <xmax>776</xmax>
      <ymax>58</ymax>
    </bndbox>
  </object>
  <object>
    <name>table row</name>
    <rowspan />
    <colspan />
    <bndbox>
      <xmin>10</xmin>
      <ymin>60</ymin>
      <xmax>778</xmax>
      <ymax>530</ymax>
    </bndbox>
  </object>
</annotation>
```

### Licensing Information

CC BY-SA 3.0

### Contributors

- [Hui Shi](mailto:hshi@ucsd.edu) (work done during her internship at Amazon)
- [Yusheng Xie](mailto:yushx@amazon.com) (corresponding person)
- [Luis Goncalves](mailto:luisgonc@amazon.com)


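Provider:

AmazonScience

Summary of the original information

Dataset overview

Name: WikiDT

Task categories:

Table question answering
Question answering

Language: English

Tags:

Documents
Tables
Visual question answering (VQA)

Dataset size: 100K<n<1M

Dataset contents

Composition:

16,887 Wikipedia screenshots, segmented into 54,032 subpages.
159,905 tables in total.
70,652 question-answer samples, each a <question, answer, full-page screenshot filename> triplet, additionally annotated with retrieval labels (subpage and table).
53,698 question-answer samples carry an SQL annotation.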
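The share of samples with SQL annotations follows directly from the two totals above. The sketch below computes it and, as an illustration only, shows how the same share could be counted from loaded sample records. It assumes that a sample without an SQL annotation has a missing or empty `sql_str` field, which the dataset card does not state; `sql_coverage` is a hypothetical helper name.

```python
def sql_coverage(samples):
    """Return (n_with_sql, n_total, fraction) for a list of sample dicts.

    Assumption: samples lacking an SQL annotation have no `sql_str`
    key or an empty/blank string there.
    """
    total = len(samples)
    with_sql = sum(1 for s in samples if (s.get("sql_str") or "").strip())
    fraction = with_sql / total if total else 0.0
    return with_sql, total, fraction


# Totals quoted in the dataset card.
reported_fraction = 53698 / 70652
print(f"{reported_fraction:.1%} of QA samples carry an SQL annotation")
```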

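Annotations:

Each subpage has OCR and table extraction annotations from two sources.
The ground-truth table annotation was recorded while rendering the screenshots.
To make the dataset more realistic, OCR and table extraction results were also requested from Amazon Textract (obtained between Feb. 28 and Mar. 6, 2023).

Dataset structure

Download size: about 77GB; at least 160GB of free space is recommended for extracting the files.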
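A quick pre-flight check can catch a too-small disk before the download starts. This is a minimal sketch using the standard library; it treats the quoted 160GB as decimal gigabytes (an assumption, since the card does not say), and `has_enough_space` is a hypothetical helper name.

```python
import shutil

REQUIRED_GB = 160  # free space recommended by the dataset card


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has enough free space.

    Assumption: 1 GB is taken as 10**9 bytes.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


if not has_enough_space("."):
    print(f"Less than {REQUIRED_GB} GB free here; extraction may fail.")
```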

File structure:

.
├── WikiTableExtraction
│   ├── detection
│   └── structure
├── images
├── ocr
└── samples
    ├── train.json
    ├── test.json
    └── val.json
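The folders in this structure come from split tar archives (`detection.partaa`, `detection.partab`, ...), which the dataset card joins with `cat detection.parta* | tar x`. Where a shell is not available, the same step can be sketched in Python. `ChainedReader` and `extract_split_tar` are hypothetical helper names, not part of the dataset's tooling, and the sketch assumes the parts sort into the right order by filename.

```python
import glob
import tarfile


class ChainedReader:
    """Minimal file-like object that reads several files back to back."""

    def __init__(self, paths):
        self._paths = list(paths)
        self._current = None

    def read(self, size=-1):
        chunks = []
        remaining = size
        while remaining != 0:
            if self._current is None:
                if not self._paths:
                    break  # all parts consumed
                self._current = open(self._paths.pop(0), "rb")
            data = self._current.read(remaining)
            if not data:
                # Current part exhausted: move on to the next one.
                self._current.close()
                self._current = None
                continue
            chunks.append(data)
            if size >= 0:
                remaining -= len(data)
        return b"".join(chunks)


def extract_split_tar(pattern, dest="."):
    """Stream-concatenate the files matching `pattern` and untar into `dest`."""
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # Use the safer "data" extraction filter where this Python supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(fileobj=ChainedReader(parts), mode="r|*") as archive:
        archive.extractall(dest, **kwargs)
    return parts
```

For example, `extract_split_tar("WikiTableExtraction/detection.parta*", "WikiTableExtraction")` would mirror the shell command from the card. Streaming avoids writing a joined copy of the archive to disk first.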

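Detailed structure:

WikiTableExtraction holds the detection and structure subfolders; each contains XML table bounding-box annotations for the train, test, and validation splits.
images holds the full-page images.
ocr holds the text and bounding boxes of the table content, split into Amazon Textract results and results extracted from the web page HTML.
samples holds the train, test, and validation JSON files, 70,652 TableVQA samples in total.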
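To show how these folders relate to one another, the sketch below loads a split and resolves one sample's retrieval labels to files on disk. It assumes each JSON file holds a list of records with the field names from the card's annotation example, and that the relative paths in a record are rooted at the extracted dataset folder; neither is confirmed by the card. `load_samples` and `resolve_paths` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_samples(root, split):
    """Load samples/<split>.json; assumed to contain a list of record dicts."""
    path = Path(root) / "samples" / f"{split}.json"
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def resolve_paths(root, sample, source="web"):
    """Map one sample's labels to files; `source` is "web" or "textract"."""
    root = Path(root)
    return {
        "image": root / "images" / sample["image"],
        "ocr": root / sample[f"ocr_retrieval_file_{source}"],
        "table": root / sample[f"table_retrieval_file_{source}"],
    }


# Fields copied from the annotation example in the dataset card.
example = {
    "image": "16301437_page.png",
    "ocr_retrieval_file_web": "ocr/web/16301437_page_seg_0.json",
    "table_retrieval_file_web": "tsv/web/16301437_1.tsv",
}
print(resolve_paths("WikiDT-dataset", example)["table"])
```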

Annotation examples

Table VQA annotation example:

Contains the question, answer, image filename, SQL string, and related details.
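The SQL string in a sample queries a table named `cur_table` with backtick-quoted column names, which SQLite accepts. One way to check an SQL annotation against a table file is to load the TSV into an in-memory SQLite table and run the query. This sketch assumes the first TSV row is a header whose names match the columns used in the query; the toy table below is made up for illustration and is not the real file content. `run_sql_on_tsv` is a hypothetical helper name.

```python
import csv
import io
import sqlite3


def run_sql_on_tsv(tsv_text, sql_str):
    """Load TSV text into an in-memory table `cur_table` and run `sql_str`."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    width = len(header)
    # Pad or trim ragged rows so every row matches the header width.
    body = [(r + [""] * width)[:width] for r in body]

    conn = sqlite3.connect(":memory:")
    # Quote column names, doubling any embedded double quotes.
    columns = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
    conn.execute(f"CREATE TABLE cur_table ({columns})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO cur_table VALUES ({placeholders})", body)
    result = [list(row) for row in conn.execute(sql_str).fetchall()]
    conn.close()
    return result


# Hypothetical table contents, shaped to match the card's example query.
toy_tsv = "event\tscore\ndon johnson buckeye st. classic\t209-197\nanother event\t200-190\n"
sql_str = "SELECT `event` FROM cur_table WHERE `score` = '209-197' "
print(run_sql_on_tsv(toy_tsv, sql_str))
```

The result is a list of rows, the same nested-list shape as the `answer` field in the annotation example.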

Table detection annotation example:

Uses XML, with bounding boxes for the table, header rows, header cells, and table rows.
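These XML files can be read with the standard library. The sketch below extracts each object's name and bounding box; the embedded XML is shortened from the example in the dataset card, and `parse_annotation` is a hypothetical helper name. Coordinates are parsed through `float` first because the card's example contains a value such as `540.0`.

```python
import xml.etree.ElementTree as ET


def parse_annotation(xml_text):
    """Return (filename, objects); each object has a name and an int bbox."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        bbox = tuple(
            int(float(box.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        objects.append({"name": obj.findtext("name"), "bbox": bbox})
    return root.findtext("filename"), objects


# Shortened from the annotation example in the dataset card.
sample_xml = """
<annotation>
  <filename>204_147_page_crop_5.png</filename>
  <object>
    <name>table</name>
    <bndbox><xmin>10</xmin><ymin>10</ymin><xmax>778</xmax><ymax>530</ymax></bndbox>
  </object>
  <object>
    <name>header row</name>
    <bndbox><xmin>10</xmin><ymin>10</ymin><xmax>778</xmax><ymax>33</ymax></bndbox>
  </object>
</annotation>
"""

filename, objects = parse_annotation(sample_xml)
for obj in objects:
    print(filename, obj["name"], obj["bbox"])
```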

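License information

License: CC BY-SA 3.0

Collected summary

Dataset introduction

Background and challenges

Background overview

WikiDT is a Wikipedia-based dataset for table extraction and visual question answering. It contains a large number of screenshots, tables, and question-answer samples, with multi-level annotations and labels. The dataset supports model diagnosis and problem decomposition, and can be used for both table recognition and question answering.

The content above was collected and summarized by 遇见数据集.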
