
Paper2Fig100k dataset

Mendeley Data · Updated 2024-05-10 · Indexed 2024-06-29
Download link:
https://zenodo.org/records/7299423
Description:
Paper2Fig100k is a dataset with over 100k images of figures and text captions from research papers. The figures show diagrams, methodologies, and architectures from papers on arXiv.org. We also provide text captions for each figure, as well as OCR detections and recognitions on the figures (bounding boxes and texts).

The dataset consists of a directory called "figures" and two JSON files (train and test) that contain data about each figure. Each JSON object contains the following fields:

figure_id: Figure identifier based on the arXiv identifier: <yymm>.<xxxxxx>-Figure<I>-<k>.png.
captions: Text pairs extracted from the paper that relate to the figure, for instance the actual caption of the figure or references to the figure in the manuscript.
ocr_result: Result of performing OCR text recognition over the image, given as a list of triplets (bounding box, confidence, text) present in the image.
aspect: Aspect ratio of the image (H/W).

Take a look at the OCR-VQGAN GitHub repository, which uses the Paper2Fig100k dataset to train an image encoder for figures and diagrams with an OCR perceptual loss that encourages clear and readable text inside images. The dataset is explained in more detail in the paper "OCR-VQGAN: Taming Text-within-Image Generation" (WACV 2023).

Paper abstract: Synthetic image generation has recently experienced significant improvements in domains such as natural image or art generation. However, the problem of figure and diagram generation remains unexplored. A challenging aspect of generating figures and diagrams is effectively rendering readable texts within the images. To alleviate this problem, we present OCR-VQGAN, an image encoder and decoder that leverages OCR pre-trained features to optimize a text perceptual loss, encouraging the architecture to preserve high-fidelity text and diagram structure. To explore our approach, we introduce the Paper2Fig100k dataset, with over 100k images of figures and texts from research papers. The figures show architecture diagrams and methodologies of articles available at arXiv.org from fields like artificial intelligence and computer vision. Figures usually include text and discrete objects, e.g., boxes in a diagram, with lines and arrows that connect them. We demonstrate the superiority of our method by conducting several experiments on the task of figure reconstruction. Additionally, we explore the qualitative and quantitative impact of weighting different perceptual metrics in the overall loss function.
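For reference, below is a minimal Python sketch of how one might load and iterate over these annotations, assuming the archive is extracted locally, the train split is a JSON array of objects with the fields listed above, and ocr_result holds (bounding box, confidence, text) triplets. The file name and exact nesting are assumptions; check the downloaded archive for the actual layout.

```python
import json
from pathlib import Path

# Minimal sketch for reading Paper2Fig100k annotations.
# Assumptions (not confirmed by the dataset page): the archive is extracted
# into ./Paper2Fig100k, the train split is a JSON array of figure objects,
# and "ocr_result" is a list of (bounding box, confidence, text) triplets.
root = Path("Paper2Fig100k")
train_json = root / "paper2fig_train.json"  # hypothetical file name

with open(train_json, encoding="utf-8") as f:
    figures = json.load(f)

for fig in figures[:3]:
    figure_id = fig["figure_id"]             # "<yymm>.<xxxxxx>-Figure<I>-<k>.png"
    image_path = root / "figures" / figure_id
    captions = fig["captions"]               # caption text and in-paper references
    aspect = fig["aspect"]                   # aspect ratio H/W

    print(figure_id, aspect, len(captions))

    # Print high-confidence OCR detections found inside the figure.
    for bbox, confidence, text in fig["ocr_result"]:
        if confidence > 0.5:
            print("  OCR:", text, bbox)
```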
Created: 2023-06-28
Popular datasets

China National Digital Geological Map (Public Version at 1∶200 000 scale) Spatial Database

As the only one of its kind, the China National Digital Geological Map (Public Version at 1∶200 000 scale) Spatial Database (CNDGM-PVSD) is based on China's former nationwide measured results of regional geological survey at 1∶200 000 scale, and is one of the nationwide basic geosciences spatial databases jointly accomplished by multiple organizations in China. Spatially, it embraces 1,163 geological map sheets (at 1∶200 000 scale) in both MapGIS and ArcGIS formats, covering 72% of China's territory with a total data volume of 90 GB. Its main sources are 1∶200 000 regional geological survey reports, geological maps, and mineral resources maps with an original time span from the mid-1950s to the early 1990s. Approved by the State's related agencies, it meets all the related technical qualification requirements and standards issued by the China Geological Survey in data integrity, logic consistency, location accuracy, attribution fineness, and collation precision, and is hence of excellent and reliable quality. The CNDGM-PVSD is an important component of China's national spatial database categories, serving as a spatial digital platform for the information construction of the State's national economy, and providing information backbones for national and provincial economic planning, geohazard monitoring, geological survey, mineral resources exploration, and macro decision-making.

Indexed by DataCite Commons

CE-CSL

CE-CSL is a Chinese continuous sign language dataset created by the College of Intelligent Science and Engineering at Harbin Engineering University, designed to address the limitations of existing datasets in complex environments. It contains 5,988 continuous sign language video clips collected from everyday scenarios, spanning more than 70 different complex backgrounds to ensure representativeness and generalization. The dataset was built with a strict focus on practical application, collecting a large amount of sign language footage in real-world settings that covers a wide range of situational variation and environmental complexity. CE-CSL is mainly intended for continuous sign language recognition, aiming to improve the accuracy and efficiency of recognition in complex environments and to promote barrier-free communication between deaf and hearing communities.

Indexed by arXiv

Student Classroom Behavior Dataset (SCB-dataset3)

The Student Classroom Behavior Dataset (SCB-dataset3), created by Chengdu Neusoft University, contains 5,686 images with 45,578 labels and focuses on six behaviors: raising hands, reading, writing, using a phone, bowing the head, and lying on the desk. The dataset covers scenarios from kindergarten through university and has been evaluated with the YOLOv5, YOLOv7, and YOLOv8 algorithms, reaching a mean average precision of 80.3%. It is intended to provide a solid foundation for research on student behavior detection and to address the shortage of student behavior datasets in education.

Indexed by arXiv

UAVDT

UAVDT is a dataset for object detection tasks.

Indexed by GitHub

LibriSpeech

LibriSpeech is a corpus of approximately 1,000 hours of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project and has been carefully segmented and aligned.

Indexed by OpenDataLab