Data, Codes, and Supplementary Figures for "Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots"

Figshare · Updated 2025-03-31 · Indexed 2026-04-28

Download link:

https://figshare.com/articles/dataset/Data_and_Supplementary_Information_for_b_i_Leveraging_Vision_Capabilities_of_Multimodal_LLMs_for_Automated_Data_Extraction_from_Plots_i_b_/28559639

Description:

Paper: https://arxiv.org/abs/2503.12326

This repository contains datasets and tools related to PlotExtract, a pipeline for automated plot digitization using LLM-based vision models. The key components are described below.

Dataset Output Files

*.out_data - Results of LLM-based visual data extraction from plot images. These files contain the extracted data points in a CSV-like format.
*.out_code - Python code generated by the LLM to recreate the source plot from the extracted data.
*.out_conversation - Full conversations with the LLM conducted by PlotExtract, including prompts and responses.
interpolated_* - Visual and statistical comparisons based on interpolation between the LLM-extracted data and the ground truth. These correspond to the interpolation accuracy assessments described in the paper.
pointwise_* - Visual and statistical comparisons on a point-by-point basis between the extracted and ground-truth data. These correspond to the pointwise accuracy evaluations in the main text.
*.stats - Numerical summaries of extraction accuracy, referenced in the associated visual comparisons.
*.csv - Manually extracted ground-truth data used as the reference for evaluating extraction accuracy.

All of the above files are generated automatically during PlotExtract execution.

Published, Synthetic, and chartQA Datasets

The Published Dataset does not include the original plot images due to copyright restrictions. Instead, each plot is referenced in source_images.csv, which lists:

DOI of the source publication
Figure number
Filename used in this dataset

The Synthetic Dataset includes synthetic plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes.

The chartQA Dataset (https://doi.org/10.48550/arXiv.2203.10244) includes chartQA plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes.

There are two equivalent datasets, FULL and CROPPED. FULL contains the original images; CROPPED contains images cropped as tightly as possible to keep only the plot and remove additional text.

Codes

All source code, including PlotExtract and the supporting scripts for evaluation and comparison, is included in MPPolak_DMorgan_PlotExtract_Codes.zip. Each script contains in-line usage instructions and is intended to be self-explanatory for users familiar with Python-based data processing workflows.
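The file naming conventions described above lend themselves to a small sorting helper. The following is a minimal sketch, not part of PlotExtract: the function names, the category labels, and the example filenames (fig1.*) are hypothetical, and the rule that the interpolated_ and pointwise_ prefixes take priority over suffixes is an assumption about how the files are named.

```python
from collections import defaultdict
from pathlib import PurePath

# Suffix and prefix patterns come from the dataset description;
# the category labels on the right are illustrative, not official names.
PREFIX_KINDS = {
    "interpolated_": "interpolated_comparison",
    "pointwise_": "pointwise_comparison",
}
SUFFIX_KINDS = {
    ".out_data": "extracted_data",
    ".out_code": "replot_code",
    ".out_conversation": "llm_conversation",
    ".stats": "accuracy_stats",
    ".csv": "ground_truth",
}


def classify(filename):
    """Return an illustrative category label for one dataset file."""
    name = PurePath(filename).name
    # source_images.csv is the index of published figures, not ground truth.
    if name == "source_images.csv":
        return "source_index"
    # Assumption: comparison prefixes are checked before suffixes.
    for prefix, kind in PREFIX_KINDS.items():
        if name.startswith(prefix):
            return kind
    for suffix, kind in SUFFIX_KINDS.items():
        if name.endswith(suffix):
            return kind
    return "other"


def group_files(filenames):
    """Group filenames by category, preserving input order within a group."""
    groups = defaultdict(list)
    for filename in filenames:
        groups[classify(filename)].append(filename)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    example = ["fig1.out_data", "fig1.csv", "pointwise_fig1.png", "source_images.csv"]
    for kind, files in group_files(example).items():
        print(kind, files)
```

Checking the prefix first keeps a comparison file in its comparison group even when it also carries one of the listed suffixes; if the actual archive is organized differently, the two lookup tables are the only parts that need adjusting.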

Created:

2025-03-31
