stanford-crfm/image2struct-latex-v1

Name: stanford-crfm/image2struct-latex-v1
Creator: stanford-crfm
Published: 2024-08-01 11:00:43
License: Apache 2.0

Source: Hugging Face. Updated 2024-08-01; indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/stanford-crfm/image2struct-latex-v1


Description:

---
language:
- en
license: apache-2.0
size_categories:
- 1K<n<10K
task_categories:
- question-answering
- visual-question-answering
pretty_name: Image2Structure - Latex
dataset_info:
- config_name: algorithm
  features:
  - name: structure
    dtype: string
  - name: text
    dtype: string
  - name: image
    dtype: image
  - name: download_url
    dtype: string
  - name: instance_name
    dtype: string
  - name: date
    dtype: string
  - name: additional_info
    dtype: string
  - name: date_scrapped
    dtype: string
  - name: file_filters
    dtype: string
  - name: compilation_info
    dtype: string
  - name: rendering_filters
    dtype: string
  - name: assets
    sequence: string
  - name: category
    dtype: string
  - name: uuid
    dtype: string
  - name: length
    dtype: string
  - name: difficulty
    dtype: string
  splits:
  - name: validation
    num_bytes: 35687268.0
    num_examples: 300
  download_size: 33800484
  dataset_size: 35687268.0
- config_name: equation
  features:
  - name: structure
    dtype: string
  - name: text
    dtype: string
  - name: image
    dtype: image
  - name: download_url
    dtype: string
  - name: instance_name
    dtype: string
  - name: date
    dtype: string
  - name: additional_info
    dtype: string
  - name: date_scrapped
    dtype: string
  - name: file_filters
    dtype: string
  - name: compilation_info
    dtype: string
  - name: rendering_filters
    dtype: string
  - name: assets
    sequence: string
  - name: category
    dtype: string
  - name: uuid
    dtype: string
  - name: length
    dtype: string
  - name: difficulty
    dtype: string
  splits:
  - name: validation
    num_bytes: 6048536.0
    num_examples: 300
  download_size: 4696512
  dataset_size: 6048536.0
- config_name: plot
  features:
  - name: structure
    dtype: string
  - name: text
    dtype: string
  - name: image
    dtype: image
  - name: download_url
    dtype: string
  - name: instance_name
    dtype: string
  - name: date
    dtype: string
  - name: additional_info
    dtype: string
  - name: date_scrapped
    dtype: string
  - name: file_filters
    dtype: string
  - name: compilation_info
    dtype: string
  - name: rendering_filters
    dtype: string
  - name: assets
    sequence: string
  - name: category
    dtype: string
  - name: uuid
    dtype: string
  - name: length
    dtype: string
  - name: difficulty
    dtype: string
  splits:
  - name: validation
    num_bytes: 12245318.0
    num_examples: 300
  download_size: 8209981
  dataset_size: 12245318.0
- config_name: table
  features:
  - name: structure
    dtype: string
  - name: text
    dtype: string
  - name: image
    dtype: image
  - name: download_url
    dtype: string
  - name: instance_name
    dtype: string
  - name: date
    dtype: string
  - name: additional_info
    dtype: string
  - name: date_scrapped
    dtype: string
  - name: file_filters
    dtype: string
  - name: compilation_info
    dtype: string
  - name: rendering_filters
    dtype: string
  - name: assets
    sequence: string
  - name: category
    dtype: string
  - name: uuid
    dtype: string
  - name: length
    dtype: string
  - name: difficulty
    dtype: string
  splits:
  - name: validation
    num_bytes: 30860645.0
    num_examples: 300
  download_size: 29140278
  dataset_size: 30860645.0
- config_name: wild
  features:
  - name: image
    dtype: image
  - name: additional_info
    dtype: string
  - name: assets
    sequence: string
  - name: category
    dtype: string
  - name: uuid
    dtype: string
  - name: difficulty
    dtype: string
  splits:
  - name: validation
    num_bytes: 163753.0
    num_examples: 2
  download_size: 157850
  dataset_size: 163753.0
- config_name: wild_legacy
  features:
  - name: image
    dtype: image
  - name: url
    dtype: string
  - name: instance_name
    dtype: string
  - name: date_scrapped
    dtype: string
  - name: uuid
    dtype: string
  - name: category
    dtype: string
  - name: additional_info
    dtype: string
  - name: assets
    sequence: string
  - name: difficulty
    dtype: string
  splits:
  - name: validation
    num_bytes: 497129.0
    num_examples: 50
  download_size: 496777
  dataset_size: 497129.0
configs:
- config_name: algorithm
  data_files:
  - split: validation
    path: algorithm/validation-*
- config_name: equation
  data_files:
  - split: validation
    path: equation/validation-*
- config_name: plot
  data_files:
  - split: validation
    path: plot/validation-*
- config_name: table
  data_files:
  - split: validation
    path: table/validation-*
- config_name: wild
  data_files:
  - split: validation
    path: wild/validation-*
- config_name: wild_legacy
  data_files:
  - split: validation
    path: wild_legacy/validation-*
tags:
- biology
- finance
- economics
- math
- physics
- computer_science
- electronics
- statistics
---

# Image2Struct - Latex

[Paper](TODO) | [Website](https://crfm.stanford.edu/helm/image2structure/latest/) | Datasets ([Webpages](https://huggingface.co/datasets/stanford-crfm/i2s-webpage), [Latex](https://huggingface.co/datasets/stanford-crfm/i2s-latex), [Music sheets](https://huggingface.co/datasets/stanford-crfm/i2s-musicsheet)) | [Leaderboard](https://crfm.stanford.edu/helm/image2structure/latest/#/leaderboard) | [HELM repo](https://github.com/stanford-crfm/helm) | [Image2Struct repo](https://github.com/stanford-crfm/image2structure)

**License:** [Apache License](http://www.apache.org/licenses/) Version 2.0, January 2004

## Dataset description

Image2Struct is a benchmark for evaluating vision-language models on the practical task of extracting structured information from images.
This subdataset focuses on LaTeX code. The model is given an image of the expected output with the prompt:

```
Please provide the LaTex code used to generate this image. Only generate the code relevant to what you see. Your code will be surrounded by all the imports necessary as well as the begin and end document delimiters.
```

The subjects were collected from arXiv and are: eess, cs, stat, math, physics, econ, q-bio, q-fin.

The dataset is divided into 5 categories. Four of them are collected automatically using the [Image2Struct repo](https://github.com/stanford-crfm/image2structure):
* equations
* tables
* algorithms
* code

The last category, **wild**, was collected by taking screenshots of equations on the Wikipedia page for "equation" and its related pages.
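
The prompt states that the model's answer is later surrounded by the necessary imports and the document delimiters. The sketch below illustrates that wrapping step in Python. It is only an illustration: the function name `wrap_latex_snippet`, the `article` document class, and the package list are assumptions made for this example, not the preamble that the benchmark actually uses.

```python
# Hypothetical default preamble; the real benchmark preamble is not given in this card.
DEFAULT_PACKAGES = ("amsmath", "amssymb", "graphicx", "booktabs")


def wrap_latex_snippet(snippet, packages=DEFAULT_PACKAGES, document_class="article"):
    """Surround a model-generated LaTeX snippet with a preamble and document delimiters."""
    lines = [f"\\documentclass{{{document_class}}}"]
    # One \usepackage line per assumed package.
    lines += [f"\\usepackage{{{name}}}" for name in packages]
    lines.append("\\begin{document}")
    lines.append(snippet.strip())
    lines.append("\\end{document}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example: a model answer that contains only the equation it saw in the image.
    print(wrap_latex_snippet(r"\[ E = mc^2 \]"))
```

Because the wrapper supplies the preamble, a model answer that also emits `\documentclass` or `\begin{document}` would duplicate them, which is presumably why the prompt asks for only the code relevant to the image.
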
## Uses

To load the subset `equation` of the dataset to be sent to the model under evaluation in Python:

```python
import datasets
datasets.load_dataset("stanford-crfm/i2s-latex", "equation", split="validation")
```

To evaluate a model on Image2Latex (equation) using [HELM](https://github.com/stanford-crfm/helm/), run the following command-line commands:

```sh
pip install crfm-helm
helm-run --run-entries image2latex:subset=equation,model=vlm --models-to-run google/gemini-pro-vision --suite my-suite-i2s --max-eval-instances 10
```

You can also run the evaluation for only a specific `subset` and `difficulty`:

```sh
helm-run --run-entries image2latex:subset=equation,difficulty=hard,model=vlm --models-to-run google/gemini-pro-vision --suite my-suite-i2s --max-eval-instances 10
```

For more information on running Image2Struct using [HELM](https://github.com/stanford-crfm/helm/), refer to the [HELM documentation](https://crfm-helm.readthedocs.io/) and the article on [reproducing leaderboards](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).

## Citation

**BibTeX:**

```tex
@misc{roberts2024image2struct,
      title={Image2Struct: A Benchmark for Evaluating Vision-Language Models in Extracting Structured Information from Images},
      author={Josselin Somerville Roberts and Tony Lee and Chi Heem Wong and Michihiro Yasunaga and Yifan Mai and Percy Liang},
      year={2024},
      eprint={TBD},
      archivePrefix={arXiv},
      primaryClass={TBD}
}
```
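
As a companion to the loading snippet and the `helm-run` commands in the Uses section, the following sketch assembles the same command line programmatically and loads a subset on demand. The helper names (`build_run_entry`, `build_helm_command`, `load_subset`) are hypothetical and introduced only for illustration. The flags and the run-entry format are copied from the commands in this card, and the default repository id is the one used in the card's loading snippet, which may differ from the name under which this copy is hosted.

```python
import shlex


def build_run_entry(subset, difficulty=None, model="vlm"):
    """Build a run entry such as 'image2latex:subset=equation,model=vlm'."""
    parts = [f"subset={subset}"]
    if difficulty is not None:
        parts.append(f"difficulty={difficulty}")
    parts.append(f"model={model}")
    return "image2latex:" + ",".join(parts)


def build_helm_command(subset, model_to_run, suite, max_eval_instances, difficulty=None):
    """Return the helm-run argument list using only the flags shown in the card."""
    return [
        "helm-run",
        "--run-entries", build_run_entry(subset, difficulty),
        "--models-to-run", model_to_run,
        "--suite", suite,
        "--max-eval-instances", str(max_eval_instances),
    ]


def load_subset(subset, repo_id="stanford-crfm/i2s-latex"):
    """Load the validation split of one subset (needs the `datasets` package and network access)."""
    import datasets  # imported lazily so the helpers above work without it

    return datasets.load_dataset(repo_id, subset, split="validation")


if __name__ == "__main__":
    command = build_helm_command(
        "equation", "google/gemini-pro-vision", "my-suite-i2s", 10, difficulty="hard"
    )
    print(shlex.join(command))
```

The argument list can be passed to `subprocess.run` once `crfm-helm` is installed. Keeping the command as a list rather than a single string avoids shell-quoting problems if a suite name ever contains spaces.
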

**Metadata:**
- Language: English (en)
- License: Apache 2.0
- Size category: 1K<n<10K
- Task categories: question-answering, visual-question-answering
- Pretty name: Image2Structure - Latex

## Dataset information

Details of each configuration:

### Configuration: algorithm

Fields:
- structure: string
- text: string
- image: image
- download_url: string
- instance_name: string
- date: string
- additional_info: string
- date_scrapped: string
- file_filters: string
- compilation_info: string
- rendering_filters: string
- assets: sequence of strings
- category: string
- uuid: string
- length: string
- difficulty: string

Splits:
- validation: 35687268.0 bytes, 300 examples

Download size: 33800484 bytes. Dataset size: 35687268.0 bytes.

### Configuration: equation

Fields are the same as in the algorithm configuration.

Splits:
- validation: 6048536.0 bytes, 300 examples

Download size: 4696512 bytes. Dataset size: 6048536.0 bytes.

### Configuration: plot

Fields are the same as in the algorithm configuration.

Splits:
- validation: 12245318.0 bytes, 300 examples

Download size: 8209981 bytes. Dataset size: 12245318.0 bytes.

### Configuration: table

Fields are the same as in the algorithm configuration.

Splits:
- validation: 30860645.0 bytes, 300 examples

Download size: 29140278 bytes. Dataset size: 30860645.0 bytes.

### Configuration: wild

Fields:
- image: image
- additional_info: string
- assets: sequence of strings
- category: string
- uuid: string
- difficulty: string

Splits:
- validation: 163753.0 bytes, 2 examples

Download size: 157850 bytes. Dataset size: 163753.0 bytes.

### Configuration: wild_legacy

Fields:
- image: image
- url: string
- instance_name: string
- date_scrapped: string
- uuid: string
- category: string
- additional_info: string
- assets: sequence of strings
- difficulty: string

Splits:
- validation: 497129.0 bytes, 50 examples

Download size: 496777 bytes. Dataset size: 497129.0 bytes.

## Configuration files

Data files for each configuration:
- algorithm: validation split at `algorithm/validation-*`
- equation: validation split at `equation/validation-*`
- plot: validation split at `plot/validation-*`
- table: validation split at `table/validation-*`
- wild: validation split at `wild/validation-*`
- wild_legacy: validation split at `wild_legacy/validation-*`

## Tags

biology, finance, economics, math, physics, computer_science, electronics, statistics
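
The per-configuration figures in the metadata can be cross-checked with a few lines of arithmetic. The sketch below copies the `num_examples` and `download_size` values from the card and confirms that the total number of examples falls within the declared `1K<n<10K` size category. The dictionary and function names are introduced here for illustration only.

```python
# Values copied from the dataset card metadata (validation split of each configuration).
NUM_EXAMPLES = {
    "algorithm": 300,
    "equation": 300,
    "plot": 300,
    "table": 300,
    "wild": 2,
    "wild_legacy": 50,
}

DOWNLOAD_SIZE_BYTES = {
    "algorithm": 33800484,
    "equation": 4696512,
    "plot": 8209981,
    "table": 29140278,
    "wild": 157850,
    "wild_legacy": 496777,
}


def in_size_category(total, low=1_000, high=10_000):
    """Check a total against the card's declared size category, 1K<n<10K."""
    return low < total < high


total_examples = sum(NUM_EXAMPLES.values())         # 4 * 300 + 2 + 50 = 1252
total_download = sum(DOWNLOAD_SIZE_BYTES.values())  # 76501882 bytes

if __name__ == "__main__":
    print(f"examples: {total_examples}, within 1K<n<10K: {in_size_category(total_examples)}")
    print(f"download size: {total_download} bytes")
```
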
