stanford-crfm/image2struct-latex-v1
Source: Hugging Face | Updated: 2024-08-01 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/stanford-crfm/image2struct-latex-v1
Resource description:
---
language:
- en
license: apache-2.0
size_categories:
- 1K<n<10K
task_categories:
- question-answering
- visual-question-answering
pretty_name: Image2Structure - Latex
dataset_info:
- config_name: algorithm
  features:
  - name: structure
    dtype: string
  - name: text
    dtype: string
  - name: image
    dtype: image
  - name: download_url
    dtype: string
  - name: instance_name
    dtype: string
  - name: date
    dtype: string
  - name: additional_info
    dtype: string
  - name: date_scrapped
    dtype: string
  - name: file_filters
    dtype: string
  - name: compilation_info
    dtype: string
  - name: rendering_filters
    dtype: string
  - name: assets
    sequence: string
  - name: category
    dtype: string
  - name: uuid
    dtype: string
  - name: length
    dtype: string
  - name: difficulty
    dtype: string
  splits:
  - name: validation
    num_bytes: 35687268.0
    num_examples: 300
  download_size: 33800484
  dataset_size: 35687268.0
- config_name: equation
  features:
  - name: structure
    dtype: string
  - name: text
    dtype: string
  - name: image
    dtype: image
  - name: download_url
    dtype: string
  - name: instance_name
    dtype: string
  - name: date
    dtype: string
  - name: additional_info
    dtype: string
  - name: date_scrapped
    dtype: string
  - name: file_filters
    dtype: string
  - name: compilation_info
    dtype: string
  - name: rendering_filters
    dtype: string
  - name: assets
    sequence: string
  - name: category
    dtype: string
  - name: uuid
    dtype: string
  - name: length
    dtype: string
  - name: difficulty
    dtype: string
  splits:
  - name: validation
    num_bytes: 6048536.0
    num_examples: 300
  download_size: 4696512
  dataset_size: 6048536.0
- config_name: plot
  features:
  - name: structure
    dtype: string
  - name: text
    dtype: string
  - name: image
    dtype: image
  - name: download_url
    dtype: string
  - name: instance_name
    dtype: string
  - name: date
    dtype: string
  - name: additional_info
    dtype: string
  - name: date_scrapped
    dtype: string
  - name: file_filters
    dtype: string
  - name: compilation_info
    dtype: string
  - name: rendering_filters
    dtype: string
  - name: assets
    sequence: string
  - name: category
    dtype: string
  - name: uuid
    dtype: string
  - name: length
    dtype: string
  - name: difficulty
    dtype: string
  splits:
  - name: validation
    num_bytes: 12245318.0
    num_examples: 300
  download_size: 8209981
  dataset_size: 12245318.0
- config_name: table
  features:
  - name: structure
    dtype: string
  - name: text
    dtype: string
  - name: image
    dtype: image
  - name: download_url
    dtype: string
  - name: instance_name
    dtype: string
  - name: date
    dtype: string
  - name: additional_info
    dtype: string
  - name: date_scrapped
    dtype: string
  - name: file_filters
    dtype: string
  - name: compilation_info
    dtype: string
  - name: rendering_filters
    dtype: string
  - name: assets
    sequence: string
  - name: category
    dtype: string
  - name: uuid
    dtype: string
  - name: length
    dtype: string
  - name: difficulty
    dtype: string
  splits:
  - name: validation
    num_bytes: 30860645.0
    num_examples: 300
  download_size: 29140278
  dataset_size: 30860645.0
- config_name: wild
  features:
  - name: image
    dtype: image
  - name: additional_info
    dtype: string
  - name: assets
    sequence: string
  - name: category
    dtype: string
  - name: uuid
    dtype: string
  - name: difficulty
    dtype: string
  splits:
  - name: validation
    num_bytes: 163753.0
    num_examples: 2
  download_size: 157850
  dataset_size: 163753.0
- config_name: wild_legacy
  features:
  - name: image
    dtype: image
  - name: url
    dtype: string
  - name: instance_name
    dtype: string
  - name: date_scrapped
    dtype: string
  - name: uuid
    dtype: string
  - name: category
    dtype: string
  - name: additional_info
    dtype: string
  - name: assets
    sequence: string
  - name: difficulty
    dtype: string
  splits:
  - name: validation
    num_bytes: 497129.0
    num_examples: 50
  download_size: 496777
  dataset_size: 497129.0
configs:
- config_name: algorithm
  data_files:
  - split: validation
    path: algorithm/validation-*
- config_name: equation
  data_files:
  - split: validation
    path: equation/validation-*
- config_name: plot
  data_files:
  - split: validation
    path: plot/validation-*
- config_name: table
  data_files:
  - split: validation
    path: table/validation-*
- config_name: wild
  data_files:
  - split: validation
    path: wild/validation-*
- config_name: wild_legacy
  data_files:
  - split: validation
    path: wild_legacy/validation-*
tags:
- biology
- finance
- economics
- math
- physics
- computer_science
- electronics
- statistics
---
# Image2Struct - Latex
[Paper](TODO) | [Website](https://crfm.stanford.edu/helm/image2structure/latest/) | Datasets ([Webpages](https://huggingface.co/datasets/stanford-crfm/i2s-webpage), [Latex](https://huggingface.co/datasets/stanford-crfm/i2s-latex), [Music sheets](https://huggingface.co/datasets/stanford-crfm/i2s-musicsheet)) | [Leaderboard](https://crfm.stanford.edu/helm/image2structure/latest/#/leaderboard) | [HELM repo](https://github.com/stanford-crfm/helm) | [Image2Struct repo](https://github.com/stanford-crfm/image2structure)
**License:** [Apache License](http://www.apache.org/licenses/) Version 2.0, January 2004
## Dataset description
Image2Struct is a benchmark for evaluating vision-language models on the practical task of extracting structured information from images.
This subdataset focuses on LaTeX code. The model is given an image of the expected output with the prompt:
```Please provide the LaTex code used to generate this image. Only generate the code relevant to what you see. Your code will be surrounded by all the imports necessary as well as the begin and end document delimiters.```
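The prompt says the model's output will be surrounded by the necessary imports and the document delimiters. A minimal sketch of that wrapping step follows. The preamble (`ASSUMED_PREAMBLE`) and the helper name (`wrap_latex`) are hypothetical, since the preamble the benchmark actually uses is not given in this card:

```python
# Hypothetical preamble: the benchmark's real import list is not stated in this card.
ASSUMED_PREAMBLE = "\n".join([
    r"\documentclass{article}",
    r"\usepackage{amsmath}",
    r"\usepackage{amssymb}",
])


def wrap_latex(body: str, preamble: str = ASSUMED_PREAMBLE) -> str:
    """Surround model-generated LaTeX with a preamble and document delimiters."""
    return "\n".join([
        preamble,
        r"\begin{document}",
        body.strip(),  # drop stray leading/trailing whitespace from the model output
        r"\end{document}",
        "",  # trailing newline
    ])


if __name__ == "__main__":
    print(wrap_latex(r"\begin{equation} E = mc^2 \end{equation}"))
```

Keeping the preamble as a parameter makes it easy to swap in whatever package list a given evaluation setup requires.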
The source documents were collected from arXiv and cover the following subjects: eess, cs, stat, math, physics, econ, q-bio, q-fin.
The dataset is divided into 5 categories. Four of them are collected automatically using the [Image2Struct repo](https://github.com/stanford-crfm/image2structure):
* equations
* tables
* algorithms
* plots
The last category, **wild**, was collected by taking screenshots of equations on the Wikipedia page for "equation" and its related pages.
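The metadata at the top of this card declares six configs, each with a single `validation` split. As a quick sanity check, the sketch below tallies their example counts (copied from that metadata) and compares the total against the declared size category `1K<n<10K`. The helper name `total_examples` is illustrative only:

```python
from typing import Dict

# Validation-split example counts per config, copied from the card's metadata.
NUM_EXAMPLES: Dict[str, int] = {
    "algorithm": 300,
    "equation": 300,
    "plot": 300,
    "table": 300,
    "wild": 2,
    "wild_legacy": 50,
}


def total_examples(counts: Dict[str, int]) -> int:
    """Sum the example counts over all configs."""
    return sum(counts.values())


if __name__ == "__main__":
    total = total_examples(NUM_EXAMPLES)
    print(total)  # 1252
    print(1_000 < total < 10_000)  # True, consistent with 1K<n<10K
```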
## Uses
To load the `equation` subset of the dataset in Python, so that it can be sent to the model under evaluation:
```python
from datasets import load_dataset
# Load the validation split of the `equation` subset and keep a handle on it.
dataset = load_dataset("stanford-crfm/i2s-latex", "equation", split="validation")
```
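Once loaded, each row is a dictionary keyed by the feature names listed in the metadata (`image`, `text`, `difficulty`, `uuid`, and so on). The sketch below selects rows by their `difficulty` field. It runs on plain dictionaries so that it works offline. The helper name `select_by_difficulty` is hypothetical, and `"easy"` is an assumed label, since `hard` is the only difficulty value that appears in this card:

```python
from typing import Any, Dict, Iterable, List


def select_by_difficulty(
    rows: Iterable[Dict[str, Any]], difficulty: str
) -> List[Dict[str, Any]]:
    """Keep only rows whose `difficulty` field equals the requested value."""
    return [row for row in rows if row.get("difficulty") == difficulty]


if __name__ == "__main__":
    # Stand-in rows; real rows would come from iterating over the loaded dataset.
    # "easy" is an assumed label, not one confirmed by this card.
    rows = [
        {"uuid": "a", "difficulty": "hard", "text": r"\alpha + \beta"},
        {"uuid": "b", "difficulty": "easy", "text": r"x = 1"},
    ]
    for row in select_by_difficulty(rows, "hard"):
        print(row["uuid"], row["text"])
```

For a dataset object returned by the `datasets` library, the built-in `Dataset.filter` method is likely the more idiomatic way to do the same thing.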
To evaluate a model on Image2Latex (equation) using [HELM](https://github.com/stanford-crfm/helm/), run the following commands:
```sh
pip install crfm-helm
helm-run --run-entries image2latex:subset=equation,model=vlm --models-to-run google/gemini-pro-vision --suite my-suite-i2s --max-eval-instances 10
```
You can also run the evaluation for only a specific `subset` and `difficulty`:
```sh
helm-run --run-entries image2latex:subset=equation,difficulty=hard,model=vlm --models-to-run google/gemini-pro-vision --suite my-suite-i2s --max-eval-instances 10
```
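When sweeping over several subsets or difficulties, it can help to generate the run-entry string programmatically. The sketch below only reassembles the `helm-run` flags shown in the commands above. The helper names (`build_run_entry`, `build_helm_command`) are hypothetical and are not part of HELM:

```python
from typing import List, Optional


def build_run_entry(
    subset: str, difficulty: Optional[str] = None, model: str = "vlm"
) -> str:
    """Build an `image2latex` run entry in the form used by the commands above."""
    parts = [f"subset={subset}"]
    if difficulty is not None:
        parts.append(f"difficulty={difficulty}")
    parts.append(f"model={model}")
    return "image2latex:" + ",".join(parts)


def build_helm_command(
    run_entry: str, model_to_run: str, suite: str, max_eval_instances: int
) -> List[str]:
    """Return the argument list for `helm-run`, using only flags shown in this card."""
    return [
        "helm-run",
        "--run-entries", run_entry,
        "--models-to-run", model_to_run,
        "--suite", suite,
        "--max-eval-instances", str(max_eval_instances),
    ]


if __name__ == "__main__":
    entry = build_run_entry("equation", difficulty="hard")
    cmd = build_helm_command(entry, "google/gemini-pro-vision", "my-suite-i2s", 10)
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` once `crfm-helm` is installed.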
For more information on running Image2Struct using [HELM](https://github.com/stanford-crfm/helm/), refer to the [HELM documentation](https://crfm-helm.readthedocs.io/) and the article on [reproducing leaderboards](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).
## Citation
**BibTeX:**
```tex
@misc{roberts2024image2struct,
title={Image2Struct: A Benchmark for Evaluating Vision-Language Models in Extracting Structured Information from Images},
author={Josselin Somerville Roberts and Tony Lee and Chi Heem Wong and Michihiro Yasunaga and Yifan Mai and Percy Liang},
year={2024},
eprint={TBD},
archivePrefix={arXiv},
primaryClass={TBD}
}
```
**Metadata:**
- Language: English (en)
- License: Apache 2.0
- Size category: 1K<n<10K
- Task categories: question-answering, visual-question-answering
- Pretty name: Image2Struct - LaTeX
## Dataset information
Details of each config:
### Config: algorithm
Fields:
- structure: string
- text: string
- image: image
- download_url: string
- instance_name: string
- date: string
- additional_info: string
- date_scrapped: string
- file_filters: string
- compilation_info: string
- rendering_filters: string
- assets: sequence of strings
- category: string
- uuid: string
- length: string
- difficulty: string
Splits:
- validation: 35687268.0 bytes, 300 examples
Download size: 33800484 bytes; dataset size: 35687268.0 bytes
### Config: equation
Fields are the same as for the algorithm config. Splits:
- validation: 6048536.0 bytes, 300 examples
Download size: 4696512 bytes; dataset size: 6048536.0 bytes
### Config: plot
Fields are the same as for the algorithm config. Splits:
- validation: 12245318.0 bytes, 300 examples
Download size: 8209981 bytes; dataset size: 12245318.0 bytes
### Config: table
Fields are the same as for the algorithm config. Splits:
- validation: 30860645.0 bytes, 300 examples
Download size: 29140278 bytes; dataset size: 30860645.0 bytes
### Config: wild
Fields:
- image: image
- additional_info: string
- assets: sequence of strings
- category: string
- uuid: string
- difficulty: string
Splits:
- validation: 163753.0 bytes, 2 examples
Download size: 157850 bytes; dataset size: 163753.0 bytes
### Config: wild_legacy
Fields:
- image: image
- url: string
- instance_name: string
- date_scrapped: string
- uuid: string
- category: string
- additional_info: string
- assets: sequence of strings
- difficulty: string
Splits:
- validation: 497129.0 bytes, 50 examples
Download size: 496777 bytes; dataset size: 497129.0 bytes
## Data files
Data files for each config:
- algorithm: validation split at algorithm/validation-*
- equation: validation split at equation/validation-*
- plot: validation split at plot/validation-*
- table: validation split at table/validation-*
- wild: validation split at wild/validation-*
- wild_legacy: validation split at wild_legacy/validation-*
## Tags
biology, finance, economics, math, physics, computer_science, electronics, statistics
Provider:
stanford-crfm



