
spiqa

ModelScope community, updated 2025-12-05, indexed 2025-04-26
Download link:
https://modelscope.cn/datasets/google/spiqa
Resource description:
# SPIQA Dataset Card

## Dataset Details

**Dataset Name**: SPIQA (**S**cientific **P**aper **I**mage **Q**uestion **A**nswering)

**Paper**: [SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers](https://arxiv.org/abs/2407.09413)

**Github**: [SPIQA eval and metrics code repo](https://github.com/google/spiqa)

**Dataset Summary**: SPIQA is a large-scale and challenging QA dataset focused on figures, tables, and text paragraphs from scientific research papers in various computer science domains. The figures cover a wide variety of plots, charts, schematic diagrams, result visualizations, etc. The dataset is the result of a meticulous curation process, leveraging the breadth of expertise and ability of multimodal large language models (MLLMs) to understand figures. We employ both automatic and manual curation to ensure the highest level of quality and reliability. SPIQA consists of more than 270K questions divided into training, validation, and three different evaluation splits. The purpose of the dataset is to evaluate the ability of Large Multimodal Models to comprehend complex figures and tables together with the textual paragraphs of scientific papers.

This Data Card describes the structure of the SPIQA dataset, divided into training, validation, and three different evaluation splits. The test-B and test-C splits are filtered from the QASA and QASPER datasets and contain human-written QAs. We collect all scientific papers published at top computer science conferences between 2018 and 2023 from arXiv.

If you have any comments or questions, reach out to [Shraman Pramanick](https://shramanpramanick.github.io/) or [Subhashini Venugopalan](https://vsubhashini.github.io/).

**Supported Tasks**:
- Direct QA with figures and tables
- Direct QA with full paper
- CoT QA (retrieval of helpful figures, tables; then answering)

**Language**: English

**Release Date**: SPIQA was released in June 2024.

## Data Splits

The statistics of the different SPIQA splits are shown below.

| Split | Papers | Questions | Schematics | Plots & Charts | Visualizations | Other figures | Tables |
|:------:|:------:|:---------:|:----------:|:--------------:|:--------------:|:-------------:|:-------:|
| Train | 25,459 | 262,524 | 44,008 | 70,041 | 27,297 | 6,450 | 114,728 |
| Val | 200 | 2,085 | 360 | 582 | 173 | 55 | 915 |
| test-A | 118 | 666 | 154 | 301 | 131 | 95 | 434 |
| test-B | 65 | 228 | 147 | 156 | 133 | 17 | 341 |
| test-C | 314 | 493 | 415 | 404 | 26 | 66 | 1,332 |

## Dataset Structure

The contents of this dataset card are structured as follows:

```text
SPIQA
├── SPIQA_train_val_test-A_extracted_paragraphs.zip
│       └── Extracted textual paragraphs from the papers in SPIQA train, val and test-A splits
├── SPIQA_train_val_test-A_raw_tex.zip
│       └── The raw tex files from the papers in SPIQA train, val and test-A splits. These files are not required to reproduce our results; we open-source them for future research.
├── train_val
│   ├── SPIQA_train_val_Images.zip
│   │       └── Full resolution figures and tables from the papers in SPIQA train, val splits
│   ├── SPIQA_train.json
│   │       └── SPIQA train metadata
│   └── SPIQA_val.json
│           └── SPIQA val metadata
├── test-A
│   ├── SPIQA_testA_Images.zip
│   │       └── Full resolution figures and tables from the papers in SPIQA test-A split
│   ├── SPIQA_testA_Images_224px.zip
│   │       └── 224px figures and tables from the papers in SPIQA test-A split
│   └── SPIQA_testA.json
│           └── SPIQA test-A metadata
├── test-B
│   ├── SPIQA_testB_Images.zip
│   │       └── Full resolution figures and tables from the papers in SPIQA test-B split
│   ├── SPIQA_testB_Images_224px.zip
│   │       └── 224px figures and tables from the papers in SPIQA test-B split
│   └── SPIQA_testB.json
│           └── SPIQA test-B metadata
└── test-C
    ├── SPIQA_testC_Images.zip
    │       └── Full resolution figures and tables from the papers in SPIQA test-C split
    ├── SPIQA_testC_Images_224px.zip
    │       └── 224px figures and tables from the papers in SPIQA test-C split
    └── SPIQA_testC.json
            └── SPIQA test-C metadata
```

The `testA_data_viewer.json` file is only for viewing a portion of the data on the HuggingFace viewer to get a quick sense of the metadata.

## Metadata Structure

The metadata for every split is provided as a dictionary whose keys are the arXiv IDs of the papers. The primary contents of each dictionary item are:

- arXiv ID
- Semantic scholar ID (for test-B)
- Figures and tables
  - Name of the png file
  - Caption
  - Content type (figure or table)
  - Figure type (schematic, plot, photo (visualization), others)
- QAs
  - Question, answer and rationale
  - Reference figures and tables
  - Textual evidence (for test-B and test-C)
- Abstract and full paper text (for test-B and test-C; the full paper text for the other splits is provided as a zip)

## Dataset Use and Starter Snippets

#### Downloading the Dataset to Local

We recommend that users download the metadata and images to their local machine.

- Download the whole dataset (all splits).

```python
from huggingface_hub import snapshot_download

# Set local_dir to the directory where the dataset should be stored.
snapshot_download(repo_id="google/spiqa", repo_type="dataset", local_dir=".")
```

- Download a specific file.

```python
from huggingface_hub import hf_hub_download

# Set local_dir to the directory where the file should be stored.
hf_hub_download(
    repo_id="google/spiqa",
    filename="test-A/SPIQA_testA.json",
    repo_type="dataset",
    local_dir=".",
)
```

#### Questions and Answers from a Specific Paper in test-A

```python
import json

# Load the test-A metadata (a dictionary keyed by arXiv ID).
with open("test-A/SPIQA_testA.json", "r") as f:
    testA_metadata = json.load(f)

paper_id = "1702.03584v3"
print(testA_metadata[paper_id]["qa"])
```

#### Questions and Answers from a Specific Paper in test-B

```python
import json

# Load the test-B metadata (a dictionary keyed by arXiv ID).
with open("test-B/SPIQA_testB.json", "r") as f:
    testB_metadata = json.load(f)

paper_id = "1707.07012"
print(testB_metadata[paper_id]["question"])     # Questions
print(testB_metadata[paper_id]["composition"])  # Answers
```

#### Questions and Answers from a Specific Paper in test-C

```python
import json

# Load the test-C metadata (a dictionary keyed by arXiv ID).
with open("test-C/SPIQA_testC.json", "r") as f:
    testC_metadata = json.load(f)

paper_id = "1808.08780"
print(testC_metadata[paper_id]["question"])  # Questions
print(testC_metadata[paper_id]["answer"])    # Answers
```

## Annotation Overview

Questions and answers for the SPIQA train, validation, and test-A sets were machine-generated. Additionally, the SPIQA test-A set was manually filtered and curated. Questions in the SPIQA test-B set are collected from the QASA dataset, while those in the SPIQA test-C set are from the QASPER dataset. Answering the questions in all splits requires a holistic understanding of figures and tables together with related text from the scientific papers.

## Personal and Sensitive Information

We are not aware of any personal or sensitive information in the dataset.
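
## Sketch: Uniform QA Access Across Splits

The starter snippets read QAs through different keys depending on the split (`qa` for test-A, `question` with `composition` for test-B, `question` with `answer` for test-C). The following is a minimal sketch of one way to hide that difference behind a single helper. The function names `load_metadata`, `as_list`, and `qa_pairs` are hypothetical. The sketch also assumes that each item under `qa` is a dictionary with `question` and `answer` keys, and that the test-B and test-C fields are parallel lists. The card does not confirm either layout, so inspect one entry before relying on it.

```python
import json
from pathlib import Path


def load_metadata(path):
    """Load one split's metadata file: a dict keyed by arXiv ID."""
    with Path(path).open("r", encoding="utf-8") as f:
        return json.load(f)


def as_list(value):
    """Wrap a single value in a list; pass lists through; map None to []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def qa_pairs(entry):
    """Return (question, answer) pairs for one paper entry.

    Layouts assumed here (not confirmed by the card):
      * entry["qa"]: list of dicts with "question" / "answer" keys
      * otherwise: parallel entry["question"] and
        entry["composition"] (test-B) or entry["answer"] (test-C)
    """
    if "qa" in entry:
        return [(qa.get("question"), qa.get("answer")) for qa in as_list(entry["qa"])]
    questions = as_list(entry.get("question"))
    answers = as_list(entry.get("composition", entry.get("answer")))
    return list(zip(questions, answers))
```

Under these assumptions, `qa_pairs(load_metadata("test-B/SPIQA_testB.json")["1707.07012"])` would yield the question and answer pairs for that paper, and the same call shape would apply to the other splits.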

## Licensing Information

CC BY 4.0

## Citation Information

```bibtex
@article{pramanick2024spiqa,
  title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
  author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
  journal={NeurIPS},
  year={2024}
}
```
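
## Sketch: Unpacking a Downloaded Image Archive

The dataset structure lists figures and tables as zip archives (for example `SPIQA_testA_Images.zip`), so they have to be unpacked after download before the png files can be opened. The following is a minimal sketch using only the Python standard library. The helper name `extract_archive` and any destination directory are hypothetical, and the internal folder layout of the archives is not described in the card, which is why the helper returns the member names for inspection.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract one archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target directory if needed
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()  # list members before extracting
        zf.extractall(dest)
    return names
```

For instance, `extract_archive("test-A/SPIQA_testA_Images.zip", "test-A/images")` would unpack the test-A figures into a `test-A/images` directory (a name chosen here for illustration) and return the list of extracted paths.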

Provider:
maas
Created:
2025-04-21
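Background overview
SPIQA is a large-scale question-answering dataset on scientific paper images. It contains more than 270K questions covering a variety of figure types and tables, and is used to evaluate how well multimodal models understand scientific papers. The dataset is divided into training, validation, and three test splits, and supports direct QA and chain-of-thought QA tasks.
The summary above was collected and generated by 遇见数据集.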