datajuicer/VeriSciQA

Name: datajuicer/VeriSciQA
Creator: datajuicer
Published: 2026-02-08 19:28:22
License: CC BY-SA 4.0

Source: Hugging Face. Updated 2026-02-08; indexed 2026-02-07.

Download link:

https://hf-mirror.com/datasets/datajuicer/VeriSciQA

Description:

---
license: cc-by-sa-4.0
task_categories:
- visual-question-answering
language:
- en
tags:
- scientific-vqa
- vision-language
- scientific-figures
- multi-choice-qa
pretty_name: VeriSciQA
size_categories:
- 10K<n<100K
dataset_info:
  features:
  - name: image
    dtype: image
  - name: question
    dtype: string
  - name: options
    sequence: string
  - name: answer
    dtype: string
  - name: caption
    dtype: string
  - name: figure_type
    dtype: string
  - name: image_label
    dtype: string
  - name: section
    dtype: string
  - name: domain
    dtype: string
  - name: question_type
    dtype: string
---

# VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering

**Paper**: [VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering](https://arxiv.org/abs/2511.19899)

## Dataset Description

VeriSciQA is a large-scale, high-quality dataset for Scientific Visual Question Answering (SVQA), containing 20,272 QA pairs spanning 20 scientific domains, 12 figure types, and 5 question types. The dataset is constructed using a Cross-Modal Verification framework that generates QA pairs from figure-citing paragraphs and verifies their visual grounding against the corresponding figures, leveraging cross-modal consistency to filter out erroneous pairs.

### Key Features

- **20,272 QA pairs** covering diverse scientific figures from peer-reviewed papers
- **20 scientific domains**: including Computer Science, Physics, Mathematics, Biology, etc.
- **12 figure types**: Line plots, Bar charts, Scatter plots, Diagrams, Heatmaps, and more
- **5 question types**: Comparative, Compositional, Descriptive, Relational, and Structural
- **Multiple-choice format**: 4 options per question
- **Auto-verified**: Cross-modal consistency checks to minimize errors

### Dataset Statistics

| Metric | Value |
|--------|-------|
| Total QA pairs | 20,272 |
| Scientific domains | 20 |
| Figure types | 12 |
| Question types | 5 |

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("datajuicer/VeriSciQA", split="train")
print(dataset[0])
```

## Dataset Structure

### Data Fields

Each example in the dataset contains:

- `image`: (PIL Image) The scientific figure
- `question`: (string) The question about the figure
- `options`: (list of 4 strings) Multiple-choice options
- `answer`: (string) The correct answer choice (A/B/C/D)
- `caption`: (string) Original figure caption from the paper
- `figure_type`: (string) Type of figure (e.g., "Line Plot", "Bar Chart", "Diagram")
- `image_label`: (string) Figure label from the original paper
- `section`: (string) Relevant section text from the paper providing context
- `domain`: (string) Scientific domain (e.g., "cs", "physics", "math", "cond-mat")
- `question_type`: (string) Question type (Comparative, Compositional, Descriptive, Relational, Structural)

### Data Example

```json
{
  "image": "cond-mat0506675_2.jpg",
  "question": "In the figure, what is the direction of the angular velocity of the flagellar bundle relative to the cell body?",
  "options": [
    "Clockwise around the y-axis",
    "Counter-clockwise around the y-axis",
    "Counter-clockwise around the z-axis",
    "Clockwise around the x-axis"
  ],
  "answer": "B",
  "caption": "Set-up and notations for the mechanical model of {E. coli} swimming near a solid surface.",
  "figure_type": "Diagram",
  "image_label": "setup",
  "section": "We model the bacterium as a single, left-handed rigid helix attached to a spherical body...",
  "domain": "cond-mat",
  "question_type": "Relational"
}
```

## Citation

If you use this dataset in your research, please cite:

```bibtex
@article{verisciqa2025,
  title={VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering},
  author={Li, Yuyi and Chen, Daoyuan and Wang, Zhen and Lu, Yutong and Li, Yaliang},
  journal={arXiv preprint arXiv:2511.19899},
  year={2025},
  url={https://arxiv.org/abs/2511.19899}
}
```

## License

This dataset is released under the **CC BY-SA 4.0** license.
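
## Example: Working with the Data Fields

The fields described in the dataset card can be turned into a lettered multiple-choice prompt and scored against the `answer` letter. The following is a minimal sketch, not part of the official dataset tooling. It assumes that the answer letters A to D map to positions 0 to 3 of `options` (consistent with the data example in the card, but not stated explicitly there). The helper names `format_prompt`, `answer_text`, and `accuracy` are hypothetical, and `sample` is an abridged copy of the card's data example.

```python
# Illustrative helpers only; names and the letter-to-index mapping are assumptions.
LETTERS = "ABCD"

# Abridged copy of the data example from the dataset card (image omitted).
sample = {
    "question": (
        "In the figure, what is the direction of the angular velocity "
        "of the flagellar bundle relative to the cell body?"
    ),
    "options": [
        "Clockwise around the y-axis",
        "Counter-clockwise around the y-axis",
        "Counter-clockwise around the z-axis",
        "Clockwise around the x-axis",
    ],
    "answer": "B",
    "domain": "cond-mat",
    "question_type": "Relational",
}


def format_prompt(example: dict) -> str:
    """Render one record as a question followed by lettered options."""
    lines = [example["question"]]
    for letter, option in zip(LETTERS, example["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


def answer_text(example: dict) -> str:
    """Return the option string the answer letter points to.

    Assumes "A" corresponds to options[0], "B" to options[1], and so on.
    """
    index = LETTERS.index(example["answer"].strip().upper())
    return example["options"][index]


def accuracy(examples: list, predictions: list) -> float:
    """Fraction of predicted letters that match the reference letters."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    if not examples:
        return 0.0
    correct = sum(
        pred.strip().upper() == ex["answer"].strip().upper()
        for ex, pred in zip(examples, predictions)
    )
    return correct / len(examples)


print(format_prompt(sample))
```

With the dataset loaded as shown in the Usage section, the same helpers could be applied to records such as `dataset[0]`, and a subset could be selected with the `datasets` library's `Dataset.filter`, for example `dataset.filter(lambda ex: ex["domain"] == "cond-mat")`. A vision-language model would additionally need the `image` field, which this text-only sketch leaves out.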

Provider:

datajuicer
