VisualSimpleQA

Name: VisualSimpleQA
Creator: maas
Published: 2025-11-18 16:26:45
License: No description provided

ModelScope community. Updated: 2025-11-18. Indexed: 2025-03-22.

Download link:

https://modelscope.cn/datasets/AI-Bench/VisualSimpleQA


Description:

# VisualSimpleQA

## Introduction

VisualSimpleQA is a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard.

Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve only 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation based on this benchmark across different models highlights substantial opportunities for improvement in both visual and linguistic modules.

The dataset viewer above illustrates 129 samples from VisualSimpleQA-hard.

`arXiv:` [https://arxiv.org/pdf/2503.06492](https://arxiv.org/pdf/2503.06492)

**Data Example:**

```
{'id': 369,
 'multimodal_question': 'Which institution did the creator of this cartoon duck donate her natural science-related paintings to?',
 'answer': 'The Armitt Museum, Gallery, Library',
 'rationale': 'Jemima Puddle-Duck',
 'text_only_question': 'Which institution did the creator of Jemima Puddle-Duck donate her natural science-related paintings to?',
 'image_source': 'https://www.gutenberg.org/files/14814/14814-h/images/15-tb.jpg',
 'evidence': 'https://www.armitt.com/beatrix-potter-exhibition/\nhttps://en.wikipedia.org/wiki/Beatrix_Potter',
 'resolution': '400x360',
 'proportion_of_roi': '0.2232',
 'category': 'research and education',
 'text_in_image': 'absence',
 'rationale_granularity': 'fine-grained',
 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=400x360 at 0x7FE82C270D70>,
 'cropped_image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=164x196 at 0x7FE82C329550>}
```

## File Structure

`VisualSimpleQA/`
This directory contains all 500 samples of VisualSimpleQA stored in parquet files.

`VisualSimpleQA_hard/`
This directory contains 129 VisualSimpleQA-hard samples stored in a parquet file. These samples are selected based on well-defined criteria to ensure they represent more challenging cases from VisualSimpleQA.

## Disclaimer

This dataset contains images collected from various sources. The authors do NOT claim ownership or copyright over the images. The images may be subject to third-party rights, and users are solely responsible for verifying the legal status of any content before use.

- Intended Use: The images are provided for non-commercial research purposes only.
- Redistribution Prohibition: You may NOT redistribute or modify the images without permission from original rights holders.
- Reporting Violations: If you encounter any sample potentially breaching copyright or licensing rules, contact us at yanlingwang777@gmail.com. Verified violations will be removed promptly.

The authors disclaim all liability for copyright infringement or misuse arising from the use of this dataset. Users assume full legal responsibility for their actions.

## License

- Text Data: Licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)
- Images: Subject to custom terms (see Disclaimer above).

## Citation

**BibTeX:**

```bibtex
@article{wang2025visualsimpleqa,
  title={VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering},
  author={Yanling Wang and Yihan Zhao and Xiaodong Chen and Shasha Guo and Lixin Liu and Haoyang Li and Yong Xiao and Jing Zhang and Qi Li and Ke Xu},
  journal={arXiv preprint arXiv:2503.06492},
  year={2025}
}
```
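The `resolution` and `proportion_of_roi` fields in the data example appear to be related through the sizes of `image` and `cropped_image`. The sketch below is an assumption drawn from that single example, not a documented definition: it treats `proportion_of_roi` as the cropped-image area divided by the full-image area. The helper names `parse_resolution` and `roi_proportion` are hypothetical.

```python
# Sketch only: assumes proportion_of_roi = cropped area / full image area.
# This relationship is inferred from the single data example, not documented.

def parse_resolution(resolution: str) -> tuple[int, int]:
    """Parse a 'WIDTHxHEIGHT' string such as '400x360' into (width, height)."""
    width, height = resolution.lower().split("x")
    return int(width), int(height)


def roi_proportion(image_size: tuple[int, int], cropped_size: tuple[int, int]) -> float:
    """Return the fraction of the full image area covered by the cropped region."""
    image_width, image_height = image_size
    crop_width, crop_height = cropped_size
    return (crop_width * crop_height) / (image_width * image_height)


# Values copied from the data example (id 369).
sample = {"resolution": "400x360", "proportion_of_roi": "0.2232"}
cropped_size = (164, 196)  # size reported for 'cropped_image' in the example

computed = roi_proportion(parse_resolution(sample["resolution"]), cropped_size)
print(round(computed, 4))  # 0.2232, matching the stored 'proportion_of_roi'
```

For this sample, 164 × 196 = 32144 and 400 × 360 = 144000, so the ratio rounds to 0.2232. Whether the same relationship holds for every sample would need to be checked against the parquet files.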
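Each sample pairs a `multimodal_question` with a `text_only_question` that names the entity given in `rationale`, which is what makes a decoupled comparison possible. The following is a minimal sketch of how per-sample judgments might be summarized, assuming each answer has already been graded as correct or incorrect. The `Judged` record, the `decoupled_summary` function, and the "gap" statistic are illustrative assumptions and are not the metrics defined in the paper.

```python
from dataclasses import dataclass


@dataclass
class Judged:
    """Hypothetical record of graded answers for one sample."""
    sample_id: int
    multimodal_correct: bool   # graded answer to 'multimodal_question'
    text_only_correct: bool    # graded answer to 'text_only_question'


def decoupled_summary(results: list[Judged]) -> dict[str, float]:
    """Compute accuracy on each question type and the difference between them."""
    if not results:
        raise ValueError("results must not be empty")
    total = len(results)
    multimodal_accuracy = sum(r.multimodal_correct for r in results) / total
    text_only_accuracy = sum(r.text_only_correct for r in results) / total
    return {
        "multimodal_accuracy": multimodal_accuracy,
        "text_only_accuracy": text_only_accuracy,
        # Illustrative statistic: how much accuracy drops when the entity
        # must be recognized from the image instead of being named in text.
        "gap": text_only_accuracy - multimodal_accuracy,
    }


# Toy judgments, invented purely for illustration (not benchmark results).
toy_results = [
    Judged(1, True, True),
    Judged(2, False, True),
    Judged(3, False, False),
    Judged(4, True, True),
]
summary = decoupled_summary(toy_results)
print(summary)
```

Under this reading, a large gap would point at the visual side (recognizing the entity in the image), while low text-only accuracy would point at missing factual knowledge in the language side.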


Provider: maas

Created: 2025-03-19
