SK-VQA
Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-16.
Download link:
https://modelscope.cn/datasets/Intel/SK-VQA
# Dataset Card for SK-VQA
## Dataset Summary
SK-VQA is a large-scale synthetic multimodal dataset containing over 2 million visual question-answer pairs, each paired with context documents that contain the information needed to answer the questions.
The dataset is designed to address the critical need for training and evaluating multimodal LLMs (MLLMs) in context-augmented generation settings, particularly for retrieval-augmented generation (RAG) systems. It enables training MLLMs for contextual reasoning, where models learn to ground answers in provided context documents and images. Models trained on SK-VQA demonstrate superior out-of-domain generalization compared to those trained on existing datasets. It also provides a challenging benchmark for evaluating state-of-the-art models on context-augmented VQA tasks.
## Dataset Details
- **Creators**: Intel Labs
- **Version**: 1.0
- **License**: [Intel OBL Internal R&D Use License Agreement](LICENSE.md)
- **Total Number of Examples**: 2,006,489
- **Number of Training Samples**: 200,000 samples per training subset
- **Number of Test Samples**: 10,744
- **Additional Notes**:
  - The dataset includes three versions:
    - SK-VQA: Full dataset
    - SK-VQAIR: Filters samples where the context explicitly references the image
    - SK-VQAIR+CAP: Further filters to retain only samples where the answer is present in the context document
- **Format**: Each example consists of an image, a context paragraph, and multiple question-answer pairs.
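
The format described above (one image, one context paragraph, several question-answer pairs) could be modeled roughly as follows. This is a minimal sketch, not the released schema: the class names, field names, and the sample record are all invented for illustration, and the `answer_in_context` helper is only a naive substring approximation of an answer-presence check, since this card does not say how that filtering was actually implemented.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class Example:
    # Hypothetical field names; check the released files for the real schema.
    image_path: str
    context: str
    qa_pairs: List[QAPair]


def flatten(example: Example) -> Iterator[dict]:
    """Yield one record per question, repeating the shared image and context."""
    for qa in example.qa_pairs:
        yield {
            "image_path": example.image_path,
            "context": example.context,
            "question": qa.question,
            "answer": qa.answer,
        }


def answer_in_context(qa: QAPair, context: str) -> bool:
    """Naive answer-presence check (an assumption): case-insensitive substring match."""
    return qa.answer.strip().lower() in context.lower()


# Invented toy record, not taken from the dataset.
example = Example(
    image_path="images/0001.jpg",
    context="The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    qa_pairs=[
        QAPair("In which year was the structure in the image completed?", "1889"),
        QAPair("In which city does the structure in the image stand?", "Paris"),
    ],
)

records = list(flatten(example))
```

Flattening to one record per question is a common choice for training loops that expect a single question and answer per sample; keeping the nested form avoids duplicating the context text on disk.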
## Intended Use
- **Primary Uses**: The dataset is primarily intended for benchmarking, testing, and evaluating multimodal large language models (MLLMs) on context-augmented visual question answering (VQA) and retrieval-augmented generation (RAG) tasks. It may also be used for fine-tuning models to improve context reasoning in multimodal settings.
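
As a rough illustration of the context-augmented setting, the sketch below builds a text prompt from a context document and a question and scores a prediction with a simple normalized exact match. The prompt template, the function names, and the scoring rule are assumptions made for this example; this card does not prescribe a prompt format or an evaluation metric, and the image is assumed to be passed to the model separately.

```python
def build_prompt(context: str, question: str) -> str:
    """Combine a context document and a question into one text prompt.

    The template is illustrative only; the image goes to the model separately.
    """
    return (
        "Answer the question using the image and the context below.\n\n"
        f"Context: {context.strip()}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


def exact_match(prediction: str, reference: str) -> bool:
    """Compare answers after trimming whitespace, lowercasing, and dropping a trailing period."""
    def normalize(text: str) -> str:
        return text.strip().lower().rstrip(".")

    return normalize(prediction) == normalize(reference)


prompt = build_prompt(
    context="The Eiffel Tower was completed in 1889.",
    question="In which year was the structure in the image completed?",
)
```

Exact match is strict and will undercount correct free-form answers; it is used here only because it is easy to verify by hand.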
## Data Collection Process
- The dataset was synthetically generated using a fully automated pipeline. Images were sourced from three datasets: [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) (CC-BY 4.0), [Wikipedia/WIT](https://github.com/google-research-datasets/wit) (CC-BY-SA 3.0), and [COCO-Counterfactuals](https://huggingface.co/datasets/Intel/coco-counterfactuals) (CC-BY 4.0). For most examples, GPT-4 was used to generate both a context paragraph and multiple question-answer pairs that require reasoning over both the image and the context. Additionally, a subset of examples uses real context documents directly sourced from the WIT dataset, with GPT-4 generating only the QA pairs.
## Ethical Considerations
<!-- DON'T CHANGE THIS SECTION -->
Intel is committed to respecting human rights and avoiding causing or contributing to adverse impacts on human rights. See [Intel’s Global Human Rights Principles](https://www.intel.com/content/dam/www/central-libraries/us/en/documents/policy-human-rights.pdf). Intel’s products and software are intended only to be used in applications that do not cause or contribute to adverse impacts on human rights.
## Citation
```bibtex
@misc{su2025skvqasyntheticknowledgegeneration,
title={SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs},
author={Xin Su and Man Luo and Kris W Pan and Tien Pei Chou and Vasudev Lal and Phillip Howard},
year={2025},
eprint={2406.19593},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.19593},
}
```
## Contact Information
- **Issues**: For any issues or questions regarding the dataset, please contact the maintainers or open an issue in the dataset repository.
Provider: maas
Created: 2025-08-01



