SVIT

Name: SVIT
Creator: maas
Published: 2026-01-07 19:22:23
License: No description provided

ModelScope community. Updated: 2026-01-07. Indexed: 2024-05-15.

Download link:

https://modelscope.cn/datasets/BAAI/SVIT

Resource description:

# Dataset Card for SVIT

Scale up visual instruction tuning to millions by GPT-4.

## Dataset Description

- **Repository:** https://github.com/BAAI-DCAI/Visual-Instruction-Tuning
- **Paper:** https://arxiv.org/pdf/2307.04087.pdf

## Introduction

We Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 4.2 million visual instruction tuning samples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs, and 106K detailed image descriptions, by prompting GPT-4 with the abundant manual annotations of the images.

The structure of the repository:

- **raw**: The folder contains the original images and annotations from Visual Genome and MS-COCO.
- **data**: The folder contains the dataset in SVIT's original format.
- **format/llava-v1.5**: We also provide the dataset in LLaVA-v1.5's format to better align with the community. The image paths are compatible with the ones in [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#visual-instruction-tuning). The QA pairs in this folder differ from the ones in the "data" folder in two ways:
  (1) For referring QAs, we randomly sample a response formatting instruction ("Provide the bounding boxes of the mentioned objects.", "Include the coordinates for each mentioned object.", "Locate the objects with their coordinates.") and append it after each question. The `<st>` prefix and `<ed>` suffix are removed. As discussed [here](https://github.com/haotian-liu/LLaVA/issues/606), the bounding boxes are padded to square as per LLaVA-v1.5's settings.
  (2) The `<image>` token is added to the first question of each conversation.

The detailed data recipes of SVIT_core_150K and SVIT_mix_665K can be found in the paper.

- GitHub: https://github.com/BAAI-DCAI/Visual-Instruction-Tuning
- Paper: https://arxiv.org/pdf/2307.04087.pdf

## License

The dataset is licensed under a Creative Commons Attribution 4.0 License. Use of the dataset should also abide by the policy of OpenAI: https://openai.com/policies/terms-of-use. The use of original images and annotations from Visual Genome and MS-COCO should comply with the original licenses.

## Contact us

If you have any comments or questions about the dataset, feel free to create an issue on GitHub: https://github.com/BAAI-DCAI/Visual-Instruction-Tuning/issues.
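
The differences between the "data" folder and the format/llava-v1.5 folder described in the card can be illustrated with a rough Python sketch. It is not the authors' conversion script. The function names, the pixel-coordinate `(x1, y1, x2, y2)` box layout, the normalization to the padded square, and the `<image>\n` placement are all assumptions made for illustration; only the three instruction strings and the `<st>`, `<ed>`, and `<image>` tokens come from the card.

```python
import random
import re

# Response-formatting instructions quoted in the dataset card.
FORMAT_INSTRUCTIONS = [
    "Provide the bounding boxes of the mentioned objects.",
    "Include the coordinates for each mentioned object.",
    "Locate the objects with their coordinates.",
]


def strip_markers(text):
    """Drop the <st> and <ed> markers, keeping the text between them."""
    return re.sub(r"<st>|<ed>", "", text)


def pad_box_to_square(box, width, height):
    """Map a pixel box on a width x height image to normalized coordinates
    on the square obtained by centering the image and padding its shorter
    side (assumed reading of the LLaVA-v1.5 padding setting)."""
    side = max(width, height)
    dx = (side - width) / 2   # horizontal padding added on each side
    dy = (side - height) / 2  # vertical padding added on each side
    x1, y1, x2, y2 = box
    return [
        round((x1 + dx) / side, 3),
        round((y1 + dy) / side, 3),
        round((x2 + dx) / side, 3),
        round((y2 + dy) / side, 3),
    ]


def to_llava_question(question, is_first, is_referring, rng=random):
    """Apply the two question-side changes listed in the card."""
    q = strip_markers(question)
    if is_referring:
        # Append one randomly sampled formatting instruction.
        q = f"{q} {rng.choice(FORMAT_INSTRUCTIONS)}"
    if is_first:
        # Hypothetical placement: image token on its own line before the text.
        q = f"<image>\n{q}"
    return q
```

For example, under these assumptions `pad_box_to_square((100, 50, 300, 150), 400, 200)` shifts the box down by the 100 pixels of top padding and divides by the 400-pixel square side. The files in the repository itself remain the reference for the actual coordinate format.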

Provider:

maas

Created:

2024-04-10

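Collected summary

Dataset introduction

Background and challenges

Background overview

SVIT is a large-scale visual instruction tuning dataset of 4.2 million samples that aims to scale up multimodal learning with GPT-4. It covers several task types, including conversation QA, complex reasoning QA, referring QA, and image description, all generated from Visual Genome and MS-COCO annotations. It is provided in both its original format and a LLaVA-v1.5-compatible format for easier community integration, and is suitable for training and evaluating vision-language models.

The above content was collected and summarized by 遇见数据集.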