sqrti/SPA-VL

Name: sqrti/SPA-VL
Creator: sqrti
Published: 2024-07-03 18:37:11
License: CC BY 4.0

Hugging Face. Updated: 2024-07-03. Indexed: 2024-06-12.

Download link:

https://hf-mirror.com/datasets/sqrti/SPA-VL

Resource overview:

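SPA-VL is a large-scale, high-quality, and diverse alignment dataset intended to improve the safety alignment of vision-language models (VLMs). It covers 6 harmfulness domains, 13 categories, and 53 subcategories and contains 100,788 samples, each consisting of a question, an image, a chosen response, and a rejected response. Its goal is to improve the harmlessness and helpfulness of VLMs without degrading their core capabilities. The data is organized into training, validation, and test splits, each with detailed field descriptions. The dataset was built by collecting images from LAION-5B, generating questions with Gemini 1.0 Pro Vision, generating responses from 12 different models, and finally annotating preferences with GPT-4V. Intended uses include direct use and model training; the dataset should not be used to generate harmful or malicious content.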

SPA-VL is a large-scale, high-quality, and diverse alignment dataset designed to improve the safety alignment of Vision Language Models (VLMs). It covers 6 harmfulness domains, 13 categories, and 53 subcategories, and contains 100,788 samples, each a quadruple of (question, image, chosen response, rejected response). The dataset aims to enhance the harmlessness and helpfulness of VLMs without compromising their core capabilities. It is structured into training, validation, and test splits. The dataset card describes the data instances, fields, and splits, along with the creation process, source data, annotations, and considerations for using the data. The dataset is curated by the University of Science and Technology of China, Fudan NLP, and Shanghai Artificial Intelligence Laboratory, and is licensed under CC BY 4.0.
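
The quadruple structure described above maps naturally onto the prompt/chosen/rejected records that preference-tuning code typically consumes. The following is a minimal sketch of that mapping, not the dataset's confirmed schema: the class `PreferenceSample`, the function `to_preference_pairs`, the field names, and the output keys are all hypothetical names chosen for illustration, and the demo rows are made-up placeholders rather than real SPA-VL data.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass(frozen=True)
class PreferenceSample:
    """One (question, image, chosen, rejected) quadruple.

    Field names are illustrative assumptions, not the dataset's actual columns.
    """
    question: str
    image: Any  # kept opaque here; could be a file path or a decoded image
    chosen: str
    rejected: str


def to_preference_pairs(samples: Iterable[PreferenceSample]) -> List[Dict[str, Any]]:
    """Convert quadruples into prompt/chosen/rejected records.

    Samples with a blank question, or with identical chosen and rejected
    responses, carry no preference signal and are skipped.
    """
    pairs: List[Dict[str, Any]] = []
    for sample in samples:
        if not sample.question.strip():
            continue
        if sample.chosen == sample.rejected:
            continue
        pairs.append(
            {
                "prompt": sample.question,
                "image": sample.image,
                "chosen": sample.chosen,
                "rejected": sample.rejected,
            }
        )
    return pairs


if __name__ == "__main__":
    # Placeholder rows for illustration only; not taken from SPA-VL.
    demo = [
        PreferenceSample("What is shown here?", "img0.jpg", "A safe answer.", "An unsafe answer."),
        PreferenceSample("Describe the scene.", "img1.jpg", "Same text.", "Same text."),
    ]
    for record in to_preference_pairs(demo):
        print(record["prompt"], "->", record["chosen"])
```

Keeping the image opaque and filtering degenerate pairs up front is a design choice made for this sketch; a real pipeline would follow the field names and types given on the dataset card.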

Provider:

sqrti

Summary of original information

Dataset license information

License type: CC-BY-4.0
