kyh9191/Safe-LLaVA
Source: Hugging Face | Updated: 2026-04-06 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/kyh9191/Safe-LLaVA
Resource description:
---
license: bigscience-openrail-m
task_categories:
- question-answering
language:
- en
tags:
- privacy
- vision-language
- instruction-tuning
- multimodal
size_categories:
- 100B<n<1T
configs:
  - config_name: PRISM_test
    data_files:
      - split: test
        path: PRISM_test/test-*
dataset_info:
  config_name: PRISM_test
  features:
    - name: question_id
      dtype: string
    - name: image
      dtype: string
    - name: text
      dtype: string
    - name: category
      dtype: string
  splits:
    - name: test
      num_bytes:
      num_examples:
  download_size:
  dataset_size:
---
# 🌟 Safe-LLaVA: A Privacy-Preserving Vision-Language Dataset

**Safe-LLaVA** is a privacy-enhanced version of the original LLaVA dataset, developed to systematically remove sensitive biometric attributes such as **gender**, **race**, **age**, **eye color**, and **body weight**.

This dataset is designed for **privacy-safe pretraining**, **instruction tuning**, and **benchmarking Vision-Language Models (VLMs)** under biometric privacy constraints.

---
## 📑 Dataset Summary
- **Name**: Safe-LLaVA
- **Source**: Derived from LLaVA v1.5 (LAION, COCO, GQA, OCR_VQA, VG, etc.)
- **Size**:
  - 558K (pretraining)
  - 665K (instruction tuning)
- **Privacy Strategy**: GPT-4o–based rewriting and filtering to remove biometric leakage
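
The card metadata declares a `PRISM_test` configuration with a single `test` split. The following is a minimal sketch of loading it, assuming the Hugging Face `datasets` library is installed and that the split is published on the Hub as the metadata describes. The helper name `load_prism_test` is hypothetical and not part of the repository. Calling it downloads the data, so it needs network access; each returned row should carry the `question_id`, `image`, `text`, and `category` fields listed in the metadata.

```python
REPO_ID = "kyh9191/Safe-LLaVA"
CONFIG_NAME = "PRISM_test"  # configuration name taken from the card metadata


def load_prism_test():
    """Download and return the `test` split of the `PRISM_test` configuration.

    Hypothetical helper for illustration; requires network access.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, CONFIG_NAME, split="test")
```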
---
## 📁 File Descriptions
The repository contains nine key files:
| File | Purpose |
|------------------------------|-------------------------------------------|
| `Safe_blip_laion_cc_sbu_558k.json` | Pretraining dataset (558K samples) |
| `Safe_llava_v1_5_mix665k.json` | Instruction tuning dataset (665K samples) |
| `small_PRISM_refusal_soft.jsonl` | Soft prompt refusal benchmark (Part 1 / 2) |
| `large_PRISM_refusal_soft.jsonl` | Soft prompt refusal benchmark (Part 2 / 2) |
| `small_PRISM_refusal_hard.jsonl` | Hard prompt refusal benchmark (Part 1 / 2) |
| `large_PRISM_refusal_hard.jsonl` | Hard prompt refusal benchmark (Part 2 / 2) |
| `small_PRISM_implicit_leakage.jsonl` | Implicit leakage benchmark (Part 1 / 2) |
| `large_PRISM_implicit_leakage.jsonl` | Implicit leakage benchmark (Part 2 / 2) |
| `images.zip` | Image files used in PRISM evaluation |
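
The benchmark files use the JSON Lines format, one JSON object per line. The sketch below reads such a file and counts records per category. It assumes the `.jsonl` records share the `question_id`, `image`, `text`, and `category` fields declared for the `PRISM_test` configuration, which the card does not state explicitly. The sample records, the category values, and the helper names `load_jsonl` and `count_by_category` are made up for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def count_by_category(records):
    """Count records per value of the (assumed) `category` field."""
    return Counter(record.get("category", "unknown") for record in records)


# Made-up records for illustration only; they are not taken from the dataset.
sample_records = [
    {"question_id": "0", "image": "images/a.jpg", "text": "Describe the image.", "category": "age"},
    {"question_id": "1", "image": "images/b.jpg", "text": "Who is in the photo?", "category": "gender"},
    {"question_id": "2", "image": "images/c.jpg", "text": "What is happening here?", "category": "age"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_PRISM.jsonl"
    sample_path.write_text(
        "\n".join(json.dumps(record) for record in sample_records) + "\n",
        encoding="utf-8",
    )
    records = load_jsonl(sample_path)
    counts = count_by_category(records)

print(counts)
```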
---
## 🧪 Benchmarking: PRISM
The `*_PRISM_*.jsonl` and `images.zip` files are used for **PRISM**, a benchmark designed to evaluate:
1. **Refusal Accuracy**: How well a model refuses to answer biometric-related prompts
2. **Implicit Leakage**: How much sensitive information is leaked in open-ended generation
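
As a rough illustration of the first metric, the sketch below computes the fraction of model responses that look like refusals. It is not PRISM's actual scoring method: the keyword list `REFUSAL_MARKERS`, the helper names, and the example responses are all assumptions, and the official evaluation code lives in the companion repository. It also assumes every prompt in the set should be refused, so the refusal rate can stand in for refusal accuracy.

```python
# Hypothetical refusal markers; PRISM's actual scoring rules may differ.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i'm unable", "i won't")


def is_refusal(response):
    """Return True if the response contains any of the assumed refusal markers."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses):
    """Fraction of responses flagged as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(response) for response in responses) / len(responses)


# Made-up model outputs for illustration only.
example_responses = [
    "I cannot determine a person's age from this image.",
    "The person appears to be in their thirties.",
    "I'm unable to comment on someone's race.",
    "Two people are standing next to a bicycle.",
]

rate = refusal_rate(example_responses)
print(f"Refusal rate: {rate:.2f}")
```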
---
## 🔗 Companion Repository
To set up the dataset directory structure for training and evaluation, visit our GitHub repository:
👉 [https://github.com/Kimyounggun99/Safe-LLaVA](https://github.com/Kimyounggun99/Safe-LLaVA)
The repository also provides the training and testing code.
Provider:
kyh9191



