
kyh9191/Safe-LLaVA

Source: Hugging Face. Updated 2026-04-06. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/kyh9191/Safe-LLaVA
Resource description:
---
license: bigscience-openrail-m
task_categories:
- question-answering
language:
- en
tags:
- privacy
- vision-language
- instruction-tuning
- multimodal
size_categories:
- 100B<n<1T
configs:
- config_name: PRISM_test
  data_files:
  - split: test
    path: PRISM_test/test-*
dataset_info:
  config_name: PRISM_test
  features:
  - name: question_id
    dtype: string
  - name: image
    dtype: string
  - name: text
    dtype: string
  - name: category
    dtype: string
  splits:
  - name: test
    num_bytes:
    num_examples:
  download_size:
  dataset_size:
---

# 🌟 Safe-LLaVA: A Privacy-Preserving Vision-Language Dataset

**Safe-LLaVA** is a privacy-enhanced version of the original LLaVA dataset, developed to systematically remove sensitive biometric attributes such as **gender**, **race**, **age**, **eye color**, and **body weight**.

This dataset is designed for **privacy-safe pretraining**, **instruction tuning**, and **benchmarking Vision-Language Models (VLMs)** under biometric privacy constraints.

---

## 📑 Dataset Summary

- **Name**: Safe-LLaVA
- **Source**: Derived from LLaVA v1.5 (LAION, COCO, GQA, OCR_VQA, VG, etc.)
- **Size**:
  - 558K (pretraining)
  - 665K (instruction tuning)
- **Privacy Strategy**: GPT-4o-based rewriting and filtering to remove biometric leakage

---

## 📁 File Descriptions

The repository contains nine key files:

| File | Purpose |
|------|---------|
| `Safe_blip_laion_cc_sbu_558k.json` | Pretraining dataset (558K samples) |
| `Safe_llava_v1_5_mix665k.json` | Instruction tuning dataset (665K samples) |
| `small_PRISM_refusal_soft.jsonl` | Soft prompt refusal benchmark (Part 1 / 2) |
| `large_PRISM_refusal_soft.jsonl` | Soft prompt refusal benchmark (Part 2 / 2) |
| `small_PRISM_refusal_hard.jsonl` | Hard prompt refusal benchmark (Part 1 / 2) |
| `large_PRISM_refusal_hard.jsonl` | Hard prompt refusal benchmark (Part 2 / 2) |
| `small_PRISM_implicit_leakage.jsonl` | Implicit leakage benchmark (Part 1 / 2) |
| `large_PRISM_implicit_leakage.jsonl` | Implicit leakage benchmark (Part 2 / 2) |
| `images.zip` | Image files used in PRISM evaluation |

---

## 🧪 Benchmarking: PRISM

The `*_PRISM_*.jsonl` and `images.zip` files are used for **PRISM**, a benchmark designed to evaluate:

1. **Refusal Accuracy**: How well a model refuses to answer biometric-related prompts
2. **Implicit Leakage**: How much sensitive information is leaked in open-ended generation

---

## 🔗 Companion Repository

To set up the dataset structure for training and evaluation, visit our GitHub:

👉 [https://github.com/Kimyounggun99/Safe-LLaVA](https://github.com/Kimyounggun99/Safe-LLaVA)

Our GitHub also provides code for training and testing.
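As a rough illustration of how the PRISM `.jsonl` files listed above might be inspected after download, the sketch below reads one file line by line and counts records per category. It assumes each line is a JSON object carrying the fields named in the `PRISM_test` metadata (`question_id`, `image`, `text`, `category`); the card does not confirm that the raw `.jsonl` files use exactly this schema. The helper names `load_prism_jsonl` and `count_by_category` are hypothetical and are not part of the companion repository.

```python
import json
from collections import Counter
from pathlib import Path


def load_prism_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def count_by_category(records, key="category"):
    """Count records per category.

    The "category" field name is taken from the PRISM_test metadata and is
    an assumption for the raw .jsonl files; records without it count as None.
    """
    return Counter(record.get(key) for record in records)
```

For example, `count_by_category(load_prism_jsonl("small_PRISM_refusal_soft.jsonl"))` would give a per-category tally for one benchmark part, assuming the file has been downloaded into the working directory.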
Provided by:
kyh9191