Vi0509/kaeva-deepfake-datasets

Name: Vi0509/kaeva-deepfake-datasets
Creator: Vi0509
Published: 2026-02-25 04:28:16
License: No description provided

Source: Hugging Face. Updated 2026-02-25; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Vi0509/kaeva-deepfake-datasets

Resource description:

---
license: mit
task_categories:
- image-classification
tags:
- deepfake-detection
- ai-generated
- synthetic-media
- kaeva
pretty_name: Kaeva Deepfake Detection Training Datasets
size_categories:
- 1M<n<10M
---

# Kaeva Deepfake Detection: Training Datasets (V1–V9)

This repository documents **all training datasets** used across Kaeva deepfake detection model versions V1 through V9. No raw data is hosted here; the repository serves as a comprehensive reference card.

Training code: [`Viraj-FG/kaeva-verify/training/`](https://github.com/Viraj-FG/kaeva-verify/tree/main/training)

---

## Dataset Inventory

### Established Benchmarks

| Dataset | Type | Source | License |
|---|---|---|---|
| **CIFAKE** | Real + AI-generated (CIFAR-10 scale) | [HF: Bird/CIFAKE](https://huggingface.co/datasets/Bird/CIFAKE) | CC BY-SA 4.0 |
| **ArtiFact** | Multi-generator forensics benchmark | [GitHub: awsaf49/artifact](https://github.com/awsaf49/artifact) | Research |
| **OpenFake** | Open-source deepfake benchmark | [GitHub](https://github.com/OpenFake) | Research |
| **DeepFakeFace** | Face-swap deepfakes | [Kaggle](https://www.kaggle.com/datasets) | Research |
| **GenImage** | Multi-generator image detection | [GitHub: GenImage-Dataset](https://github.com/GenImage-Dataset/GenImage) | Research |
| **Kaggle DFD** | Deepfake Detection Challenge | [Kaggle DFD](https://www.kaggle.com/c/deepfake-detection-challenge) | Competition |

### Face Datasets (Real Baselines)

| Dataset | Description | Source | License |
|---|---|---|---|
| **CelebA-HQ** | 30k high-quality celebrity faces | [GitHub: tkarras/progressive_growing_of_gans](https://github.com/tkarras/progressive_growing_of_gans) | Non-commercial research |
| **FFHQ** | 70k Flickr-sourced high-quality faces | [GitHub: NVlabs/ffhq-dataset](https://github.com/NVlabs/ffhq-dataset) | CC BY-NC-SA 4.0 |

### Large-Scale Image Datasets

| Dataset | Description | Source | License |
|---|---|---|---|
| **ImageNet-1k** | 1.28M images, 1000 classes | [image-net.org](https://www.image-net.org/) | Research (non-commercial) |
| **ai-artbench** | AI-generated art benchmark | [HF: ramonpzg/ai-artbench](https://huggingface.co/datasets/ramonpzg/ai-artbench) | MIT |
| **dima806/ai_vs_real** | AI vs real photo classification | [HF: dima806/ai_vs_real](https://huggingface.co/datasets/dima806/ai_vs_real) | CC BY 4.0 |

### Web-Scraped Sources

| Source | Type | Usage |
|---|---|---|
| **thispersondoesnotexist.com** | GAN-generated faces (StyleGAN) | Fake samples |
| **picsum.photos** | Random real photographs | Real baseline samples |
| **StyleGAN3** | NVIDIA StyleGAN3 generated faces | Fake samples (GAN family) |

---

## V9 Generator Coverage

V9 expanded coverage to **10 modern generators** to ensure broad generalization:

| Generator | Family | Notes |
|---|---|---|
| `sdxl_turbo` | Stable Diffusion XL Turbo | Distilled, few-step |
| `playground_v2.5` | Playground AI | Aesthetic-optimized |
| `pixart_sigma` | PixArt-Σ | DiT-based |
| `kandinsky3` | Kandinsky 3 | Sber AI |
| `sd35_medium` | Stable Diffusion 3.5 Medium | MMDiT |
| `kolors` | Kolors (Kwai) | Chinese text-to-image |
| `sd35_large` | Stable Diffusion 3.5 Large | MMDiT (large) |
| `flux_schnell` | FLUX.1 [schnell] | Black Forest Labs, distilled |
| `flux_dev` | FLUX.1 [dev] | Black Forest Labs, guidance-distilled |
| `wan2.1` | Wan 2.1 | Video/image generation |

---

## Data Principles

### 1. Real Baseline: Pristine

All real images are sourced at the highest available quality with **no re-compression**. This ensures the model learns authentic camera/sensor characteristics rather than compression artifacts.

### 2. Compression Washing for Fakes

Fake images undergo **compression washing** (JPEG re-save at varying quality levels, WebP conversion, etc.) to strip superficial generation artifacts. This forces the model to detect deeper structural signals rather than relying on compression-level shortcuts.

### 3. GER Buffer: Hard Negatives

A **Generator-Error-Rate (GER) buffer** of hard negative samples is maintained. These are AI-generated images that closely mimic real image statistics and are difficult to classify. Including them during training improves calibration and pushes the decision boundary into the ambiguous region where it matters most.

---

## Training Scripts

All training code is maintained in the private repository:

```
Viraj-FG/kaeva-verify/training/
├── train_lnclip.py       # LNCLIP LayerNorm probe training
├── train_audio.py        # Audio deepfake detector training
├── data_pipeline.py      # Dataset loading & augmentation
├── compression_wash.py   # Compression washing transforms
└── ger_buffer.py         # GER hard negative mining
```

---

## Citation

If you use this dataset documentation or the Kaeva models, please cite:

```
@misc{kaeva2026,
  title={Kaeva: Multi-Modal Deepfake Detection},
  author={Viraj},
  year={2026},
  url={https://github.com/Viraj-FG/kaeva-verify}
}
```
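The card names `ger_buffer.py` for hard negative mining but does not show how the buffer is built, so the following is only an illustrative sketch of one plausible approach, not the Kaeva implementation. It assumes a detector that outputs a fake-probability per image and treats the fakes with the lowest scores as the hardest. The class name `HardNegativeBuffer`, its methods, and the file names are hypothetical.

```python
import heapq


class HardNegativeBuffer:
    """Hypothetical sketch: keep the `capacity` fake samples that a
    detector scored lowest, i.e. the fakes it found hardest to flag."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Entries are (-fake_score, sample_id). heapq is a min-heap, so the
        # root is always the highest-scoring (easiest) fake in the buffer.
        self._heap = []

    def offer(self, sample_id, fake_score):
        """Consider one fake sample; evict the easiest one if the buffer is full."""
        entry = (-fake_score, sample_id)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        else:
            # Pushes the new entry, then drops whichever entry is easiest.
            heapq.heappushpop(self._heap, entry)

    def hardest_first(self):
        """Return the buffered sample ids, lowest fake score first."""
        return [sample_id for _, sample_id in sorted(self._heap, reverse=True)]


# Usage with made-up scores (assumed detector outputs, not real results).
buffer = HardNegativeBuffer(capacity=2)
for sample_id, score in [("a.png", 0.95), ("b.png", 0.10),
                         ("c.png", 0.40), ("d.png", 0.05)]:
    buffer.offer(sample_id, score)
print(buffer.hardest_first())
```

A bounded heap keeps memory fixed while scanning a large pool of generated images, which is one reason to prefer it over sorting every score; whether the actual pipeline selects hard negatives this way is not stated in the card.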

Provider:

Vi0509
