Shanmuk4622/HybridForensicsNet-Standardized-Dataset-for-GAN-Diffusion-Detection

Name: Shanmuk4622/HybridForensicsNet-Standardized-Dataset-for-GAN-Diffusion-Detection
Creator: Shanmuk4622
Published: 2025-12-11 03:25:06
License: cc-by-nc-4.0 (as declared in the dataset card)

Source: Hugging Face · Updated 2025-12-11 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/Shanmuk4622/HybridForensicsNet-Standardized-Dataset-for-GAN-Diffusion-Detection

Description:

---
license: cc-by-nc-4.0
---

HybridForensicsNet: Standardized 512px Image Forensics Dataset
=============================================================

📌 Executive Summary
--------------------

HybridForensicsNet is a curated, balanced dataset for training and benchmarking digital image forensic algorithms. It targets a **hybrid threat model** covering both Generative Adversarial Networks (GANs) and Latent Diffusion Models (LDMs). All 10,000 images are standardized to **512×512** using high-quality Lanczos resampling, keeping high-frequency artifacts intact for CNNs and ViTs.

📊 Dataset Specifications
------------------------

| Metric       | Value         | Description                             |
| ------------ | ------------- | --------------------------------------- |
| Total Images | 10,000        | Balanced across Real and Fake classes   |
| Total Size   | ~780 MB       | High-quality JPEG compression           |
| Format       | JPEG (Q=95)   | Quality factor 95                       |
| Resolution   | 512 × 512     | Fixed dimensions (Lanczos resampled)    |
| Channels     | RGB           | 3 channels (standardized)               |
| Balance      | 50% Real/Fake | Strictly balanced to prevent prior bias |

📂 Directory Structure
---------------------

The dataset follows the standard `ImageFolder` layout for PyTorch/TensorFlow loaders.

```
HybridForensics_Dataset_512/
├── Real/                 (5,000 images)
│   ├── FFHQ/             (2,500) - High-quality human faces
│   └── MS_COCO/          (2,500) - General objects & scenes
│
├── Fake_GAN/             (2,500 images)
│   ├── ProGAN/           (1,250) - GAN-generated anime/faces
│   └── StyleGAN3/        (1,250) - Structural/texture proxy*
│
└── Fake_Diffusion/       (2,500 images)
    ├── SDXL/             (1,250) - Structural/texture proxy*
    └── Midjourney/       (1,250) - Structural/texture proxy*
```

🧠 Data Composition & Provenance
-------------------------------

**Class 0: REAL (Authentic Imagery)**

- **FFHQ (Flickr-Faces-HQ):** Official NVIDIA mirror; diverse, high-quality portraits.
- **MS COCO:** Real-world scenes, organic textures, man-made objects; mitigates overfitting to faces.

**Class 1: FAKE (Synthetic Imagery)**

- **ProGAN:** AnimeFace-derived; exhibits early GAN artifacts (checkerboarding, asymmetry).
- **StyleGAN3 / SDXL / Midjourney (Texture Proxies):**
  - *Curation:* High-frequency texture samples (Food101) to guarantee stable, high-res 512px imagery.
  - *Utility:* Structural proxies for training on texture/color anomalies common to high-res synthesis.

🛠️ Preprocessing Pipeline
------------------------

1) Format validation: header integrity checked; corrupted files removed.
2) Channel standardization: grayscale/CMYK → RGB.
3) Lanczos resampling to 512×512 (PIL.Image.LANCZOS) to preserve spectral detail.
4) JPEG save at Q=95 to balance fidelity and size.

💻 Usage Guide (PyTorch)
-----------------------

```python
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# 1. Define transformations (size already 512x512)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# 2. Load dataset
dataset_path = "./HybridForensics_Dataset_512"
dataset = datasets.ImageFolder(root=dataset_path, transform=transform)

# 3. Create loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# 4. Verify classes
print(dataset.classes)  # ['Fake_Diffusion', 'Fake_GAN', 'Real']
```

⚠️ Limitations & Ethical Notes
------------------------------

- **Proxy data:** SDXL, Midjourney, and StyleGAN3 folders use texture-rich proxies, not direct generations. Best suited for artifact detection, not semantic fidelity studies.
- **Bias:** Source datasets (FFHQ, COCO) may contain demographic and content biases.

📄 Citation
-----------

If you use this dataset structure in your research, please cite:

```
@dataset{hybridforensics2025,
  author    = {SHANMUKESH BONALA},
  title     = {HybridForensicsNet: Standardized 512px Image Forensics Dataset},
  year      = {2025},
  publisher = {Hugging Face},
  version   = {1.0.0}
}
```
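The four preprocessing steps listed in the dataset card can be approximated with Pillow. The sketch below is an illustration rather than the author's actual script: the function name `standardize_image` and the constants are hypothetical, and it assumes a direct resize to 512×512 without cropping, since the card does not say how non-square inputs were handled.

```python
from pathlib import Path

from PIL import Image

TARGET_SIZE = (512, 512)  # resolution stated in the dataset card
JPEG_QUALITY = 95         # quality factor stated in the dataset card


def standardize_image(src_path, dst_path):
    """Hypothetical helper: return True if written, False if unreadable."""
    try:
        # Step 1: verify() checks file integrity without decoding pixels.
        with Image.open(src_path) as probe:
            probe.verify()
        # verify() leaves the image unusable, so reopen it for processing.
        with Image.open(src_path) as img:
            rgb = img.convert("RGB")                          # Step 2
            resized = rgb.resize(TARGET_SIZE, Image.LANCZOS)  # Step 3
    except (OSError, SyntaxError, ValueError):
        # Treat anything Pillow cannot read as a corrupted file to skip.
        return False

    # Step 4: write a JPEG at Q=95, creating the output folder if needed.
    Path(dst_path).parent.mkdir(parents=True, exist_ok=True)
    resized.save(dst_path, format="JPEG", quality=JPEG_QUALITY)
    return True
```

Because the resize ignores aspect ratio, non-square sources would be stretched; a center crop before resizing is a common alternative, but whether the dataset used one is not stated.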
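After downloading, it may be worth checking that the local copy matches the per-folder counts given in the directory structure. The following standard-library sketch encodes those counts and compares them with what is on disk. The helper names and the assumption that files end in `.jpg` or `.jpeg` are illustrative and not taken from the dataset card.

```python
from pathlib import Path

# Per-folder counts as listed in the dataset card's directory tree.
EXPECTED_COUNTS = {
    "Real/FFHQ": 2500,
    "Real/MS_COCO": 2500,
    "Fake_GAN/ProGAN": 1250,
    "Fake_GAN/StyleGAN3": 1250,
    "Fake_Diffusion/SDXL": 1250,
    "Fake_Diffusion/Midjourney": 1250,
}


def count_images(root, extensions=(".jpg", ".jpeg")):
    """Count image files under each expected subfolder (0 if it is missing)."""
    root = Path(root)
    counts = {}
    for rel in EXPECTED_COUNTS:
        folder = root / rel
        if folder.is_dir():
            counts[rel] = sum(
                1 for p in folder.rglob("*")
                if p.is_file() and p.suffix.lower() in extensions
            )
        else:
            counts[rel] = 0
    return counts


def find_mismatches(counts, expected=EXPECTED_COUNTS):
    """Return {folder: (found, expected)} for every folder that differs."""
    return {
        rel: (counts.get(rel, 0), want)
        for rel, want in expected.items()
        if counts.get(rel, 0) != want
    }


# Sanity check on the stated composition: total, real, and fake image counts.
total = sum(EXPECTED_COUNTS.values())
real = sum(n for rel, n in EXPECTED_COUNTS.items() if rel.startswith("Real/"))
print(total, real, total - real)  # 10000 5000 5000
```

Calling `find_mismatches(count_images("./HybridForensics_Dataset_512"))` should return an empty dictionary when the download is complete and the extension assumption holds.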
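One detail to watch when training a binary detector: the card describes Real as class 0 and Fake as class 1, but the class list printed in the usage guide (`['Fake_Diffusion', 'Fake_GAN', 'Real']`) implies indices 0, 1, and 2 in alphabetical order. A small mapping function can reconcile the two. The name `to_binary_label` is hypothetical.

```python
# Class names in the order shown by the usage guide's print(dataset.classes).
IMAGEFOLDER_CLASSES = ["Fake_Diffusion", "Fake_GAN", "Real"]


def to_binary_label(class_index, classes=IMAGEFOLDER_CLASSES):
    """Map a folder-derived class index to 0 (real) or 1 (fake)."""
    return 0 if classes[class_index] == "Real" else 1


binary_labels = [to_binary_label(i) for i in range(len(IMAGEFOLDER_CLASSES))]
print(binary_labels)  # [1, 1, 0]
```

With torchvision, passing `target_transform=to_binary_label` to `datasets.ImageFolder` should apply this mapping as samples are loaded. Keeping the three-way labels instead would allow a GAN-versus-diffusion attribution task.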

