Shanmuk4622/HybridForensicsNet-Standardized-Dataset-for-GAN-Diffusion-Detection
Hugging Face · Updated 2025-12-11 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/Shanmuk4622/HybridForensicsNet-Standardized-Dataset-for-GAN-Diffusion-Detection
---
license: cc-by-nc-4.0
---
HybridForensicsNet: Standardized 512px Image Forensics Dataset
=============================================================
📌 Executive Summary
--------------------
HybridForensicsNet is a curated, balanced dataset for training and benchmarking digital image forensic algorithms. It targets a **hybrid threat model** covering both Generative Adversarial Networks (GANs) and Latent Diffusion Models (LDMs). All 10,000 images are standardized to **512×512** with Lanczos resampling, chosen to preserve the high-frequency artifacts that CNN- and ViT-based detectors rely on.
📊 Dataset Specifications
------------------------
| Metric | Value | Description |
| ------------- | ------------ | ------------------------------------------------ |
| Total Images | 10,000 | Balanced across Real and Fake classes |
| Total Size | ~780 MB | High-quality JPEG compression |
| Format | JPEG (Q=95) | Quality factor 95 |
| Resolution | 512 × 512 | Fixed dimensions (Lanczos resampled) |
| Channels | RGB | 3 channels (standardized) |
| Balance | 50% Real/Fake| Strictly balanced to prevent prior bias |
📂 Directory Structure
---------------------
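The dataset follows the class-per-folder layout expected by torchvision's `ImageFolder` and by comparable directory-based loaders in TensorFlow. The three top-level folders serve as class names; `ImageFolder` reads the source sub-folders beneath them recursively.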
```
HybridForensics_Dataset_512/
├── Real/                 (5,000 images)
│   ├── FFHQ/             (2,500) - High-quality human faces
│   └── MS_COCO/          (2,500) - General objects & scenes
│
├── Fake_GAN/             (2,500 images)
│   ├── ProGAN/           (1,250) - GAN-generated anime/faces
│   └── StyleGAN3/        (1,250) - Structural/texture proxy*
│
└── Fake_Diffusion/       (2,500 images)
    ├── SDXL/             (1,250) - Structural/texture proxy*
    └── Midjourney/       (1,250) - Structural/texture proxy*

* Texture-rich proxy images, not direct generations (see Limitations & Ethical Notes).
```
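A quick way to confirm that a local copy matches the layout above is to count the JPEG files in each source folder. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `count_images` and `find_mismatches` are hypothetical, and the `.jpg`/`.jpeg` file extensions are an assumption, since the card states only that the images are JPEG.

```python
from pathlib import Path

# Per-folder counts taken from the directory tree in this card.
EXPECTED_COUNTS = {
    "Real/FFHQ": 2500,
    "Real/MS_COCO": 2500,
    "Fake_GAN/ProGAN": 1250,
    "Fake_GAN/StyleGAN3": 1250,
    "Fake_Diffusion/SDXL": 1250,
    "Fake_Diffusion/Midjourney": 1250,
}

# Assumption: files carry a .jpg or .jpeg extension (not stated in the card).
JPEG_SUFFIXES = {".jpg", ".jpeg"}


def count_images(root):
    """Count JPEG files under each expected source folder (0 if the folder is missing)."""
    root = Path(root)
    counts = {}
    for rel in EXPECTED_COUNTS:
        folder = root / rel
        if folder.is_dir():
            counts[rel] = sum(
                1
                for p in folder.rglob("*")
                if p.is_file() and p.suffix.lower() in JPEG_SUFFIXES
            )
        else:
            counts[rel] = 0
    return counts


def find_mismatches(root):
    """Return {folder: (found, expected)} for every folder whose count is off."""
    counts = count_images(root)
    return {
        rel: (counts[rel], expected)
        for rel, expected in EXPECTED_COUNTS.items()
        if counts[rel] != expected
    }
```

Calling `find_mismatches("./HybridForensics_Dataset_512")` on a complete copy should return an empty dictionary; any entry it does return points at a folder worth re-downloading.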
🧠 Data Composition & Provenance
-------------------------------
**Class 0: REAL (Authentic Imagery)**
- **FFHQ (Flickr-Faces-HQ):** Official NVIDIA mirror; diverse, high-quality portraits.
- **MS COCO:** Real-world scenes, organic textures, man-made objects; mitigates overfitting to faces.
**Class 1: FAKE (Synthetic Imagery)**
- **ProGAN:** AnimeFace-derived; exhibits early GAN artifacts (checkerboarding, asymmetry).
- **StyleGAN3 / SDXL / Midjourney (Texture Proxies):**
  - *Curation:* High-frequency texture samples drawn from Food101, chosen to guarantee stable, high-resolution 512px imagery.
  - *Utility:* Structural proxies for training on the texture and color anomalies common to high-resolution synthesis.
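Because the proxy folders differ in provenance from the ProGAN folder, it can help to report metrics per source rather than only per class. The sketch below recovers the class folder and source folder from an image path and flags the three proxy sources named above. The helper names and the example file name `img_0001.jpg` are hypothetical; the card does not document a file-naming scheme.

```python
from pathlib import Path

# Source folders that this card describes as texture proxies.
PROXY_SOURCES = {"StyleGAN3", "SDXL", "Midjourney"}


def source_of(image_path, dataset_root):
    """Return (class_folder, source_folder) for an image stored under dataset_root."""
    parts = Path(image_path).relative_to(dataset_root).parts
    if len(parts) < 3:
        raise ValueError(f"expected <class>/<source>/<file>, got: {image_path}")
    return parts[0], parts[1]


def is_proxy(image_path, dataset_root):
    """True if the image comes from one of the texture-proxy source folders."""
    _, source = source_of(image_path, dataset_root)
    return source in PROXY_SOURCES


# Hypothetical file name, used only for illustration.
example = "HybridForensics_Dataset_512/Fake_GAN/ProGAN/img_0001.jpg"
print(source_of(example, "HybridForensics_Dataset_512"))  # ('Fake_GAN', 'ProGAN')
print(is_proxy(example, "HybridForensics_Dataset_512"))   # False
```

With a torchvision `ImageFolder`, the paths to feed into `source_of` are available as the first element of each entry in `dataset.samples`.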
🛠️ Preprocessing Pipeline
------------------------
1) Format validation: header integrity checked; corrupted files removed.
2) Channel standardization: grayscale/CMYK → RGB.
3) Lanczos resampling to 512×512 (PIL.Image.LANCZOS) to preserve spectral detail.
4) JPEG save at Q=95 to balance fidelity and size.
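The four steps above can be sketched with Pillow as follows. This is an illustrative reconstruction, not the author's actual script: the function name `standardize_image` is hypothetical, and the card does not say whether non-square images were cropped or stretched, so this sketch simply stretches to 512×512.

```python
from pathlib import Path

from PIL import Image

TARGET_SIZE = (512, 512)
JPEG_QUALITY = 95


def standardize_image(src_path, dst_path):
    """Validate, convert to RGB, resize to 512x512 and save as JPEG (Q=95).

    Returns True if the output was written, False if the source was unreadable.
    """
    try:
        # Step 1: verify() checks file integrity without decoding the pixels.
        with Image.open(src_path) as probe:
            probe.verify()
        # verify() leaves the image object unusable, so the file is reopened.
        with Image.open(src_path) as img:
            # Step 2: grayscale, CMYK and other modes are converted to RGB.
            rgb = img.convert("RGB")
    except (OSError, SyntaxError, ValueError):
        return False

    # Step 3: Lanczos resampling to the fixed target size.
    # Assumption: aspect ratio is not preserved (the card does not specify).
    resized = rgb.resize(TARGET_SIZE, Image.LANCZOS)

    # Step 4: save as JPEG with quality factor 95.
    Path(dst_path).parent.mkdir(parents=True, exist_ok=True)
    resized.save(dst_path, format="JPEG", quality=JPEG_QUALITY)
    return True
```

Returning `False` instead of raising lets a batch job skip corrupted files and keep going, which mirrors the "corrupted files removed" behavior described in step 1.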
💻 Usage Guide (PyTorch)
-----------------------
```python
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# 1. Define transformations (images are already 512x512, so no resize is needed)
transform = transforms.Compose([
    transforms.ToTensor(),
    # Maps each channel from [0, 1] to [-1, 1]
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# 2. Load dataset (class labels come from the top-level folder names)
dataset_path = "./HybridForensics_Dataset_512"
dataset = datasets.ImageFolder(root=dataset_path, transform=transform)

# 3. Create loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# 4. Verify classes
print(dataset.classes)
# ['Fake_Diffusion', 'Fake_GAN', 'Real']
```
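Note that `ImageFolder` assigns three labels in alphabetical order (`Fake_Diffusion` = 0, `Fake_GAN` = 1, `Real` = 2), whereas the composition section describes a binary scheme with Real as class 0 and Fake as class 1. One way to bridge the two, sketched here under the assumption that a binary real-vs-fake target is wanted, is a small `target_transform`. The class name `BinaryForensicsTarget` is hypothetical.

```python
class BinaryForensicsTarget:
    """Map ImageFolder's folder-based labels to 0 = real, 1 = fake.

    Defined as a module-level class (rather than a lambda) so that it can be
    pickled when a DataLoader uses worker processes.
    """

    def __init__(self, class_to_idx, real_class="Real"):
        if real_class not in class_to_idx:
            raise KeyError(f"class {real_class!r} not found in {sorted(class_to_idx)}")
        self.real_idx = class_to_idx[real_class]

    def __call__(self, target):
        return 0 if target == self.real_idx else 1


# Mapping that ImageFolder derives from the three top-level folders.
class_to_idx = {"Fake_Diffusion": 0, "Fake_GAN": 1, "Real": 2}
to_binary = BinaryForensicsTarget(class_to_idx)
print([to_binary(i) for i in range(3)])  # [1, 1, 0]
```

With a loaded dataset, this can be attached via `datasets.ImageFolder(root=dataset_path, transform=transform, target_transform=BinaryForensicsTarget(class_to_idx))`, or by building it from `dataset.class_to_idx` after construction. Keeping the three-way labels instead is equally valid when the goal is to distinguish GAN from diffusion sources.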
⚠️ Limitations & Ethical Notes
------------------------------
- **Proxy data:** SDXL, Midjourney, and StyleGAN3 folders use texture-rich proxies, not direct generations. Best suited for artifact detection, not semantic fidelity studies.
- **Bias:** Source datasets (FFHQ, COCO) may contain demographic and content biases.
📄 Citation
-----------
If you use this dataset structure in your research, please cite:
```
@dataset{hybridforensics2025,
author = {SHANMUKESH BONALA},
title = {HybridForensicsNet: Standardized 512px Image Forensics Dataset},
year = {2025},
publisher = {Hugging Face},
version = {1.0.0}
}
```
Provider: Shanmuk4622



