
Shanmuk4622/HybridForensicsNet-Standardized-Dataset-for-GAN-Diffusion-Detection

Source: Hugging Face. Updated 2025-12-11; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Shanmuk4622/HybridForensicsNet-Standardized-Dataset-for-GAN-Diffusion-Detection
Resource description:
License: cc-by-nc-4.0

HybridForensicsNet: Standardized 512px Image Forensics Dataset
=============================================================

📌 Executive Summary
--------------------

HybridForensicsNet is a curated, balanced dataset for training and benchmarking digital image forensic algorithms. It targets a **hybrid threat model** covering both Generative Adversarial Networks (GANs) and Latent Diffusion Models (LDMs). All 10,000 images are standardized to **512×512** using high-quality Lanczos resampling, which keeps high-frequency artifacts intact for CNNs and ViTs.

📊 Dataset Specifications
------------------------

| Metric       | Value          | Description                               |
| ------------ | -------------- | ----------------------------------------- |
| Total Images | 10,000         | Balanced across Real and Fake classes     |
| Total Size   | ~780 MB        | High-quality JPEG compression             |
| Format       | JPEG (Q=95)    | Quality factor 95                         |
| Resolution   | 512 × 512      | Fixed dimensions (Lanczos resampled)      |
| Channels     | RGB            | 3 channels (standardized)                 |
| Balance      | 50% Real/Fake  | Strictly balanced to prevent prior bias   |

📂 Directory Structure
---------------------

The dataset follows the class-per-directory layout expected by torchvision's `ImageFolder` and by comparable directory-based loaders in TensorFlow.

```
HybridForensics_Dataset_512/
├── Real/                 (5,000 images)
│   ├── FFHQ/             (2,500) - High-quality human faces
│   └── MS_COCO/          (2,500) - General objects & scenes
│
├── Fake_GAN/             (2,500 images)
│   ├── ProGAN/           (1,250) - GAN-generated anime/faces
│   └── StyleGAN3/        (1,250) - Structural/texture proxy*
│
└── Fake_Diffusion/       (2,500 images)
    ├── SDXL/             (1,250) - Structural/texture proxy*
    └── Midjourney/       (1,250) - Structural/texture proxy*
```

🧠 Data Composition & Provenance
-------------------------------

**Class 0: REAL (Authentic Imagery)**

- **FFHQ (Flickr-Faces-HQ):** Official NVIDIA mirror; diverse, high-quality portraits.
- **MS COCO:** Real-world scenes, organic textures, and man-made objects; mitigates overfitting to faces.

**Class 1: FAKE (Synthetic Imagery)**

- **ProGAN:** AnimeFace-derived; exhibits early GAN artifacts (checkerboarding, asymmetry).
- **StyleGAN3 / SDXL / Midjourney (Texture Proxies):**
  - *Curation:* High-frequency texture samples drawn from Food101, chosen to guarantee stable, high-resolution 512px imagery.
  - *Utility:* Structural proxies for training on the texture and color anomalies common to high-resolution synthesis.

🛠️ Preprocessing Pipeline
------------------------

1) Format validation: header integrity is checked and corrupted files are removed.
2) Channel standardization: grayscale and CMYK images are converted to RGB.
3) Lanczos resampling to 512×512 (`PIL.Image.LANCZOS`) to preserve spectral detail.
4) JPEG save at Q=95 to balance fidelity and size.

💻 Usage Guide (PyTorch)
-----------------------

```python
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# 1. Define transformations (images are already 512x512, so no resize is needed)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# 2. Load dataset
dataset_path = "./HybridForensics_Dataset_512"
dataset = datasets.ImageFolder(root=dataset_path, transform=transform)

# 3. Create loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# 4. Verify classes (ImageFolder assigns indices 0, 1, 2 in alphabetical order)
print(dataset.classes)  # ['Fake_Diffusion', 'Fake_GAN', 'Real']
```

⚠️ Limitations & Ethical Notes
------------------------------

- **Proxy data:** The SDXL, Midjourney, and StyleGAN3 folders use texture-rich proxies, not direct generations. The dataset is best suited for artifact detection, not semantic fidelity studies.
- **Bias:** The source datasets (FFHQ, COCO) may contain demographic and content biases.

📄 Citation
-----------

If you use this dataset structure in your research, please cite:

```
@dataset{hybridforensics2025,
  author    = {SHANMUKESH BONALA},
  title     = {HybridForensicsNet: Standardized 512px Image Forensics Dataset},
  year      = {2025},
  publisher = {Hugging Face},
  version   = {1.0.0}
}
```
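The four preprocessing steps listed in the card can be sketched with Pillow as follows. This is a minimal illustration, not the author's actual script: the function name `standardize_image` is hypothetical, and the card does not say whether aspect ratio was preserved or cropped, so the plain `resize` below (which stretches to a square) is an assumption.

```python
import io
from typing import Optional

from PIL import Image

TARGET_SIZE = (512, 512)  # resolution stated in the dataset card
JPEG_QUALITY = 95         # quality factor stated in the dataset card


def standardize_image(raw_bytes: bytes) -> Optional[bytes]:
    """Return standardized JPEG bytes, or None if the input cannot be decoded.

    Hypothetical helper mirroring the card's four steps.
    """
    # Step 1: format validation. verify() checks integrity but leaves the
    # image object unusable, so the data is reopened and fully decoded.
    try:
        Image.open(io.BytesIO(raw_bytes)).verify()
        img = Image.open(io.BytesIO(raw_bytes))
        img.load()
    except Exception:
        return None  # treat any decode failure as a corrupted file

    # Step 2: channel standardization (grayscale, CMYK, etc. to RGB).
    img = img.convert("RGB")

    # Step 3: Lanczos resampling to 512x512.
    # Assumption: a direct resize without cropping or padding.
    img = img.resize(TARGET_SIZE, resample=Image.LANCZOS)

    # Step 4: save as JPEG at quality 95.
    out = io.BytesIO()
    img.save(out, format="JPEG", quality=JPEG_QUALITY)
    return out.getvalue()
```

Working on bytes rather than file paths keeps the sketch easy to test; in practice the same function could be wrapped in a loop that reads each source file and writes the result into the matching class folder.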
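Note that the card describes two classes (Class 0: REAL, Class 1: FAKE), while `ImageFolder` derives three labels from the top-level folders in alphabetical order, which puts `Real` at index 2. For binary real-vs-fake training, the labels need remapping. The sketch below shows one way to do that; the helper name `make_binary_target_transform` is hypothetical and not part of the dataset or of torchvision.

```python
# Top-level folder names taken from the directory structure in the card.
CLASSES = ["Real", "Fake_GAN", "Fake_Diffusion"]


def make_binary_target_transform(class_to_idx):
    """Build a function mapping ImageFolder indices to 0 = real, 1 = fake."""
    real_idx = class_to_idx["Real"]

    def to_binary(target):
        # Every class other than "Real" is treated as synthetic.
        return 0 if target == real_idx else 1

    return to_binary


# ImageFolder sorts folder names before assigning indices; this reproduces
# that mapping without requiring the images on disk.
class_to_idx = {name: i for i, name in enumerate(sorted(CLASSES))}
to_binary = make_binary_target_transform(class_to_idx)

print(class_to_idx)
print([to_binary(i) for i in range(len(CLASSES))])
```

With the dataset loaded as in the usage guide, the same helper can be attached after construction via `dataset.target_transform = make_binary_target_transform(dataset.class_to_idx)`, so that each sample's label comes back as 0 or 1. Keeping the three-way labels instead is equally valid if the goal is to distinguish GAN from diffusion sources.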

Provider:
Shanmuk4622