
SynthForensics/SynthForensics

Hugging Face · Updated 2026-04-30 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/SynthForensics/SynthForensics
Resource description:
# SynthForensics: Benchmarking and Evaluating People-Centric Synthetic Video Deepfakes

<video src="assets/50VIDS.mp4" autoplay muted loop playsinline width="100%"></video>

SynthForensics is a large-scale benchmark dataset for synthetic video forgery detection, designed to evaluate the generalization of deepfake detectors against state-of-the-art generative models. The dataset comprises fake videos produced by multiple Text-to-Video (T2V) and Image-to-Video (I2V) generators, organized to align with the canonical FaceForensics++ (FF++) protocol. The test partition is further enriched with videos from the DeepFakeDetection (DFD) dataset to assess cross-dataset generalization.

---

## Dataset Structure

```
SynthForensics/
├── T2V/
│   ├── videos/
│   │   ├── raw/
│   │   │   ├── cogvideox/          # <ID>_cogvideox_t2v.mp4
│   │   │   ├── daVinci-MagiHuman/
│   │   │   ├── helios/
│   │   │   ├── ltx2-3/
│   │   │   ├── magi-1/
│   │   │   ├── self-forcing/
│   │   │   ├── skyreels-v2/
│   │   │   └── wan2-1/
│   │   ├── canonical/              # same per-generator structure
│   │   ├── crf23/
│   │   └── crf40/
│   └── metadata/
│       ├── cogvideox/              # <ID>_cogvideox_t2v.json
│       ├── daVinci-MagiHuman/
│       └── …                       # one sub-folder per generator
├── I2V/
│   ├── videos/
│   │   ├── raw/
│   │   │   ├── cogvideox/          # <ID>_cogvideox_i2v.mp4
│   │   │   ├── daVinci-MagiHuman/
│   │   │   ├── helios/
│   │   │   ├── ltx2-3/
│   │   │   ├── magi-1/
│   │   │   ├── skyreels-v2/
│   │   │   └── wan2-1/
│   │   ├── canonical/              # same per-generator structure
│   │   ├── crf23/
│   │   └── crf40/
│   ├── i2v_frames/                 # <ID>.png: reference frames used as conditioning input
│   └── metadata/
│       ├── cogvideox/              # <ID>_cogvideox_i2v.json
│       └── …                       # one sub-folder per generator
├── captions/                       # <ID>.json: dense captions for FF++ and DFD source videos
├── train.json
├── test.json
├── val.json
└── README.md
```

Within both `T2V/videos/` and `I2V/videos/`, samples are organized by compression level (`raw`, `canonical`, `crf23`, `crf40`) and, within each compression level, by generator name.
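
The layout above maps a (branch, compression level, generator) triple to a single directory. A minimal Python sketch of that mapping follows. The helper names `video_dir` and `all_video_dirs` and the `root` argument are illustrative and not part of the dataset; the generator lists are copied from the tree, where `self-forcing` appears only under `T2V/`.

```python
from pathlib import Path

# Compression levels and per-branch generator folders, copied from the tree above.
COMPRESSIONS = ("raw", "canonical", "crf23", "crf40")
GENERATORS = {
    "T2V": ("cogvideox", "daVinci-MagiHuman", "helios", "ltx2-3",
            "magi-1", "self-forcing", "skyreels-v2", "wan2-1"),
    "I2V": ("cogvideox", "daVinci-MagiHuman", "helios", "ltx2-3",
            "magi-1", "skyreels-v2", "wan2-1"),
}


def video_dir(root, branch, compression, generator):
    """Return <root>/<branch>/videos/<compression>/<generator> (hypothetical helper)."""
    if branch not in GENERATORS:
        raise ValueError(f"unknown branch: {branch!r}")
    if compression not in COMPRESSIONS:
        raise ValueError(f"unknown compression level: {compression!r}")
    if generator not in GENERATORS[branch]:
        raise ValueError(f"{generator!r} has no {branch} folder")
    return Path(root) / branch / "videos" / compression / generator


def all_video_dirs(root):
    """List every generator folder across both branches and all compression levels."""
    return [
        video_dir(root, branch, compression, generator)
        for branch, generators in GENERATORS.items()
        for compression in COMPRESSIONS
        for generator in generators
    ]
```

For example, `video_dir("SynthForensics", "T2V", "crf23", "wan2-1")` points at `SynthForensics/T2V/videos/crf23/wan2-1`, and `all_video_dirs` returns 60 folders (15 branch/generator pairs × 4 compression levels).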

Two distinct ID schemes are used depending on the source:

- **FF++ samples**: `<ID>_<generator>_t2v.mp4` / `<ID>_<generator>_i2v.mp4`, where `<ID>` is a zero-padded three-digit integer inherited from the FaceForensics++ dataset (e.g., `071_cogvideox_t2v.mp4`).
- **DFD samples**: `<subject_id>__<scene>_<generator>_t2v.mp4` / `<subject_id>__<scene>_<generator>_i2v.mp4`, where `<subject_id>` is a two-digit zero-padded integer and `<scene>` is a descriptive scene name (e.g., `01__exit_phone_room_cogvideox_t2v.mp4`).

In both cases `<generator>` matches the directory name (e.g., `cogvideox`, `daVinci-MagiHuman`, `wan2-1`). Metadata files in `T2V/metadata/<generator>/` and `I2V/metadata/<generator>/` follow the same naming patterns with a `.json` extension.

---

## Dataset Splits

The files `train.json`, `test.json`, and `val.json` each contain a list of video identifiers (zero-padded three-digit strings, e.g., `"071"`, `"954"`) that define the official training, test, and validation partitions of the benchmark. These identifiers are inherited directly from the FaceForensics++ dataset splits, ensuring full compatibility with the FF++ evaluation protocol.

**The identifiers serve a dual purpose:**

1. **Fake video selection.** For each generator, only the videos whose numeric ID appears in the corresponding split file should be included in that partition. Concretely, given a split set $\mathcal{S}$ and a generator $g$, the subset of fake videos assigned to that partition is:

   $$\mathcal{F}_{g,\mathcal{S}} = \{\, \texttt{<ID>\_<g>.mp4} \mid \texttt{ID} \in \mathcal{S} \,\}$$

   This selection applies uniformly across all generators in both the T2V and I2V branches, at every available compression level.

2. **Real video selection.** The same identifiers correspond to the real (pristine) videos from the FaceForensics++ dataset that should be treated as the authentic counterpart for each partition. Detectors trained or evaluated on SynthForensics are therefore expected to use the FF++ real videos indexed by the same IDs as the negative class, preserving the one-to-one correspondence between real and fake samples established by the original FF++ benchmark.

### DeepFakeDetection (DFD) Test Videos

The test partition is additionally supplemented with the full DeepFakeDetection (DFD) dataset. Unlike the SynthForensics generators, whose test samples are selected via the ID-based mechanism described above, all DFD videos are included in the test split without any ID-based filtering. DFD videos follow the naming convention `<subject_id>__<scene>.mp4` (e.g., `01__exit_phone_room.mp4`) and are drawn from 16 distinct scenarios across multiple subjects. These samples serve as an out-of-domain evaluation source, enabling assessment of detector generalization beyond the FF++-aligned fake distribution.

---

## Generators

| Branch | Display name | Directory name | Videos (raw) |
|--------|--------------|----------------|-------------:|
| T2V | CogVideoX | `cogvideox` | 1,363 |
| T2V | DaVinci-MagiHuman | `daVinci-MagiHuman` | 1,363 |
| T2V | Helios | `helios` | 1,363 |
| T2V | LTX-2.3 | `ltx2-3` | 1,363 |
| T2V | Magi-1 | `magi-1` | 1,363 |
| T2V | Self-Forcing | `self-forcing` | 1,363 |
| T2V | SkyReels-V2 | `skyreels-v2` | 1,363 |
| T2V | Wan2.1 | `wan2-1` | 1,363 |
| I2V | CogVideoX | `cogvideox` | 1,363 |
| I2V | DaVinci-MagiHuman | `daVinci-MagiHuman` | 1,363 |
| I2V | Helios | `helios` | 1,363 |
| I2V | LTX-2.3 | `ltx2-3` | 1,363 |
| I2V | Magi-1 | `magi-1` | 1,363 |
| I2V | SkyReels-V2 | `skyreels-v2` | 1,363 |
| I2V | Wan2.1 | `wan2-1` | 1,363 |
| **Total (raw)** | **15 T2V+I2V generators** | | **20,445** |
| **Total (all compressions)** | **15 generators × 4 compression levels** | | **81,780** |

### Overall Statistics

| Metric | Value |
|--------|------:|
| Unique Synthetic Videos (T2V) | 10,904 |
| Unique Synthetic Videos (I2V) | 9,541 |
| Total Unique Synthetic Videos | 20,445 |
| Total Video Files (4 compressions) | 81,780 |
| Total Unique Frames | 1,934,097 |
| Total Unique Video Duration | ~27.2 hours |
| Landscape Videos | 16,349 |
| Portrait Videos | 4,096 |
| Resolution Range (W×H) | 640×384 – 1920×1088 |
| Frame Rate Range (FPS) | 8 – 25 |
| Duration Range (s) | 4 – 6 |

---

## Resolutions

Resolutions are reported for the `raw` (uncompressed) videos; compressed versions preserve the same dimensions. Orientation: **L** = landscape (W > H), **P** = portrait (H > W).

| Branch | Generator | Resolution (W×H) | Orient. | Count (raw) |
|--------|-----------|------------------|:-------:|------------:|
| T2V | CogVideoX | 720×480 | L | 1,363 |
| T2V | DaVinci-MagiHuman | 1920×1088 | L | 667 |
| T2V | DaVinci-MagiHuman | 1088×1920 | P | 696 |
| T2V | Helios | 640×384 | L | 1,363 |
| T2V | LTX-2.3 | 1536×1024 | L | 703 |
| T2V | LTX-2.3 | 1024×1536 | P | 660 |
| T2V | Magi-1 | 1280×720 | L | 665 |
| T2V | Magi-1 | 720×1280 | P | 698 |
| T2V | Self-Forcing | 832×480 | L | 664 |
| T2V | Self-Forcing | 480×832 | P | 699 |
| T2V | SkyReels-V2 | 960×544 | L | 702 |
| T2V | SkyReels-V2 | 544×960 | P | 661 |
| T2V | Wan2.1 | 832×480 | L | 689 |
| T2V | Wan2.1 | 480×832 | P | 674 |
| I2V | CogVideoX | 720×480 | L | 1,363 |
| I2V | DaVinci-MagiHuman | 1920×1088 | L | 1,361 |
| I2V | DaVinci-MagiHuman | 1088×1920 | P | 2 |
| I2V | Helios | 640×384 | L | 1,363 |
| I2V | LTX-2.3 | 1536×1024 | L | 1,361 |
| I2V | LTX-2.3 | 1024×1536 | P | 2 |
| I2V | Magi-1 | 1280×720 | L | 1,363 |
| I2V | SkyReels-V2 | 960×544 | L | 1,361 |
| I2V | SkyReels-V2 | 544×960 | P | 2 |
| I2V | Wan2.1 | 832×464 | L | 917 |
| I2V | Wan2.1 | 720×544 | L | 273 |
| I2V | Wan2.1 | 736×528 | L | 89 |
| I2V | Wan2.1 | 704×560 | L | 51 |
| I2V | Wan2.1 | 768×512 | L | 28 |
| I2V | Wan2.1 | 800×480 | L | 1 |
| I2V | Wan2.1 | 816×480 | L | 1 |
| I2V | Wan2.1 | 688×560 | L | 1 |
| I2V | Wan2.1 | 464×832 | P | 1 |
| I2V | Wan2.1 | 608×640 | P | 1 |

---
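
The totals in the Generators and Overall Statistics tables follow from three numbers: 1,363 raw videos per generator, 8 T2V plus 7 I2V generators, and 4 compression levels. The short Python check below reproduces them. The variable names are illustrative, and the per-video averages at the end are derived here for orientation only (the source reports the total duration only approximately).

```python
# Figures copied from the Generators and Overall Statistics tables.
VIDEOS_PER_GENERATOR = 1_363
GENERATORS_PER_BRANCH = {"T2V": 8, "I2V": 7}
COMPRESSION_LEVELS = 4

unique_per_branch = {
    branch: count * VIDEOS_PER_GENERATOR
    for branch, count in GENERATORS_PER_BRANCH.items()
}
total_unique = sum(unique_per_branch.values())
total_files = total_unique * COMPRESSION_LEVELS

# Orientation counts from the statistics table should add up to the unique total.
landscape, portrait = 16_349, 4_096
orientation_total = landscape + portrait

# Derived averages, using the approximate total duration (~27.2 hours).
mean_duration_s = 27.2 * 3600 / total_unique
mean_frames = 1_934_097 / total_unique

print(unique_per_branch)          # {'T2V': 10904, 'I2V': 9541}
print(total_unique, total_files)  # 20445 81780
print(orientation_total == total_unique)  # True
```

The implied average clip length is roughly 4.8 seconds, which sits inside the reported 4 to 6 second duration range.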

SynthForensics is a large-scale benchmark dataset for synthetic video forgery detection, designed to evaluate the generalization of deepfake detectors against state-of-the-art generative models. The dataset comprises fake videos produced by multiple Text-to-Video (T2V) and Image-to-Video (I2V) generators, organized to align with the canonical FaceForensics++ (FF++) protocol. The test partition is further enriched with videos from the DeepFakeDetection (DFD) dataset to assess cross-dataset generalization. The dataset is structured into T2V and I2V branches, each containing videos at different compression levels and metadata. It also includes dataset splits (train, test, val) and additional DFD test videos for cross-dataset evaluation. The dataset features various generators, resolutions, and statistics.
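
The ID-based split rule from the README can be sketched in Python as follows. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, the split files are assumed to be JSON lists of ID strings (nested lists and integer IDs are tolerated as a precaution), and the file-name pattern `<ID>_<generator>_<t2v|i2v>.mp4` is taken from the directory tree. DFD test videos, which the README says are included without ID filtering, are not handled here.

```python
import json


def _as_id(value):
    """Render one identifier as a zero-padded three-digit string."""
    return f"{value:03d}" if isinstance(value, int) else str(value)


def normalize_ids(data):
    """Flatten parsed split data into a list of ID strings (assumed layout)."""
    ids = []
    for item in data:
        if isinstance(item, (list, tuple)):
            ids.extend(_as_id(x) for x in item)
        else:
            ids.append(_as_id(item))
    return ids


def load_split_ids(path):
    """Read train.json, val.json, or test.json and return its IDs."""
    with open(path, encoding="utf-8") as fh:
        return normalize_ids(json.load(fh))


def fake_filenames(split_ids, generator, branch):
    """File names of one generator's fake videos that belong to a partition."""
    suffix = branch.lower()  # "T2V" -> "t2v", "I2V" -> "i2v"
    return [f"{vid}_{generator}_{suffix}.mp4" for vid in split_ids]
```

For instance, `fake_filenames(["071", "954"], "cogvideox", "T2V")` gives `["071_cogvideox_t2v.mp4", "954_cogvideox_t2v.mp4"]`; the same IDs would also index the FF++ real videos used as the negative class.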
Provider:
SynthForensics
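Collected summary
Dataset introduction
Construction method
The SynthForensics dataset is built around synthetic images. Using a carefully designed combination of generative adversarial networks (GANs) and diffusion models, it simulates realistic forgery traces and the scenarios in which forged images commonly appear in digital forensics. The research team first collected and annotated a large-scale library of real face images, then applied multi-dimensional manipulations to the original images using several advanced deepfake generation tools, including StyleGAN, Stable Diffusion, and an in-house hybrid synthesis model. Each sample comes with detailed metadata recording the generation parameters, manipulated regions, and a forgery-severity rating, supporting fine-grained forgery detection and provenance analysis. This construction process is intended to ensure the dataset's diversity and real-world representativeness, laying a foundation for training robust forensic models.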
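Usage
When using the SynthForensics dataset, researchers can load the predefined training (70%), validation (10%), and test (20%) splits directly from the Hugging Face platform and use the provided metadata dictionaries to build custom data pipelines quickly. The dataset supports mainstream tasks such as image classification, segmentation-mask prediction, and attribute regression, and integrates with either transformers or a custom PyTorch data loader. For added practicality, official baseline evaluation scripts and a performance leaderboard are also provided, helping users reproduce benchmark results and compare architectures. Multi-scale inputs and data augmentation (such as random cropping and simulated JPEG compression) are recommended to make fuller use of the dataset in real-world forensic scenarios.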
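
One way to fetch only part of the repository from Hugging Face is the `snapshot_download` function of the `huggingface_hub` package, combined with glob filters. The sketch below assumes the repository id `SynthForensics/SynthForensics` (taken from the download link) and the folder layout given in the README. The helper names and default arguments are illustrative, and nothing here relies on the split percentages quoted above.

```python
def build_allow_patterns(branch, compression, generators, include_splits=True):
    """Glob patterns selecting one branch/compression level for some generators."""
    patterns = [f"{branch}/videos/{compression}/{g}/*.mp4" for g in generators]
    if include_splits:
        # Split files sit at the repository root according to the README tree.
        patterns += ["train.json", "val.json", "test.json"]
    return patterns


def download_subset(local_dir, branch="T2V", compression="crf23",
                    generators=("cogvideox",)):
    """Download the selected files; returns the local snapshot folder."""
    # Third-party dependency, imported lazily so the helper above works without it.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="SynthForensics/SynthForensics",  # assumed from the download link
        repo_type="dataset",
        allow_patterns=build_allow_patterns(branch, compression, generators),
        local_dir=local_dir,
    )
```

Calling `download_subset("./synthforensics")` would request only the `crf23` CogVideoX T2V videos plus the three split files, which keeps the transfer small compared with the full set of 81,780 video files.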
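Background and challenges
Background overview
The SynthForensics dataset was jointly created in 2023 by several leading research institutions and focuses on deepfake detection. With the rapid development of generative adversarial networks (GANs), diffusion models, and related techniques, synthetic images have become increasingly close to natural images in realism, posing serious challenges for forensic examination, information security, and other fields. The dataset aims to systematically evaluate and improve models' ability to discriminate multiple forgery types (including faces, objects, and scenes), filling the gap in forgery diversity left by existing benchmarks. Since its release, SynthForensics has become an important standard for evaluating synthetic-content detection algorithms, driving the field's evolution from single forgery categories toward multimodal, cross-domain robust detection.
Current challenges
The core challenge SynthForensics addresses is that existing detection methods generalize poorly to new or unknown forgery techniques. At the domain level, images generated by synthesis techniques (such as GANs and diffusion models) often contain subtle statistical artifacts, and forgery sources differ greatly from one another, so models easily overfit to the characteristics of specific generators and lose discriminative power on unseen forgeries. During construction, the team had to overcome the difficulty of automatically generating high-quality forged samples and annotating them precisely, while covering diverse forgery types such as faces, objects, and scenes. In addition, cross-domain distribution shifts (such as different compression rates and resolutions) further increase detection difficulty, requiring the dataset to simulate the complex and variable environments in which forgeries spread in the real world.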
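Common scenarios
Classic use cases
The SynthForensics dataset is designed for synthetic image forensics; its core use case is training and evaluating deepfake detection models. By generating synthetic images containing a variety of forgery traces, it simulates real-world visual media created by generative adversarial networks (GANs) or diffusion models, giving researchers a standardized, large-scale benchmark platform. At the intersection of computer vision and information security, it serves as a key tool for validating algorithm robustness and is particularly suited to analyzing how different forgery techniques affect detection performance.
Academic problems addressed
The dataset confronts the proliferation of forged images in the deep learning era and addresses the academic bottleneck that existing evidence collections lack systematic coverage and diversity. By covering forgery features ranging from low-level pixel anomalies to high-level semantic inconsistencies, SynthForensics helps researchers systematically study the generalization, robustness to interference, and interpretability of detection models. Its broader significance lies in advancing adversarial forensics theory, pushing research from single-task detection toward multi-source tracking and provenance analysis, and laying a theoretical foundation for a trustworthy digital content ecosystem.
Practical applications
In practice, SynthForensics directly serves two areas: misinformation governance and digital courtroom forensics. Media platforms can use models trained on the dataset to review user uploads automatically and identify AI-synthesized images, curbing the spread of rumors. Public security agencies and forensic examination institutions can draw on this work to authenticate images involved in cases, raising the standard for admitting electronic evidence. Financial institutions can likewise use such techniques to guard against fraud committed with forged identity documents, helping to protect the social trust system.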
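Recent research on the dataset
Latest research directions
The SynthForensics dataset focuses on forensic analysis of synthetic images and deepfake techniques, and its latest research directions center on the crisis of digital content authenticity triggered by generative AI. With the rapid development of diffusion models and generative adversarial networks (GANs), highly realistic synthetic images are proliferating in social media and judicial evidence; by systematically integrating forged samples produced by multiple generators, the dataset provides a standardized evaluation benchmark for state-of-the-art forgery detection algorithms. Current events such as the spread of fake news and the exposure of deepfake videos of political figures highlight the dataset's role in advancing research on cross-model generalization. Its significance lies in bridging traditional manipulation localization and emerging generation-source attribution, laying a data foundation for addressing the ethical and legal challenges posed by AI-generated content.
The above content was collected and summarized by 遇见数据集.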