SWAN-DF database of audio-video deepfakes

Mendeley Data · Updated 2024-05-10 · Indexed 2024-06-28

Download link:

https://zenodo.org/records/8365616

Resource description:

Description

SWAN-DF is the first high-fidelity, publicly available dataset of realistic audio-visual deepfakes in which both faces and voices look and sound like the target person. The dataset is based on the public SWAN database of real videos recorded in HD on iPhone and iPad Pro in 2019. For 30 manually selected pairs of people from SWAN, we swapped faces and voices. Faces were swapped using several autoencoder-based face swapping models and several blending techniques from the well-known open-source repository DeepFaceLab. Voices were swapped using voice conversion (or voice cloning) methods, including zero-shot YourTTS, DiffVC, HiFiVC, and several models from FreeVC.

For each model and each blending technique, there are 960 video deepfakes. We used three types of models, with resolutions of 160x160, 256x256, and 320x320 pixels. We took one pretrained model for each resolution and tuned it for each of the 30 pairs of subjects (both ways) for 50K iterations. Then, when generating deepfake videos for each pair of subjects, we used one of the tuned models together with a method for blending the generated image back into the original frame, which we call the blending technique. SWAN-DF contains 25 different combinations of models and blending, so the total number of deepfake videos is 960*25=24000.

We generated speech deepfakes using four voice conversion methods: YourTTS, HiFiVC, DiffVC, and FreeVC. We did not use text-to-speech methods for our video deepfakes, since the speech they produce is not synchronized with the lip movements in the video. For YourTTS, HiFiVC, and DiffVC, we used the pretrained models provided by the authors. HiFiVC was pretrained on VCTK, DiffVC on LibriTTS, and YourTTS on both VCTK and LibriTTS.

For FreeVC, we generated audio deepfakes for several variants. We used the provided pretrained models as is (the 16 kHz model with and without the pretrained speaker encoder, and the 24 kHz model with the pretrained speaker encoder). We also tuned the 16 kHz model, either from scratch or starting from the pretrained version, for different numbers of iterations on a mixture of VCTK and SWAN data. In total, SWAN-DF contains 12 different variations of audio deepfakes: one each for YourTTS, HiFiVC, and DiffVC, and 9 variants of FreeVC.

Acknowledgements

If you use this database, please cite the following publication: Pavel Korshunov, Haolin Chen, Philip N. Garner, and Sébastien Marcel, "Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes", IEEE International Joint Conference on Biometrics (IJCB), September 2023. https://publications.idiap.ch/publications/show/5092
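The counts quoted in the description can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are made up for this example, and the per-direction figure is a derived value that assumes the 960 videos of each combination are spread evenly over the 30 pairs swapped both ways, which the description does not state explicitly.

```python
# Figures taken from the dataset description.
NUM_PAIRS = 30                  # manually selected subject pairs
VIDEOS_PER_COMBINATION = 960    # video deepfakes per (model, blending) combination
NUM_VIDEO_COMBINATIONS = 25     # combinations of face model and blending technique

# Audio deepfake variants per voice conversion method, as listed in the description.
AUDIO_VARIANTS = {"YourTTS": 1, "HiFiVC": 1, "DiffVC": 1, "FreeVC": 9}


def total_video_deepfakes() -> int:
    """Total number of video deepfakes across all combinations."""
    return VIDEOS_PER_COMBINATION * NUM_VIDEO_COMBINATIONS


def total_audio_variants() -> int:
    """Total number of audio deepfake variations across all methods."""
    return sum(AUDIO_VARIANTS.values())


def videos_per_direction() -> float:
    """Assumption (not stated in the source): videos are split evenly
    over the 30 pairs, each swapped in both directions."""
    return VIDEOS_PER_COMBINATION / (NUM_PAIRS * 2)


if __name__ == "__main__":
    print(total_video_deepfakes())   # 24000
    print(total_audio_variants())    # 12
    print(videos_per_direction())    # 16.0
```

The first two values match the totals given in the description (24000 videos and 12 audio variations); the third is only a plausibility check under the even-split assumption.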

Created:

2023-09-25

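Summary

Dataset introduction

Background and challenges

Background overview

SWAN-DF is the first publicly available high-fidelity audio-video deepfake dataset. It contains 24,000 deepfake videos generated from the SWAN database using several face swapping models and voice conversion techniques. The dataset emphasizes the high realism of both faces and voices and is intended for non-commercial research use.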
