WailyWang/VCapAV

Name: WailyWang/VCapAV
Creator: WailyWang
Published: 2025-11-28 08:14:11
License: MIT (per the dataset card)

Source: Hugging Face. Updated 2025-11-28; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/WailyWang/VCapAV


Description:

---
license: mit
task_categories:
- text-to-audio
language:
- en
tags:
- audio
- environmental sound
- deepfake
- TTA
- V2A
size_categories:
- 100K<n<1M
---

# Dataset Card for VCapAV

VCapAV is a large-scale audio-visual deepfake detection dataset focused on **non-speech environmental sounds**. It introduces new multimodal deepfake scenarios using both **Text-to-Audio (TTA)** and **Video-to-Audio (V2A)** pipelines, together with **Text-to-Video (TTV)** synthesis. The dataset contains **90,990 clips, totaling 252.75 hours**, and supports audio-only, visual-only, and audio-visual detection tasks.

### Dataset Description

VCapAV addresses the lack of multimodal deepfake data involving environmental sounds. Unlike existing datasets focused on speech or face-centric manipulations, VCapAV introduces a comprehensive set of environmental audio generation methods and high-fidelity video forgeries.

- **Curated by:** Duke Kunshan University, University of Yamanashi, Wuhan University
- **Funded by:** DKU Foundation Project “Emerging AI Technologies for Natural Language Processing”
- **Shared by:** Authors of the VCapAV paper
- **Language(s):** English (captions)
- **License:** MIT License

### Dataset Sources

- **Repository:** https://github.com/wailywang/VCapAV/
- **Paper:** [*VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset*](https://www.isca-archive.org/interspeech_2025/wang25q_interspeech.html)
- **Demo:** https://vcapav.github.io/

### Dataset Uses

- Audio anti-spoofing research
- Audio-visual deepfake detection
- Evaluation of general-purpose audio generation methods
- Studying modality consistency between vision and sound
- Research on multimodal synchronization, scene-aware generation, and cross-modal alignment

### Dataset Creation

Most deepfake datasets focus on speech or human faces. VCapAV fills this gap by focusing on **general environmental audio** and **video–audio consistency**, enabling research on non-speech deepfake detection.

The dataset is constructed from a subset of **VGGSound** (15,446 videos).

### Citation

```bibtex
@inproceedings{wang2025vcapav,
  title={VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset},
  author={Wang, Yuxi and Wang, Yikang and Zhang, Qishan and Nishizaki, Hiromitsu and Li, Ming},
  booktitle={Interspeech},
  year={2025}
}
```
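The clip count and total duration quoted in the card (90,990 clips, 252.75 hours) imply an average clip length, which is a quick sanity check when budgeting storage or compute. The sketch below is only arithmetic on those two published figures; the function name `mean_clip_seconds` is a hypothetical helper, not part of any VCapAV tooling, and the card itself does not state the length of individual clips.

```python
# Figures quoted in the VCapAV dataset card.
TOTAL_CLIPS = 90_990
TOTAL_HOURS = 252.75


def mean_clip_seconds(total_hours: float, total_clips: int) -> float:
    """Return the average clip duration in seconds (hypothetical helper)."""
    if total_clips <= 0:
        raise ValueError("total_clips must be positive")
    # Convert hours to seconds, then divide by the number of clips.
    return total_hours * 3600.0 / total_clips


if __name__ == "__main__":
    avg = mean_clip_seconds(TOTAL_HOURS, TOTAL_CLIPS)
    print(f"{avg:.2f} s per clip")  # prints "10.00 s per clip"
```

This is an average only: it says nothing about how clip lengths are distributed across the dataset.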

Provider:

WailyWang
