
WailyWang/VCapAV

Source: Hugging Face. Updated 2025-11-28; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/WailyWang/VCapAV
Resource description:
---
license: mit
task_categories:
- text-to-audio
language:
- en
tags:
- audio
- environmental sound
- deepfake
- TTA
- V2A
size_categories:
- 100K<n<1M
---

# Dataset Card for VCapAV

VCapAV is a large-scale audio-visual deepfake detection dataset focused on **non-speech environmental sounds**. It introduces new multimodal deepfake scenarios using both **Text-to-Audio (TTA)** and **Video-to-Audio (V2A)** pipelines, together with **Text-to-Video (TTV)** synthesis. The dataset contains **90,990 clips, totaling 252.75 hours**, and supports audio-only, visual-only, and audio-visual detection tasks.

### Dataset Description

VCapAV addresses the lack of multimodal deepfake data involving environmental sounds. Unlike existing datasets focused on speech or face-centric manipulations, VCapAV introduces a comprehensive set of environmental audio generation methods and high-fidelity video forgeries.

- **Curated by:** Duke Kunshan University, University of Yamanashi, Wuhan University
- **Funded by:** DKU Foundation Project “Emerging AI Technologies for Natural Language Processing”
- **Shared by:** Authors of the VCapAV paper
- **Language(s):** English (captions)
- **License:** MIT License

### Dataset Sources

- **Repository:** https://github.com/wailywang/VCapAV/
- **Paper:** [*VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset*](https://www.isca-archive.org/interspeech_2025/wang25q_interspeech.html)
- **Demo:** https://vcapav.github.io/

### Dataset Uses

- Audio anti-spoofing research
- Audio-visual deepfake detection
- Evaluation of general-purpose audio generation methods
- Studying modality consistency between vision and sound
- Research on multimodal synchronization, scene-aware generation, and cross-modal alignment

### Dataset Creation

Most deepfake datasets focus on speech or human faces. VCapAV fills this gap by focusing on **general environmental audio** and **video–audio consistency**, enabling research on non-speech deepfake detection.

The dataset is constructed from a subset of **VGGSound** (15,446 videos).

### Citation

```bibtex
@inproceedings{wang2025vcapav,
  title={VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset},
  author={Wang, Yuxi and Wang, Yikang and Zhang, Qishan and Nishizaki, Hiromitsu and Li, Ming},
  booktitle={Interspeech},
  year={2025}
}
```
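As a quick sanity check on the headline figures in the card (90,990 clips, 252.75 hours), the sketch below computes the implied mean clip duration. The function name `average_clip_seconds` is hypothetical and introduced only for illustration. The card does not state that every clip has the same length, so the result is an average and not a per-clip guarantee.

```python
def average_clip_seconds(num_clips: int, total_hours: float) -> float:
    """Return the mean clip duration in seconds implied by the totals."""
    if num_clips <= 0:
        raise ValueError("num_clips must be positive")
    # Convert hours to seconds, then divide evenly across all clips.
    return total_hours * 3600.0 / num_clips


# Figures quoted in the dataset card.
NUM_CLIPS = 90_990
TOTAL_HOURS = 252.75

avg = average_clip_seconds(NUM_CLIPS, TOTAL_HOURS)
print(f"Mean clip duration: {avg:.2f} s")
```

With the card's numbers this works out to a mean of 10 seconds per clip (909,900 seconds divided by 90,990 clips).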
Provider:
WailyWang