WailyWang/VCapAV
Hosted on Hugging Face. Updated 2025-11-28; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/WailyWang/VCapAV
Resource description:
---
license: mit
task_categories:
- text-to-audio
language:
- en
tags:
- audio
- environmental sound
- deepfake
- TTA
- V2A
size_categories:
- 100K<n<1M
---
# Dataset Card for VCapAV
VCapAV is a large-scale audio-visual deepfake detection dataset focused on **non-speech environmental sounds**. It introduces new multimodal deepfake scenarios using both **Text-to-Audio (TTA)** and **Video-to-Audio (V2A)** pipelines, together with **Text-to-Video (TTV)** synthesis.
The dataset contains **90,990 clips, totaling 252.75 hours**, and supports audio-only, visual-only, and audio-visual detection tasks.
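As a quick sanity check on these figures, the short sketch below derives the average clip length from the clip count and total duration quoted above. It uses only the two numbers from this card; the variable names are illustrative.

```python
# Figures quoted in the dataset card.
NUM_CLIPS = 90_990
TOTAL_HOURS = 252.75

# Convert the total duration to seconds and divide by the number of clips.
total_seconds = TOTAL_HOURS * 3600
avg_clip_seconds = total_seconds / NUM_CLIPS

print(f"{avg_clip_seconds:.2f} s per clip on average")  # prints "10.00 s per clip on average"
```

The result is a mean only; the card does not state whether every clip has the same length.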
### Dataset Description
VCapAV addresses the lack of multimodal deepfake data involving environmental sounds. Unlike existing datasets focused on speech or face-centric manipulations, VCapAV introduces a comprehensive set of environmental audio generation methods and high-fidelity video forgeries.
- **Curated by:** Duke Kunshan University, University of Yamanashi, Wuhan University
- **Funded by:** DKU Foundation Project “Emerging AI Technologies for Natural Language Processing”
- **Shared by:** Authors of the VCapAV paper
- **Language(s):** English (captions)
- **License:** MIT License
### Dataset Sources
- **Repository:** https://github.com/wailywang/VCapAV/
- **Paper:** [*VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset*](https://www.isca-archive.org/interspeech_2025/wang25q_interspeech.html)
- **Demo:** https://vcapav.github.io/
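
For readers who want to fetch the files programmatically, here is a minimal sketch using `snapshot_download` from the `huggingface_hub` library. The repository ID `WailyWang/VCapAV` is taken from this page; the helper name `download_vcapav` and the `./VCapAV` target directory are illustrative assumptions, and the file layout inside the repository is not described in this card.

```python
REPO_ID = "WailyWang/VCapAV"  # repository ID as shown on this page
REPO_TYPE = "dataset"         # the repository is a dataset, not a model


def download_vcapav(local_dir="./VCapAV", allow_patterns=None):
    """Download a snapshot of the dataset repository and return its local path.

    `local_dir` is a hypothetical target directory. `allow_patterns` can
    restrict the download to matching files (glob patterns); the patterns to
    use depend on the repository's actual layout, which is not documented here.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type=REPO_TYPE,
        local_dir=local_dir,
        allow_patterns=allow_patterns,
    )
```

Calling `download_vcapav()` downloads the whole repository, which may be large. To go through the mirror linked on this page, the usual approach is to set the `HF_ENDPOINT` environment variable to `https://hf-mirror.com` before importing `huggingface_hub`.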
### Dataset Uses
- Audio anti-spoofing research
- Audio-visual deepfake detection
- Evaluation of general-purpose audio generation methods
- Studying modality consistency between vision and sound
- Research on multimodal synchronization, scene-aware generation, and cross-modal alignment
### Dataset Creation
Most deepfake datasets focus on speech or human faces. VCapAV fills this gap by focusing on **general environmental audio** and **video–audio consistency**, enabling research on non-speech deepfake detection.
The dataset is constructed from a subset of **VGGSound** comprising 15,446 videos.
### Citation
```bibtex
@inproceedings{wang2025vcapav,
title={VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset},
author={Wang, Yuxi and Wang, Yikang and Zhang, Qishan and Nishizaki, Hiromitsu and Li, Ming},
booktitle={Interspeech},
year={2025}
}
```
---
Provided by:
WailyWang



