AVCaps: An audio-visual dataset with modality-specific captions
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14536324
Resource description:
The AVCaps dataset is an audio-visual captioning resource designed to advance research in multimodal machine perception. Derived from the VidOR dataset, it features 2061 video clips spanning a total of 28.8 hours.
For each clip, the dataset provides:
Audio Captions: Up to 5 textual captions describing only the audio content, crowdsourced from annotators.
Visual Captions: Up to 5 textual captions focusing solely on the visual content, annotated without access to audio.
Audio-Visual Captions: Up to 5 captions describing the combined audio and visual content, capturing multimodal interactions.
GPT-4 Generated Captions: Three additional audio-visual captions per clip, synthesized from the crowdsourced captions.
AVCaps is a resource for researchers working on tasks such as multimodal captioning, audio-visual alignment, and video content understanding. By providing both modality-specific and combined annotations, it enables fine-grained studies of how the audio and visual modalities interact and align.
The video clips are provided in three ZIP files:
train_videos.zip: 1661 training clips.
val_videos.zip: 200 validation clips.
test_videos.zip: 200 testing clips.
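The split sizes above add up to the stated total (1661 + 200 + 200 = 2061). A minimal sketch for checking downloaded archives against these numbers follows. It assumes each ZIP contains only video files (every non-directory entry is counted as one clip) and that the archives sit in the current directory; `count_clips` and `EXPECTED_CLIPS` are hypothetical names introduced for illustration.

```python
import zipfile
from pathlib import Path

# Clip counts per archive, as stated in the dataset description.
EXPECTED_CLIPS = {
    "train_videos.zip": 1661,
    "val_videos.zip": 200,
    "test_videos.zip": 200,
}


def count_clips(zip_path):
    """Count non-directory entries in a ZIP archive.

    Assumption: every file in the archive is a video clip.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())


if __name__ == "__main__":
    for name, expected in EXPECTED_CLIPS.items():
        path = Path(name)
        if not path.exists():
            print(f"{name}: not found, skipping")
            continue
        found = count_clips(path)
        status = "ok" if found == expected else "mismatch"
        print(f"{name}: {found} clips (expected {expected}) {status}")
```

If an archive also ships metadata files or a nested folder layout, the count would need a filter on file extension, which the description does not specify.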
The captions are available in three JSON files:
train_captions.json
val_captions.json
test_captions.json
Each JSON file contains entries with video filenames as keys, and the corresponding values include audio captions, visual captions, audio-visual captions, and LLM-generated audio-visual captions.
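As a rough sketch of how such a file might be inspected: the description gives the top-level layout (video filenames as keys) but not the exact field names inside each entry, so the code below discovers them instead of hard-coding them. It assumes each value is a JSON object; `summarize_captions` is a hypothetical helper, not part of the dataset's tooling.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_captions(path):
    """Return (number of clips, Counter of field names seen across entries)."""
    with Path(path).open(encoding="utf-8") as f:
        captions = json.load(f)  # assumed: a dict keyed by video filename

    field_counts = Counter()
    for entry in captions.values():
        # Field names are not documented in the description, so they are
        # collected from the data; non-dict values are skipped.
        if isinstance(entry, dict):
            field_counts.update(entry.keys())
    return len(captions), field_counts


if __name__ == "__main__":
    path = Path("train_captions.json")
    if path.exists():
        n_clips, fields = summarize_captions(path)
        print(f"{n_clips} clips")
        for field, count in fields.items():
            print(f"  {field}: present in {count} entries")
```

Running this on each split shows which caption fields exist and whether every clip carries all of them, which is useful to know before writing a loader that depends on specific keys.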
Created:
2024-12-20



