AVCaps: An audio-visual dataset with modality-specific captions

Indexed from NIAID Data Ecosystem on 2026-05-02

Download link:

https://zenodo.org/record/14536324

Description:

The AVCaps dataset is an audio-visual captioning resource designed to advance research in multimodal machine perception. Derived from the VidOR dataset, it features 2061 video clips spanning a total of 28.8 hours.

For each clip, the dataset provides:

- Audio Captions: Up to 5 textual captions describing only the audio content, crowdsourced from annotators.
- Visual Captions: Up to 5 textual captions focusing solely on the visual content, annotated without access to audio.
- Audio-Visual Captions: Up to 5 captions describing the combined audio and visual content, capturing multimodal interactions.
- GPT-4 Generated Captions: Three additional audio-visual captions per clip, synthesized from the crowdsourced captions.

AVCaps is a valuable resource for researchers working on tasks such as multimodal captioning, audio-visual alignment, and video content understanding. By providing separate and combined modality-specific annotations, it enables fine-grained studies in the interaction and alignment of audio and visual modalities.

The video clips are provided in three ZIP files:

- train_videos.zip: 1661 training clips.
- val_videos.zip: 200 validation clips.
- test_videos.zip: 200 testing clips.

The captions are available in three JSON files:

- train_captions.json
- val_captions.json
- test_captions.json

Each JSON file contains entries with video filenames as keys, and the corresponding values include audio captions, visual captions, audio-visual captions, and LLM-generated audio-visual captions.
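The caption files described above can be explored with a short script. The following is a minimal Python sketch, not the dataset authors' tooling. It assumes each JSON file is a dictionary keyed by video filename whose values are dictionaries of caption fields. The description does not state the exact field names, so the code reads them from the data, and the names and caption texts in the sample (`audio_captions`, `clip_0001.mp4`, and so on) are hypothetical placeholders.

```python
import json
from collections import Counter
from pathlib import Path


def load_captions(path):
    """Load one caption file, e.g. train_captions.json, into a dict."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def caption_counts(captions):
    """Count captions per field across all clips.

    Field names are taken from the data rather than hard-coded,
    because the description does not give the exact key names.
    Assumes every value is a dict of caption fields.
    """
    counts = Counter()
    for fields in captions.values():
        for name, value in fields.items():
            # A field may hold a list of captions or a single string.
            counts[name] += len(value) if isinstance(value, list) else 1
    return counts


# Hypothetical sample mirroring the described layout; the key names
# and caption texts are illustrative only, not taken from AVCaps.
sample = {
    "clip_0001.mp4": {
        "audio_captions": ["A dog barks twice.", "Barking in the distance."],
        "visual_captions": ["A dog runs across a lawn."],
        "audio_visual_captions": ["A dog barks while running across a lawn."],
        "llm_audio_visual_captions": ["A barking dog runs over grass."],
    }
}

print(dict(caption_counts(sample)))
```

With the real files, `caption_counts(load_captions("train_captions.json"))` would show the actual field names and how many captions each field holds, which is a quick way to confirm the layout before writing a data loader.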

Created:

2024-12-20
