AVCaps: An audio-visual dataset with modality-specific captions

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14536324
Description:
The AVCaps dataset is an audio-visual captioning resource designed to advance research in multimodal machine perception. Derived from the VidOR dataset, it features 2061 video clips spanning a total of 28.8 hours.

For each clip, the dataset provides:

Audio Captions: Up to 5 textual captions describing only the audio content, crowdsourced from annotators.
Visual Captions: Up to 5 textual captions focusing solely on the visual content, annotated without access to audio.
Audio-Visual Captions: Up to 5 captions describing the combined audio and visual content, capturing multimodal interactions.
GPT-4 Generated Captions: Three additional audio-visual captions per clip, synthesized from the crowdsourced captions.

AVCaps is a valuable resource for researchers working on tasks such as multimodal captioning, audio-visual alignment, and video content understanding. By providing separate and combined modality-specific annotations, it enables fine-grained studies in the interaction and alignment of audio and visual modalities.

The video clips are provided in three ZIP files:

train_videos.zip: 1661 training clips.
val_videos.zip: 200 validation clips.
test_videos.zip: 200 testing clips.

The captions are available in three JSON files:

train_captions.json
val_captions.json
test_captions.json

Each JSON file contains entries with video filenames as keys, and the corresponding values include audio captions, visual captions, audio-visual captions, and LLM-generated audio-visual captions.
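
The JSON layout described above can be explored with a short script. The following is a minimal sketch, not the dataset authors' tooling. It assumes each file is a single JSON object keyed by video filename, and that each value is an object whose fields hold either a list of caption strings or a single string. The description does not state the exact field names, so the code discovers them at load time instead of hard-coding them. The function name `summarize_captions` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_captions(json_path):
    """Return (number of clips, Counter of caption field -> total captions).

    Assumption: the file maps video filenames to objects whose values are
    lists of caption strings (or single strings). Field names are read from
    the data rather than assumed.
    """
    with Path(json_path).open(encoding="utf-8") as f:
        entries = json.load(f)  # expected: {video filename: {field: captions}}

    totals = Counter()
    for fields in entries.values():
        if not isinstance(fields, dict):
            continue  # skip entries that do not match the assumed layout
        for field_name, captions in fields.items():
            if isinstance(captions, list):
                totals[field_name] += len(captions)
            elif isinstance(captions, str):
                totals[field_name] += 1
    return len(entries), totals


if __name__ == "__main__":
    # File names are taken from the dataset description.
    for name in ("train_captions.json", "val_captions.json", "test_captions.json"):
        path = Path(name)
        if path.exists():
            n_clips, totals = summarize_captions(path)
            print(name, n_clips, dict(totals))
```

If the layout matches the assumption, the clip counts reported for the three files should agree with the stated split sizes (1661, 200, and 200, which sum to the 2061 clips in the dataset).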
Created:
2024-12-20