ngqtrung/full-modality-video-caption
Source: Hugging Face. Updated 2025-10-21; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/ngqtrung/full-modality-video-caption
Description:
The Full Modality Video Caption Dataset is a large-scale multimodal video dataset with vision, audio, and integrated captions. It contains 55,940 video segments, each 10 seconds long, with three types of captions: vision captions (generated by GPT-4o), audio captions (generated by Qwen3-Omni-30B-A3B-Captioner), and video captions (integrated multimodal descriptions generated by Qwen3-Omni-30B-A3B-Instruct). The dataset is distributed in WebDataset format and includes the video files together with metadata in JSON format.
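
Since the dataset ships as WebDataset shards (tar archives in which files sharing a basename form one sample), a minimal sketch of reading one shard with only the Python standard library might look like the following. The listing does not give the shard layout, so the file names (`clip_0001.mp4`, `clip_0001.json`) and the JSON field names (`vision_caption`, `audio_caption`, `video_caption`) are hypothetical placeholders; the demo shard is built in memory, so nothing is downloaded.

```python
import io
import json
import tarfile


def iter_samples(fileobj):
    """Yield (key, {extension: bytes}) for each sample in a WebDataset-style tar.

    Files are grouped by the part of the basename before the first dot,
    which is the usual WebDataset convention. Members of one sample are
    assumed to be stored consecutively in the archive.
    """
    current_key, sample = None, {}
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            dirname, _, base = member.name.rpartition("/")
            stem, _, ext = base.partition(".")
            key = f"{dirname}/{stem}" if dirname else stem
            # A new key means the previous sample is complete.
            if key != current_key and sample:
                yield current_key, sample
                sample = {}
            current_key = key
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield current_key, sample


def build_demo_shard():
    """Build a tiny in-memory shard. All names and fields here are hypothetical."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in (1, 2):
            meta = {
                "vision_caption": "a demo vision caption",
                "audio_caption": "a demo audio caption",
                "video_caption": "a demo integrated caption",
            }
            files = {
                f"clip_{i:04d}.mp4": b"\x00fake-video-bytes",
                f"clip_{i:04d}.json": json.dumps(meta).encode("utf-8"),
            }
            for name, payload in files.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for key, sample in iter_samples(build_demo_shard()):
        captions = json.loads(sample["json"])
        print(key, len(sample["mp4"]), "video bytes;", sorted(captions))
```

For a real shard, the same `iter_samples` function could be given a file opened in binary mode; the actual extensions and metadata keys should be checked against the dataset card before relying on them.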
Provider:
ngqtrung



