InternVid Full Collection

Name: InternVid Full Collection
Creator: shepshep
Published: 2026-05-24 05:30:58
License: No description available

OpenDataLab | Updated 2026-05-24 | Indexed 2024-06-08

Download link:

https://opendatalab.org.cn/shepshep/InternVidFull

Resource overview:

InternVid is a large-scale, video-centric multimodal dataset for learning robust and transferable video-text representations for multimodal understanding and generation. It contains over 7 million videos with a total duration of nearly 760,000 hours, yielding 234 million video clips accompanied by detailed captions totaling 4.1 billion words. Our core contribution is a scalable method that uses large language models to autonomously construct a high-quality video-text dataset, together with a demonstration of its effectiveness for large-scale video-language representation learning. Specifically, we adopt a multi-scale approach to generate video-relevant captions. We also introduce ViCLIP, a video-text representation learning model based on ViT-L. Trained on InternVid via contrastive learning, this model achieves leading zero-shot action recognition and competitive video retrieval performance.

Beyond basic video understanding tasks such as recognition and retrieval, the dataset and model have a wide range of applications. They are particularly useful for generating interleaved video-text data to train video-centric dialogue systems, and for advancing research in video-to-text and text-to-video generation. These resources give researchers and practitioners interested in multimodal video understanding and generation a practical tool.
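The headline totals above imply some rough per-clip averages. The sketch below works them out, assuming the rounded figures quoted in the description and assuming that the 234 million clips together account for the full 760,000 hours; the function and variable names are illustrative only.

```python
# Figures quoted in the description (rounded, so the results are approximate).
TOTAL_HOURS = 760_000          # "nearly 760,000 hours"
NUM_CLIPS = 234_000_000        # "234 million video clips"
NUM_VIDEOS = 7_000_000         # "over 7 million videos"
TOTAL_WORDS = 4_100_000_000    # "4.1 billion words"


def derived_stats(total_hours, num_clips, num_videos, total_words):
    """Return rough per-clip and per-video averages from the headline totals."""
    return {
        "avg_clip_seconds": total_hours * 3600 / num_clips,
        "avg_words_per_caption": total_words / num_clips,
        "avg_clips_per_video": num_clips / num_videos,
    }


stats = derived_stats(TOTAL_HOURS, NUM_CLIPS, NUM_VIDEOS, TOTAL_WORDS)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions, a clip lasts roughly 11.7 seconds on average and a caption runs to about 17.5 words, which gives a sense of the granularity of the annotations.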

Provided by:

shepshep

Created:

2024-06-03

Collected summary

Dataset introduction

Background and challenges

Background overview

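InternVid is a large-scale video-text multimodal dataset designed for learning video-text representations. It contains over 7 million videos and 234 million clips, with a total duration of nearly 760,000 hours and detailed captions totaling 4.1 billion words. The dataset supports multimodal understanding and generation tasks such as zero-shot action recognition, video retrieval, and research on video-to-text and text-to-video generation, and its high-quality data is built with a scalable, language-model-based method. The released version, InternVid-230M, provides the ID, timestamps, caption, and associated scores for each video clip, and is intended for researchers and practitioners working in video AI.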
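Since each released record carries a clip ID, timestamps, a caption, and scores, a common first step is to select clips by score and duration. The following is a minimal sketch of that idea. The JSON Lines layout, the field names (`video_id`, `start`, `end`, `caption`, `score`), the timestamp format, and the thresholds are all assumptions made for illustration, not the documented schema of InternVid-230M.

```python
import json

# Hypothetical sample in JSON Lines form. The field names and values are
# illustrative assumptions, not the documented schema of the release.
SAMPLE_JSONL = """\
{"video_id": "vid_a", "start": "00:00:01.000", "end": "00:00:09.500", "caption": "a dog runs on a beach", "score": 0.41}
{"video_id": "vid_b", "start": "00:01:10.000", "end": "00:01:12.250", "caption": "a person slices bread", "score": 0.18}
{"video_id": "vid_c", "start": "00:00:30.000", "end": "00:00:45.000", "caption": "cars pass on a highway", "score": 0.33}
"""


def to_seconds(timestamp):
    """Convert an HH:MM:SS.mmm timestamp string to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def select_clips(lines, min_score, min_seconds):
    """Keep clips whose score and duration both meet the thresholds."""
    selected = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        duration = to_seconds(record["end"]) - to_seconds(record["start"])
        if record["score"] >= min_score and duration >= min_seconds:
            selected.append((record["video_id"], round(duration, 3), record["caption"]))
    return selected


clips = select_clips(SAMPLE_JSONL.splitlines(), min_score=0.3, min_seconds=4.0)
for video_id, duration, caption in clips:
    print(video_id, duration, caption)
```

With the sample above, only the first and third records pass both thresholds. For the real files, the field names and score semantics should be checked against the release documentation before reusing this filter.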

The content above was collected and summarized by 遇见数据集.
