EgoVid-5M

Name: EgoVid-5M
Creator: Alibaba, Institute of Automation of the Chinese Academy of Sciences, Tsinghua University, University of Chinese Academy of Sciences
Published: 2024-11-13 15:05:40
License: Not specified

Source: arXiv. Updated 2024-11-13. Indexed 2024-11-15.

Download link:

https://egovid.github.io


Resource overview:

EgoVid-5M is a large-scale, high-resolution video-action dataset jointly created by Alibaba, the Institute of Automation of the Chinese Academy of Sciences, Tsinghua University, and the University of Chinese Academy of Sciences, designed specifically for first-person (egocentric) video generation. It contains 5 million video clips at 1080p resolution, covering household environments, outdoor settings, office activities, sports, and skilled operations. A careful data cleaning and annotation pipeline is used to ensure frame consistency, action coherence, and motion smoothness. EgoVid-5M aims to advance applications in virtual reality (VR), augmented reality (AR), and gaming, and to address the challenges that dynamic viewpoints, complex actions, and diverse scenes pose for first-person video generation.

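Provided by:

Alibaba, Institute of Automation of the Chinese Academy of Sciences, Tsinghua University, University of Chinese Academy of Sciences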

Created:

2024-11-13

Summary

Dataset Introduction

Construction

EgoVid-5M is meticulously curated from the Ego4D dataset, which initially contains thousands of hours of egocentric videos intended for perception tasks. To adapt these videos for generative training, a sophisticated data annotation pipeline is established, providing detailed and accurate annotations of fine-grained kinematic control and high-level action descriptions. Additionally, a robust data cleaning pipeline is implemented to ensure frame consistency, action coherence, and motion smoothness under egocentric conditions. This pipeline includes stringent criteria for alignment between action descriptions and video content, magnitude of motion, and frame-to-frame consistency.
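The three cleaning criteria named above (text-video alignment, motion magnitude, and frame-to-frame consistency) can be pictured as a simple metadata filter. The following is a minimal sketch only: the field names, score ranges, and thresholds are hypothetical and are not taken from the EgoVid-5M release, and treating motion as an allowed band (neither static nor excessive) is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ClipMeta:
    """Per-clip cleaning metadata. All field names are hypothetical."""
    clip_id: str
    text_video_alignment: float  # assumed score in [0, 1]; higher = caption matches video better
    motion_magnitude: float      # assumed scalar, e.g. a mean optical-flow magnitude
    frame_consistency: float     # assumed score in [0, 1]; higher = more stable frames


def passes_cleaning(
    clip: ClipMeta,
    *,
    min_alignment: float = 0.25,   # hypothetical threshold
    min_motion: float = 1.0,       # hypothetical: reject near-static clips
    max_motion: float = 20.0,      # hypothetical: reject extreme camera shake
    min_consistency: float = 0.8,  # hypothetical threshold
) -> bool:
    """Return True only if the clip satisfies all three criteria."""
    return (
        clip.text_video_alignment >= min_alignment
        and min_motion <= clip.motion_magnitude <= max_motion
        and clip.frame_consistency >= min_consistency
    )


def filter_clips(clips: Iterable[ClipMeta]) -> List[ClipMeta]:
    """Keep the clips that pass every cleaning criterion, preserving order."""
    return [c for c in clips if passes_cleaning(c)]
```

Keeping the thresholds as keyword arguments makes it easy to trade dataset size against quality, which is the kind of tuning a cleaning pipeline of this sort typically needs.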

Features

EgoVid-5M has three key features: (1) High quality: 5 million egocentric videos at 1080p resolution, curated through rigorous data cleaning. (2) Comprehensive scene coverage: household environments, outdoor settings, office activities, sports, and skilled operations. (3) Detailed and precise annotations: extensive behavioral annotations, divided into fine-grained kinematic control and high-level action descriptions, aligned with the video content.

Usage

EgoVid-5M is designed to enhance the training of egocentric video generation models. Researchers can utilize this dataset to train various architectures such as U-Net and DiT, leveraging the detailed annotations and robust data cleaning strategies. The dataset supports both Image+Text-to-Video and Text-to-Video tasks, providing a comprehensive resource for advancing research in egocentric video generation. Additionally, the associated action annotations and data cleansing metadata are made publicly available to facilitate further exploration and development in this domain.
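To make the difference between the two supported tasks concrete, the sketch below builds a training sample for either Text-to-Video or Image+Text-to-Video from one annotated clip. It is illustrative only: `TrainingSample`, `build_sample`, and the task keys `"t2v"` and `"it2v"` are hypothetical names, and using the first frame as the image condition while keeping the full clip as the target is an assumed convention, not something the dataset documentation specifies.

```python
from dataclasses import dataclass
from typing import Generic, List, Optional, Sequence, TypeVar

F = TypeVar("F")  # frame type; left generic so the sketch needs no image library


@dataclass
class TrainingSample(Generic[F]):
    """One conditioning/target pair. Hypothetical structure."""
    prompt: str                   # high-level action description
    image_condition: Optional[F]  # None for Text-to-Video
    target_frames: List[F]        # frames the model learns to generate


def build_sample(frames: Sequence[F], prompt: str, task: str) -> TrainingSample[F]:
    """Build a sample for task "t2v" (text only) or "it2v" (first frame + text)."""
    if not frames:
        raise ValueError("a clip needs at least one frame")
    if task == "t2v":
        return TrainingSample(prompt, None, list(frames))
    if task == "it2v":
        # Assumption: condition on the first frame and keep the whole clip as target.
        return TrainingSample(prompt, frames[0], list(frames))
    raise ValueError(f"unknown task: {task!r}")
```

The same clip and caption serve both tasks; only the conditioning changes, which is why a single annotated dataset can support both settings.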

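Background and Challenges

Background Overview

In video generation, egocentric video generation, which is centered on the human point of view, has attracted considerable attention for its potential to enhance applications such as virtual reality (VR), augmented reality (AR), and gaming. However, the dynamic nature of egocentric viewpoints, the complexity and diversity of actions, and the wide variety of scenes make generating high-quality egocentric video very difficult, and existing datasets do not adequately address these challenges. To this end, Xiaofeng Wang and colleagues introduced EgoVid-5M in 2024, the first high-quality dataset curated specifically for egocentric video generation. It contains 5 million egocentric video clips with detailed action annotations, including fine-grained kinematic control and high-level textual descriptions. A sophisticated data cleaning pipeline ensures frame consistency, action coherence, and motion smoothness. EgoVid-5M aims to advance research and applications in egocentric video generation.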

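Current Challenges

Building EgoVid-5M involved several challenges. First, the dynamic viewpoints and complex actions of egocentric video make annotation and cleaning especially difficult. Second, dataset quality depends on close alignment between video content and action descriptions, and on smooth frame-to-frame motion. Third, low-quality clips must be filtered out to improve trained models while data diversity is preserved. Finally, because training egocentric video generation models requires large amounts of high-quality data, building and cleaning the dataset consumes substantial compute and time.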

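Common Use Cases

Classic Use Cases

EgoVid-5M's classic use cases are in virtual reality (VR), augmented reality (AR), and gaming. By providing detailed action annotations, including fine-grained kinematic control and high-level textual descriptions, the dataset lets researchers train models that generate high-quality first-person video. Such models can be used to create more immersive and interactive user experiences, advancing these fields.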

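Academic Problems Addressed

EgoVid-5M addresses research problems common in first-person video generation, such as the complexity of dynamic viewpoints, the diversity of actions, and the complexity of scenes. By providing high-quality videos and detailed annotations, it enables researchers to develop more accurate and realistic video generation models, advancing research in related areas.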

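Derived Related Work

Researchers have built related work on EgoVid-5M, such as the EgoDreamer model. EgoDreamer drives first-person video generation with action descriptions and kinematic control signals simultaneously, significantly improving the quality and realism of the generated video. The dataset has also supported other related research, such as the evaluation and optimization of video generation models and multimodal data fusion techniques.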

The content above was collected and summarized by 遇见数据集.
