EgoVid

Name: EgoVid
Creator: maas
Published: 2026-05-23 17:10:00
License: No description available

ModelScope community; updated 2026-05-23, indexed 2024-11-16

Download link:

https://modelscope.cn/datasets/iic/EgoVid

Description:

## Dataset Description

EgoVid is a meticulously curated, high-quality action-video dataset designed specifically for egocentric video generation. It encompasses 5 million egocentric video clips and includes detailed action annotations, such as fine-grained kinematic control and high-level text descriptions. Furthermore, it incorporates robust data cleansing strategies to ensure frame consistency, action coherence, and motion smoothness under egocentric conditions.

![EgoVid](./asset/data.jpg)

![EgoVid](./asset/data.gif)

### Data Annotation and Cleaning

To simulate ego-view videos from egocentric actions, we construct detailed and accurate action annotations for each video segment, encompassing low-level kinematic control (e.g., ego-view translation and rotation) as well as high-level textual descriptions. Data quality also significantly influences the effectiveness of training generative models. Prior works have explored various cleansing strategies to improve video datasets, focusing on aesthetics, semantic coherence, and optical flow magnitude. Building on these strategies, this paper presents a cleansing pipeline designed specifically for egocentric scenarios.

![EgoVid](./asset/clean_ann.jpg)

## Data Preparation

### Source Data Downloading

Please refer to the [Ego4D official set](https://ego4d-data.org/#download) to download the source videos. Only the source videos are needed, so you can skip the other metadata, and you can specify the video resolution when downloading (1080P: 7.1TB, 540P: 3.5TB). Note that this repo contains only the action annotations (kinematic and text) and the cleaning metadata.

### Data Structure

#### Source Ego4D Videos

```
Ego4D
├── v1/
├── v2/
│   ├── video/
│   │   ├── 0a02a1ed-a327-4753-b270-e95298984b96.mp4
│   │   ├── ...
│   ├── video_540ss/ (Optional)
│   │   ├── 0a02a1ed-a327-4753-b270-e95298984b96.mp4
│   │   ├── ...
```

#### CSV File Information

The key columns in the csv files are:

```
- video_id: VideoID_StartFrame_EndFrame, where VideoID is the filename of the source video, and StartFrame and EndFrame are the start and end frame indices of the video clip.
- frame_num: number of frames
- fps: frames per second
- noun_cls: the Noun class of the action description.
- verb_cls: the Verb class of the action description.
- llava_cap: the detailed caption of the video clip (annotated by LLaVA-Video).
- name: the annotated high-level text action description (summarized by Qwen).
- flow_mean: the averaged optical flow magnitude of the video clip.
- flow_0_4: the ratio of optical flow magnitude within the range [0, 4].
- flow_4_8: the ratio of optical flow magnitude within the range [4, 8].
- flow_8_12: the ratio of optical flow magnitude within the range [8, 12].
- flow_12_16: the ratio of optical flow magnitude within the range [12, 16].
- flow_16_: the ratio of optical flow magnitude larger than 16.
- ti_sim: the CLIP similarity between 4 frames and the action description (split by ',').
- ii_sim: the CLIP similarity between the first frame and another 3 frames (split by ',').
- dover_score: the DOVER score of the video clip.
- egovideo_score: the EgoVideo score of the video clip and the action description.
```

Below are the special columns in egovid-kinematic.csv and egovid-val.csv. Note that there are [known issues](https://ego4d-data.org/docs/data/imu/) in the raw IMU data, so we recommend using the pose annotations (poses.zip) instead.

```
gyro_x: the imu gyroscope data, x-axis
gyro_y: the imu gyroscope data, y-axis
gyro_z: the imu gyroscope data, z-axis
accl_x: the imu accelerometer data, x-axis
accl_y: the imu accelerometer data, y-axis
accl_z: the imu accelerometer data, z-axis
```

#### Poses File

poses.zip contains the kinematic poses of the ego-view camera.
```
unzip poses.zip
```

The file structure is as follows:

```
poses
├── 0a47c74a-dad9-42d5-b937-0f375490f034_0_162/
│   ├── cost.txt (The cost of matching ParticleSfM poses and IMU poses, the lower the better)
│   ├── intri.npy (Camera intrinsics with shape [3, 3], which is calculated based on the 540 resolution)
│   ├── sfm_pose.npy (Camera extrinsics calculated by ParticleSfM (already scaled), shape [120(Frame num), 4, 4])
│   ├── imu_pose.npy (Camera extrinsics calculated by IMU (already transformed to the camera coordinate))
│   ├── fused_pose.npy (Camera extrinsics calculated by Kalman Filter (recommended))
├── 0a47c74a-dad9-42d5-b937-0f375490f034_2730_2892/
│   ├── ...
```

## Acknowledgement

Many thanks to these excellent projects:

- [Ego4D](https://ego4d-data.org/)
- [VBench](https://vchitect.github.io/VBench-project/)
- [EgoVideo](https://github.com/OpenGVLab/EgoVideo)
- [DOVER](https://github.com/VQAssessment/DOVER)
- [Qwen](https://huggingface.co/Qwen)
- [RAFT](https://github.com/princeton-vl/RAFT)
- [LLaVA-Video](https://huggingface.co/collections/lmms-lab/llava-video-661e86f5e8dabc3ff793c944)
- [ParticleSfM](https://github.com/bytedance/particle-sfm)
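As a rough illustration of the conventions listed under "CSV File Information", the sketch below splits a `video_id` into its source video and frame range, parses the comma-separated `ti_sim` field, and applies a toy quality filter. The sample rows, the threshold values, and the helper names (`parse_video_id`, `parse_sims`, `keep_clip`, `load_clips`) are assumptions made for this example and are not part of the dataset release; only the column names and the `Ego4D/v2/video/` layout come from the description above.

```python
import csv
import io

# Hypothetical sample rows; all numeric values are made up for illustration.
SAMPLE_CSV = """video_id,frame_num,fps,flow_mean,ti_sim,dover_score
0a47c74a-dad9-42d5-b937-0f375490f034_0_162,120,30,6.5,"0.28,0.30,0.27,0.29",0.61
0a47c74a-dad9-42d5-b937-0f375490f034_2730_2892,120,30,0.4,"0.11,0.12,0.10,0.13",0.35
"""


def parse_video_id(video_id):
    """Split 'VideoID_StartFrame_EndFrame' into (VideoID, start, end)."""
    # Split from the right so the VideoID part is returned unchanged.
    source_id, start, end = video_id.rsplit("_", 2)
    return source_id, int(start), int(end)


def parse_sims(field):
    """Parse a comma-separated similarity field such as ti_sim or ii_sim."""
    return [float(x) for x in field.split(",")]


def keep_clip(row, min_flow=1.0, min_ti_sim=0.2, min_dover=0.5):
    """Toy filter; the thresholds are arbitrary assumptions, not official values."""
    sims = parse_sims(row["ti_sim"])
    ti_mean = sum(sims) / len(sims)
    return (
        float(row["flow_mean"]) >= min_flow
        and ti_mean >= min_ti_sim
        and float(row["dover_score"]) >= min_dover
    )


def load_clips(csv_text):
    """Return the clips that pass the filter, with the source video path resolved."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not keep_clip(row):
            continue
        source_id, start, end = parse_video_id(row["video_id"])
        kept.append({
            "source_video": f"Ego4D/v2/video/{source_id}.mp4",
            "start_frame": start,
            "end_frame": end,
            "frame_num": int(row["frame_num"]),
        })
    return kept


clips = load_clips(SAMPLE_CSV)
print(clips)
```

With the real files, the same functions could be pointed at an opened csv file instead of the in-memory sample; suitable thresholds would need to be chosen by inspecting the score distributions.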
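The pose files described under "Poses File" might be consumed roughly as follows. This is a minimal sketch assuming NumPy is installed. The helper names `load_clip_poses` and `relative_poses` are hypothetical, `fused_pose.npy` is assumed to share the `[N, 4, 4]` shape stated for `sfm_pose.npy`, and each 4x4 matrix is assumed to be a camera-to-world transform, a convention the description above does not state. To stay runnable without the dataset, the demo uses a synthetic trajectory in place of a real pose file.

```python
import os

import numpy as np


def load_clip_poses(clip_dir):
    """Load the intrinsics and the recommended fused extrinsics for one clip."""
    intri = np.load(os.path.join(clip_dir, "intri.npy"))       # shape [3, 3]
    fused = np.load(os.path.join(clip_dir, "fused_pose.npy"))  # assumed [N, 4, 4]
    return intri, fused


def relative_poses(poses):
    """Frame-to-frame motion, computed as inv(T_i) @ T_{i+1}.

    Assumes poses[i] is a 4x4 camera-to-world matrix (an assumption,
    not something confirmed by the dataset description).
    """
    poses = np.asarray(poses, dtype=np.float64)
    # np.linalg.inv and @ both operate on stacks of matrices.
    return np.linalg.inv(poses[:-1]) @ poses[1:]


# Synthetic stand-in for fused_pose.npy: the camera moves 0.1 units
# per frame along the x-axis with no rotation.
demo = np.tile(np.eye(4), (5, 1, 1))
demo[:, 0, 3] = 0.1 * np.arange(5)

rel = relative_poses(demo)
translations = rel[:, :3, 3]  # per-step translation in the previous frame
print(rel.shape)
print(translations)
```

For a real clip, `load_clip_poses("poses/<video_id>")` would supply the array passed to `relative_poses`; if the stored matrices turn out to be world-to-camera, the inverse should be applied to the other operand.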

Provider:

maas

Created:

2024-11-12
