Name: VITRA-VLA/VITRA-1M
Creator: VITRA-VLA
Published: 2025-12-03 17:23:34
License: MIT

Download link:

https://hf-mirror.com/datasets/VITRA-VLA/VITRA-1M

Resource description:

---
license: mit
tags:
- Embodied-AI
- Robotic manipulation
- Vision-Language-Action model
- Human Video
- Dexterous Hand
datasets:
- ego4d
- epic
- egoexo4d
- ssv2
preview: false
size_categories:
- 1M<n<10M
tasks:
- Robotics
- Hand reconstruction
- Video segmentation
modalities:
- 3D
- Text
language:
- en
arxiv:
- 2510.21571
---

<div align="center">
<span style="font-size:32px;">VITRA-1M: Human Hand V-L-A Dataset</span>
</div>

<p align="center">
<a href="https://arxiv.org/abs/2510.21571"><img src='https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv&logoColor=white' alt='arXiv'></a>
<a href='https://microsoft.github.io/VITRA/'><img src='https://img.shields.io/badge/Project_Page-Website-green?logo=googlechrome&logoColor=white' alt='Project Page'></a>
<a href="https://github.com/microsoft/VITRA"> <img src="https://img.shields.io/badge/Code-GitHub-181717?logo=github&logoColor=white" alt="Code Repository"> </a>
<a href='https://huggingface.co/VITRA-VLA/VITRA-VLA-3B'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
</p>

## Dataset Summary

VITRA-1M is a large-scale Human Hand Visual-Language-Action (V-L-A) dataset constructed as described in the paper [Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos](https://arxiv.org/abs/2510.21571). It contains **1.2 million short episodes** with segmented language annotations, camera parameters (corrected intrinsics/extrinsics), and 3D hand reconstructions (left and right hands) based on the MANO hand model. Each episode is stored as a single `*.npy` metadata file.

**Project page:** [https://microsoft.github.io/VITRA/](https://microsoft.github.io/VITRA/)

**Note:** Current metadata has been manually inspected with an estimated annotation accuracy of around 90%. Future versions will improve metadata quality.

---

## Dataset Contents & Size

* **Annotation folder:** `{dataset_name}.tar.gz` in `root/`.
* **Statistics folder:** `statistics/{dataset_name}_angle_statistics.json` contains dataset statistics.
* **Intrinsics folder:** `intrinsics/{dataset_name}` contains the intrinsics of videos in Ego4d and Egoexo4d.

**Episode counts per dataset:**

| Dataset                    | Number of episodes |
| -------------------------- | ------------------ |
| ego4d_cooking_and_cleaning | 454,244            |
| ego4d_other                | 494,439            |
| epic                       | 154,464            |
| egoexo4d                   | 67,053             |
| ssv2                       | 52,718             |

**Extraction instructions:**

```bash
tar -xzvf ego4d_cooking_and_cleaning.tar.gz
tar -xzvf ego4d_other.tar.gz
tar -xzvf egoexo4d.tar.gz
tar -xzvf ssv2.tar.gz
tar -xzvf epic.tar.gz
```

After extraction, the structure is as follows:

```
Dataset_root/
├── intrinsics/
│   ├── {dataset_name}
│   └── ...
├── statistics/
├── {dataset_name}/
│   ├── episode_frame_index.npz
│   └── episodic_annotations/
│       ├── {dataset_name}_{video_name}_ep_{000000}.npy
│       ├── {dataset_name}_{video_name}_ep_{000001}.npy
│       └── ...
├── {dataset_name}.tar.gz
└── ...
```

Each `*.npy` loads as a Python `dict` (`episode_info`) with detailed episode metadata.
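
The directory layout above lends itself to a small helper for enumerating and loading episodes. The following is a minimal sketch, assuming the archives have already been extracted under a common root; the function names `list_episode_files` and `load_episode` are hypothetical and are not part of the VITRA codebase.

```python
from pathlib import Path

import numpy as np


def list_episode_files(dataset_root, dataset_name):
    """Return the sorted episode annotation paths of one extracted sub-dataset.

    Assumes the layout {dataset_root}/{dataset_name}/episodic_annotations/*.npy
    shown in the directory tree above.
    """
    anno_dir = Path(dataset_root) / dataset_name / "episodic_annotations"
    return sorted(anno_dir.glob(f"{dataset_name}_*_ep_*.npy"))


def load_episode(path):
    """Load one episode file; each file stores a single pickled Python dict."""
    # allow_pickle=True is required because the array wraps a dict object.
    return np.load(path, allow_pickle=True).item()
```

For instance, `load_episode(list_episode_files("Dataset_root", "ssv2")[0])` would return the metadata dict of the first `ssv2` episode. Because `allow_pickle=True` can execute arbitrary code during unpickling, it is best reserved for files obtained from a trusted source.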
---

## Usage

For detailed usage instructions and examples, please refer to the official documentation: [VITRA Data Usage Guide](https://github.com/microsoft/ViTra/data/data.md)

---

Example loading:

```python
import numpy as np

episode_info = np.load('.../episodic_annotations/{dataset_name}_{video_name}_ep_000000.npy', allow_pickle=True).item()
```

The detailed structure of the ``episode_info`` is as follows:

```
episode_info (dict)  # Metadata for a single V-L-A episode
├── 'video_clip_id_segment': list[int]  # Deprecated
├── 'extrinsics': np.ndarray  # (Tx4x4) World2Cam extrinsic matrix
├── 'intrinsics': np.ndarray  # (3x3) Camera intrinsic matrix
├── 'video_decode_frame': list[int]  # Frame indices in the original raw video (starting from 0)
├── 'video_name': str  # Original raw video name
├── 'avg_speed': float  # Average wrist movement per frame (in meters)
├── 'total_rotvec_degree': float  # Total camera rotation over the episode (in degrees)
├── 'total_transl_dist': float  # Total camera translation distance over the episode (in meters)
├── 'anno_type': str  # Annotation type, specifying the primary hand action considered when segmenting the episode
├── 'text': (dict)  # Textual descriptions for the episode
│   ├── 'left': List[(str, (int, int))]  # Each entry contains (description, (start_frame_in_episode, end_frame_in_episode))
│   └── 'right': List[(str, (int, int))]  # Same structure for the right hand
├── 'text_rephrase': (dict)  # Rephrased textual descriptions from GPT-4
│   ├── 'left': List[(List[str], (int, int))]  # Each entry contains (list of rephrased descriptions, (start_frame_in_episode, end_frame_in_episode))
│   └── 'right': List[(List[str], (int, int))]  # Same as above for the right hand
├── 'left' (dict)  # Left hand 3D pose info
│   ├── 'beta': np.ndarray  # (10) MANO hand shape parameters (based on the MANO_RIGHT model)
│   ├── 'global_orient_camspace': np.ndarray  # (Tx3x3) Hand wrist rotations from MANO's canonical space to camera space
│   ├── 'global_orient_worldspace': np.ndarray  # (Tx3x3) Hand wrist rotations from MANO's canonical space to world space
│   ├── 'hand_pose': np.ndarray  # (Tx15x3x3) Local hand joints rotations (based on the MANO_RIGHT model)
│   ├── 'transl_camspace': np.ndarray  # (Tx3) Hand wrist translation in camera space
│   ├── 'transl_worldspace': np.ndarray  # (Tx3) Hand wrist translation in world space
│   ├── 'kept_frames': list[int]  # (T) 0–1 mask of valid left-hand reconstruction frames
│   ├── 'joints_camspace': np.ndarray  # (Tx21x3) 3D hand joint positions in camera space
│   ├── 'joints_worldspace': np.ndarray  # (Tx21x3) 3D joint positions in world space
│   ├── 'wrist': np.ndarray  # Deprecated
│   ├── 'max_translation_movement': float  # Deprecated
│   ├── 'max_wrist_rotation_movement': float  # Deprecated
│   └── 'max_finger_joint_angle_movement': float  # Deprecated
└── 'right' (dict)  # Right hand 3D pose info (same structure as 'left')
    ├── 'beta': np.ndarray
    ├── 'global_orient_camspace': np.ndarray
    ├── 'global_orient_worldspace': np.ndarray
    ├── 'hand_pose': np.ndarray
    ├── 'transl_camspace': np.ndarray
    ├── 'transl_worldspace': np.ndarray
    ├── 'kept_frames': list[int]
    ├── 'joints_camspace': np.ndarray
    ├── 'joints_worldspace': np.ndarray
    ├── 'wrist': np.ndarray
    ├── 'max_translation_movement': float
    ├── 'max_wrist_rotation_movement': float
    └── 'max_finger_joint_angle_movement': float
```

---

## Citation

```
@article{li2025vitra,
  title = {Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos},
  journal = {arXiv preprint arXiv:2510.21571},
  author = {Qixiu Li and Yu Deng and Yaobo Liang and Lin Luo and Lei Zhou and Chengtang Yao and Lingqi Zeng and Zhiyuan Feng and Huizhi Liang and Sicheng Xu and Yizhong Zhang and Xi Chen and Hao Chen and Lily Sun and Dong Chen and Jiaolong Yang and Baining Guo},
  year = {2025}
}
```

---

## License

This dataset is released under the MIT License.
---

## Acknowledgements

Thanks to Ego4D, Epic-Kitchens, EgoExo4D, and Something-Something V2 for raw video data; thanks to the MANO hand model contributors.
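
As an illustration of how the `episode_info` fields documented in the Usage section fit together, the sketch below projects a hand's 3D joints from camera space into pixel coordinates and keeps only frames flagged as valid by `kept_frames`. It assumes an undistorted pinhole camera and that `intrinsics` applies directly to `joints_camspace`; neither assumption is confirmed by the dataset card, and the function name `project_hand_joints` is hypothetical.

```python
import numpy as np


def project_hand_joints(episode_info, hand="right"):
    """Project valid-frame 3D joints (camera space) to 2D pixel coordinates.

    Assumption: an undistorted pinhole model, [u, v, 1]^T ~ K @ [x, y, z]^T.
    Returns (uv, frame_ids): uv has shape (T_valid, 21, 2) and frame_ids
    holds the in-episode indices of the frames that were kept.
    """
    K = np.asarray(episode_info["intrinsics"], dtype=np.float64)  # (3, 3)
    joints = np.asarray(episode_info[hand]["joints_camspace"], dtype=np.float64)  # (T, 21, 3)
    kept = np.asarray(episode_info[hand]["kept_frames"]).astype(bool)  # (T,)

    valid = joints[kept]                  # drop frames without a valid reconstruction
    proj = valid @ K.T                    # row-vector form of K @ X for every joint
    uv = proj[..., :2] / proj[..., 2:3]   # perspective divide by depth
    return uv, np.flatnonzero(kept)
```

Calling `project_hand_joints(episode_info, hand="left")` gives the same result for the left hand. Overlaying the points on video would additionally require decoding the raw frames listed in `video_decode_frame`, which this sketch does not attempt.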
