VITRA-VLA/VITRA-1M
Source: Hugging Face | Updated 2025-12-03 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/VITRA-VLA/VITRA-1M
Resource description:
---
license: mit
tags:
- Embodied-AI
- Robotic manipulation
- Vision-Language-Action model
- Human Video
- Dexterous Hand
datasets:
- ego4d
- epic
- egoexo4d
- ssv2
preview: false
size_categories:
- 1M<n<10M
tasks:
- Robotics
- Hand reconstruction
- Video segmentation
modalities:
- 3D
- Text
language:
- en
arxiv:
- 2510.21571
---
<div align="center">
<span style="font-size:32px;">VITRA-1M: Human Hand V-L-A Dataset</span>
</div>
<p align="center">
<a href="https://arxiv.org/abs/2510.21571"><img src='https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv&logoColor=white' alt='arXiv'></a>
<a href='https://microsoft.github.io/VITRA/'><img src='https://img.shields.io/badge/Project_Page-Website-green?logo=googlechrome&logoColor=white' alt='Project Page'></a>
<a href="https://github.com/microsoft/VITRA">
<img src="https://img.shields.io/badge/Code-GitHub-181717?logo=github&logoColor=white" alt="Code Repository">
</a>
<a href='https://huggingface.co/VITRA-VLA/VITRA-VLA-3B'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
</p>
## Dataset Summary
VITRA-1M is a large-scale human hand Vision-Language-Action (V-L-A) dataset constructed as described in the paper [Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos](https://arxiv.org/abs/2510.21571). It contains **1.2 million short episodes**, each with segmented language annotations, camera parameters (corrected intrinsics and extrinsics), and 3D reconstructions of the left and right hands based on the MANO hand model. Each episode is stored as a single `*.npy` metadata file.
**Project page:** [https://microsoft.github.io/VITRA/](https://microsoft.github.io/VITRA/)
**Note:** The current metadata has been manually inspected, and its annotation accuracy is estimated at around 90%. Future versions will improve metadata quality.
---
## Dataset Contents & Size
* **Annotation archives:** `{dataset_name}.tar.gz` in `root/`, one archive per source dataset.
* **Statistics folder:** `statistics/{dataset_name}_angle_statistics.json` contains per-dataset statistics.
* **Intrinsics folder:** `intrinsics/{dataset_name}` contains the camera intrinsics of the videos from Ego4D and EgoExo4D.
**Episode counts per dataset:**
| Dataset | Number of episodes |
| -------------------------- | ------------------ |
| ego4d_cooking_and_cleaning | 454,244 |
| ego4d_other | 494,439 |
| epic | 154,464 |
| egoexo4d | 67,053 |
| ssv2 | 52,718 |
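As a quick consistency check, the per-dataset counts in the table can be summed to confirm the "1.2 million episodes" figure. The snippet below is only a sketch that copies the numbers from the table; the dictionary name `episode_counts` is chosen here for illustration.

```python
# Episode counts copied from the table above.
episode_counts = {
    "ego4d_cooking_and_cleaning": 454_244,
    "ego4d_other": 494_439,
    "epic": 154_464,
    "egoexo4d": 67_053,
    "ssv2": 52_718,
}

total = sum(episode_counts.values())
print(f"Total episodes: {total:,}")  # Total episodes: 1,222,918

# Share of each source dataset in the full collection.
for name, count in episode_counts.items():
    print(f"{name:28s} {count / total:6.1%}")
```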
**Extraction instructions:**
```bash
tar -xzvf ego4d_cooking_and_cleaning.tar.gz
tar -xzvf ego4d_other.tar.gz
tar -xzvf egoexo4d.tar.gz
tar -xzvf ssv2.tar.gz
tar -xzvf epic.tar.gz
```
After extraction, the structure is as follows:
```
Dataset_root/
├── intrinsics/
│   ├── {dataset_name}
│   └── ...
├── statistics/
├── {dataset_name}/
│   ├── episode_frame_index.npz
│   └── episodic_annotations/
│       ├── {dataset_name}_{video_name}_ep_{000000}.npy
│       ├── {dataset_name}_{video_name}_ep_{000001}.npy
│       └── ...
├── {dataset_name}.tar.gz
└── ...
```
Each `*.npy` loads as a Python `dict` (`episode_info`) with detailed episode metadata.
---
## Usage
For detailed usage instructions and examples, please refer to the official documentation: [VITRA Data Usage Guide](https://github.com/microsoft/ViTra/data/data.md)
---
Example loading:
```python
import numpy as np
episode_info = np.load('.../episodic_annotations/{dataset_name}_{video_name}_ep_000000.npy', allow_pickle=True).item()
```
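To process more than one episode, the same `np.load` call can be wrapped in a small generator that walks an extracted `episodic_annotations/` folder. This is a minimal sketch: the helper name `iter_episodes` is hypothetical and not part of the VITRA code base, and it assumes the directory layout shown in the extraction section. Because `allow_pickle=True` unpickles arbitrary Python objects, it should only be used on files from a trusted source.

```python
from pathlib import Path

import numpy as np


def iter_episodes(dataset_dir):
    """Yield (path, episode_info) for each episode file of one extracted dataset.

    `dataset_dir` is assumed to be an extracted `{dataset_name}/` folder
    that contains an `episodic_annotations/` subfolder.
    """
    anno_dir = Path(dataset_dir) / "episodic_annotations"
    for path in sorted(anno_dir.glob("*.npy")):
        # Each file stores a pickled dict inside a 0-d object array,
        # hence allow_pickle=True followed by .item().
        yield path, np.load(path, allow_pickle=True).item()


# Example usage (the path is a placeholder):
# for path, episode_info in iter_episodes("Dataset_root/epic"):
#     print(path.name, episode_info["video_name"])
```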
The detailed structure of `episode_info` is as follows:
```
episode_info (dict)                               # Metadata for a single V-L-A episode
├── 'video_clip_id_segment': list[int]            # Deprecated
├── 'extrinsics': np.ndarray                      # (Tx4x4) World2Cam extrinsic matrix
├── 'intrinsics': np.ndarray                      # (3x3) Camera intrinsic matrix
├── 'video_decode_frame': list[int]               # Frame indices in the original raw video (starting from 0)
├── 'video_name': str                             # Original raw video name
├── 'avg_speed': float                            # Average wrist movement per frame (in meters)
├── 'total_rotvec_degree': float                  # Total camera rotation over the episode (in degrees)
├── 'total_transl_dist': float                    # Total camera translation distance over the episode (in meters)
├── 'anno_type': str                              # Annotation type, specifying the primary hand action considered when segmenting the episode
├── 'text': (dict)                                # Textual descriptions for the episode
│   ├── 'left': List[(str, (int, int))]           # Each entry contains (description, (start_frame_in_episode, end_frame_in_episode))
│   └── 'right': List[(str, (int, int))]          # Same structure for the right hand
├── 'text_rephrase': (dict)                       # Rephrased textual descriptions from GPT-4
│   ├── 'left': List[(List[str], (int, int))]     # Each entry contains (list of rephrased descriptions, (start_frame_in_episode, end_frame_in_episode))
│   └── 'right': List[(List[str], (int, int))]    # Same as above for the right hand
├── 'left': (dict)                                # Left hand 3D pose info
│   ├── 'beta': np.ndarray                        # (10) MANO hand shape parameters (based on the MANO_RIGHT model)
│   ├── 'global_orient_camspace': np.ndarray      # (Tx3x3) Hand wrist rotations from MANO's canonical space to camera space
│   ├── 'global_orient_worldspace': np.ndarray    # (Tx3x3) Hand wrist rotations from MANO's canonical space to world space
│   ├── 'hand_pose': np.ndarray                   # (Tx15x3x3) Local hand joints rotations (based on the MANO_RIGHT model)
│   ├── 'transl_camspace': np.ndarray             # (Tx3) Hand wrist translation in camera space
│   ├── 'transl_worldspace': np.ndarray           # (Tx3) Hand wrist translation in world space
│   ├── 'kept_frames': list[int]                  # (T) 0–1 mask of valid left-hand reconstruction frames
│   ├── 'joints_camspace': np.ndarray             # (Tx21x3) 3D hand joint positions in camera space
│   ├── 'joints_worldspace': np.ndarray           # (Tx21x3) 3D joint positions in world space
│   ├── 'wrist': np.ndarray                       # Deprecated
│   ├── 'max_translation_movement': float         # Deprecated
│   ├── 'max_wrist_rotation_movement': float      # Deprecated
│   └── 'max_finger_joint_angle_movement': float  # Deprecated
└── 'right': (dict)                               # Right hand 3D pose info (same structure as 'left')
    ├── 'beta': np.ndarray
    ├── 'global_orient_camspace': np.ndarray
    ├── 'global_orient_worldspace': np.ndarray
    ├── 'hand_pose': np.ndarray
    ├── 'transl_camspace': np.ndarray
    ├── 'transl_worldspace': np.ndarray
    ├── 'kept_frames': list[int]
    ├── 'joints_camspace': np.ndarray
    ├── 'joints_worldspace': np.ndarray
    ├── 'wrist': np.ndarray
    ├── 'max_translation_movement': float
    ├── 'max_wrist_rotation_movement': float
    └── 'max_finger_joint_angle_movement': float
```
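The camera and hand fields listed above can be combined to move joints from world space into camera space and then onto the image plane. The sketch below uses a synthetic episode rather than real data, and it rests on assumptions that the listing does not confirm: a column-vector convention (`x_cam = R @ x_world + t`) for the World2Cam matrices, an undistorted pinhole camera looking along +z, and `kept_frames` used as a per-frame validity mask. The helper names `world_to_cam` and `project` are hypothetical.

```python
import numpy as np


def world_to_cam(points_world, extrinsics):
    """Map points (T, N, 3) to camera space with per-frame World2Cam matrices (T, 4, 4).

    Assumes the column-vector convention x_cam = R @ x_world + t.
    """
    R = extrinsics[:, :3, :3]  # (T, 3, 3) rotation part
    t = extrinsics[:, :3, 3]   # (T, 3) translation part
    return np.einsum("tij,tnj->tni", R, points_world) + t[:, None, :]


def project(points_cam, intrinsics):
    """Project camera-space points (T, N, 3) to pixels (T, N, 2).

    Assumes an undistorted pinhole camera with all points in front of it (z > 0).
    """
    uvw = np.einsum("ij,tnj->tni", intrinsics, points_cam)
    return uvw[..., :2] / uvw[..., 2:3]


# Synthetic stand-in for one episode with T = 4 frames (not real VITRA data).
T = 4
rng = np.random.default_rng(0)
extrinsics = np.tile(np.eye(4), (T, 1, 1))
extrinsics[:, :3, 3] = [0.1, 0.0, 0.5]           # constant camera offset
intrinsics = np.array([[500.0, 0.0, 320.0],
                       [0.0, 500.0, 240.0],
                       [0.0, 0.0, 1.0]])
hand = {
    "joints_worldspace": rng.uniform(-0.1, 0.1, (T, 21, 3)) + [0.0, 0.0, 1.0],
    "kept_frames": [1, 1, 0, 1],                 # frame 2 marked invalid
}

joints_cam = world_to_cam(hand["joints_worldspace"], extrinsics)
valid = np.asarray(hand["kept_frames"], dtype=bool)
pixels = project(joints_cam[valid], intrinsics)  # only valid frames are projected
print(pixels.shape)  # (3, 21, 2)
```

On real episodes, comparing the output of `world_to_cam` against the stored `joints_camspace` for frames where `kept_frames` is 1 is one way to check whether the assumed convention matches the data.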
---
## Citation
```
@article{li2025vitra,
title = {Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos},
journal = {arXiv preprint arXiv:2510.21571},
author={Qixiu Li and Yu Deng and Yaobo Liang and Lin Luo and Lei Zhou and Chengtang Yao and Lingqi Zeng and Zhiyuan Feng and Huizhi Liang and Sicheng Xu and Yizhong Zhang and Xi Chen and Hao Chen and Lily Sun and Dong Chen and Jiaolong Yang and Baining Guo},
year = {2025}
}
```
---
## License
This dataset is released under the MIT License.
---
## Acknowledgements
Thanks to Ego4D, Epic-Kitchens, EgoExo4D, and Something-Something V2 for raw video data; thanks to the MANO hand model contributors.
Provider:
VITRA-VLA



