ut-vision/EgoHaFL

Name: ut-vision/EgoHaFL
Creator: ut-vision
Published: 2025-11-28 04:25:08
License: MIT

Source: Hugging Face. Updated 2025-11-28; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/ut-vision/EgoHaFL


Resource description:

---
dataset_info:
  features:
  - name: uid
    dtype: string
  - name: video_id
    dtype: string
  - name: start_second
    dtype: float32
  - name: end_second
    dtype: float32
  - name: caption
    dtype: string
  - name: fx
    dtype: float32
  - name: fy
    dtype: float32
  - name: cx
    dtype: float32
  - name: cy
    dtype: float32
  - name: vid_w
    dtype: int32
  - name: vid_h
    dtype: int32
  - name: annotation
    list:
    - name: mano_params
      struct:
      - name: global_orient
        list: float32
      - name: hand_pose
        list: float32
      - name: betas
        list: float32
    - name: is_right
      dtype: bool
    - name: keypoints_3d
      list: float32
    - name: keypoints_2d
      list: float32
    - name: vertices
      list: float32
    - name: box_center
      list: float32
    - name: box_size
      dtype: float32
    - name: camera_t
      list: float32
    - name: focal_length
      list: float32
  splits:
  - name: train
    num_examples: 241912
  - name: test
    num_examples: 5108
configs:
- config_name: default
  data_files:
  - split: train
    path: EgoHaFL_train.csv
  - split: test
    path: EgoHaFL_test.csv
license: mit
language:
- en
pretty_name: EgoHaFL:Egocentric 3D Hand Forecasting Dataset with Language Instruction
size_categories:
- 200K<n<300K
tags:
- embodied-ai
- robotic
- egocentric
- 3d-hand
- forecasting
- hand-pose
---

# **EgoHaFL: Egocentric 3D Hand Forecasting Dataset with Language Instruction**

**EgoHaFL** is a dataset designed for **egocentric (first-person) 3D hand forecasting** with accompanying **natural language instructions**. It contains short video clips, text descriptions, camera intrinsics, and detailed MANO-based 3D hand annotations. The dataset supports research in **3D hand forecasting**, **hand pose estimation**, **hand–object interaction understanding**, and **video–language modeling**.


![Demo GIF](EgoHaFL.gif)

[![Paper](https://img.shields.io/badge/Paper-B31B1B?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/pdf/2511.18127)
[![Model](https://img.shields.io/badge/Model-FF6D00?style=for-the-badge&logo=huggingface&logoColor=ffffff)](https://huggingface.co/ut-vision/SFHand)
[![GitHub](https://img.shields.io/badge/GitHub-000000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/ut-vision/SFHand)

---

## 📦 **Dataset Contents**

### **1. Metadata CSV Files**

* `EgoHaFL_train.csv`
* `EgoHaFL_test.csv`

Each row corresponds to one sample and contains:

| Field            | Description                                |
| ---------------- | ------------------------------------------ |
| `uid`            | Unique sample identifier                   |
| `video_id`       | Source video identifier                    |
| `start_second`   | Start time of the clip (seconds)           |
| `end_second`     | End time of the clip (seconds)             |
| `caption`        | Natural language instruction / description |
| `fx`, `fy`       | Camera focal lengths                       |
| `cx`, `cy`       | Principal point                            |
| `vid_w`, `vid_h` | Original video resolution                  |

---

### **2. 3D Hand Annotations (EgoHaFL_lmdb)**

The folder `EgoHaFL_lmdb` stores all 3D annotations in **LMDB format**.

* **Key**: `uid`
* **Value**: a **list of length 16**, representing uniformly sampled frames across a **3-second video segment**.

Each of the 16 elements is a dictionary containing:

* `mano_params`
  * `global_orient (n, 1, 3, 3)`
  * `hand_pose (n, 15, 3, 3)`
  * `betas (n, 10)`
* `is_right (n,)`
* `keypoints_3d (n, 21, 3)`
* `keypoints_2d (n, 21, 2)`
* `vertices (n, 778, 3)`
* `box_center (n, 2)`
* `box_size (n,)`
* `camera_t (n, 3)` *3D hand position in camera coordinates*
* `focal_length (n, 2)`

Here, `n` denotes the number of hands present in a frame; it may vary from frame to frame. When no hands are detected in a frame, that frame's dictionary is empty.

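
The relationship between a metadata row and its 16-frame annotation list can be sketched in plain Python. This is only an illustration under stated assumptions: the CSV row and annotation values are synthetic, the helpers `parse_row`, `frame_times`, `summarize_frame`, and `fake_hand_frame` are hypothetical names introduced here, nested lists stand in for the array shapes listed above, and the timestamps assume the 16 frames span the clip with both endpoints included (the card only says "uniformly sampled"). Decoding the actual LMDB values is not shown; the dataloader in the SFHand repository is the reference for that.

```python
import csv
import io

# Synthetic stand-in for one row of EgoHaFL_train.csv; every value is made up.
SAMPLE_CSV = """uid,video_id,start_second,end_second,caption,fx,fy,cx,cy,vid_w,vid_h
sample-0001,video-0001,12.0,15.0,pick up the cup,1000.0,1000.0,960.0,540.0,1920,1080
"""

FLOAT_FIELDS = ("start_second", "end_second", "fx", "fy", "cx", "cy")
INT_FIELDS = ("vid_w", "vid_h")


def parse_row(row):
    """Convert the numeric CSV columns from strings to float/int."""
    parsed = dict(row)
    for name in FLOAT_FIELDS:
        parsed[name] = float(row[name])
    for name in INT_FIELDS:
        parsed[name] = int(row[name])
    return parsed


def frame_times(start, end, num_frames=16):
    """Evenly spaced timestamps from start to end (endpoints included: an assumption)."""
    step = (end - start) / (num_frames - 1)
    return [start + i * step for i in range(num_frames)]


def summarize_frame(frame):
    """Return (n_hands, n_right, n_left); an empty dict means no hands detected."""
    if not frame:
        return 0, 0, 0
    flags = [bool(flag) for flag in frame["is_right"]]
    n_right = sum(flags)
    return len(flags), n_right, len(flags) - n_right


def fake_hand_frame(is_right_flags):
    """Build a synthetic per-frame dict with n hands (only two fields filled in)."""
    n = len(is_right_flags)
    return {
        "is_right": list(is_right_flags),
        # Shape (n, 21, 3), represented with nested lists instead of arrays.
        "keypoints_3d": [[[0.0, 0.0, 0.0] for _ in range(21)] for _ in range(n)],
    }


# Synthetic annotation: two hands for 8 frames, none for 4, one right hand for 4.
annotation = (
    [fake_hand_frame([True, False])] * 8 + [{}] * 4 + [fake_hand_frame([True])] * 4
)

row = parse_row(next(csv.DictReader(io.StringIO(SAMPLE_CSV))))
times = frame_times(row["start_second"], row["end_second"], len(annotation))
summary = [summarize_frame(frame) for frame in annotation]

for t, (n_hands, n_right, n_left) in zip(times, summary):
    print(f"t={t:6.2f}s hands={n_hands} right={n_right} left={n_left}")
```

Checking for the empty dictionary before indexing any field avoids a `KeyError` on frames without detections, which matters because `n` can change from frame to frame.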

---

## 🌳 **Annotation Structure (Tree View)**

Below is the hierarchical structure for a single annotation entry (`uid → 16-frame list → per-frame dict`):

```
<uid>
└── list (length = 16)
    ├── [0]
    │   ├── mano_params
    │   │   ├── global_orient
    │   │   ├── hand_pose
    │   │   └── betas
    │   ├── is_right
    │   ├── keypoints_3d
    │   ├── keypoints_2d
    │   ├── vertices
    │   ├── box_center
    │   ├── box_size
    │   ├── camera_t
    │   └── focal_length
    ├── [1]
    │   └── ...
    ├── [2]
    │   └── ...
    └── ...
```

---

## 🎥 **Source of Video Data**

The video clips used in **EgoHaFL** originate from the **Ego4D V1** dataset. For our experiments, we use the **original-length videos compressed to 224p resolution** to ensure efficient storage and training.

Official Ego4D website:
🔗 **[https://ego4d-data.org/](https://ego4d-data.org/)**

---

## 🧩 **Example of Use**

For details on how to load and use the EgoHaFL dataset, please refer to the **dataloader implementation** in our GitHub repository:

🔗 **[https://github.com/ut-vision/SFHand](https://github.com/ut-vision/SFHand)**

---

## 🧠 **Supported Research Tasks**

* Egocentric 3D hand forecasting
* Hand motion prediction and trajectory modeling
* 3D hand pose estimation
* Hand–object interaction understanding
* Video–language multimodal modeling
* Temporal reasoning with 3D human hands

---

## 📚 Citation

If you use this dataset or find it helpful in your research, please cite:

```latex
@article{liu2025sfhand,
  title={SFHand: A Streaming Framework for Language-guided 3D Hand Forecasting and Embodied Manipulation},
  author={Liu, Ruicong and Huang, Yifei and Ouyang, Liangyang and Kang, Caixin and Sato, Yoichi},
  journal={arXiv preprint arXiv:2511.18127},
  year={2025}
}
```

Provider:

ut-vision
