Name: VILMA Vision-Language Manipulation Dataset
Creator: Zenodo
Published: 2026-05-06 08:04:47
License: No description available

Download link:

https://zenodo.org/doi/10.5281/zenodo.19708162

Resource description:

DATASET DESCRIPTION

This repository contains the VILMA (VIsion Language MAnipulation) dataset, created by the CAIR (Cognitive Artificial Intelligence and Robotics) research group at the CYENS Centre of Excellence. This work was supported by the euROBIN project under the 3rd Open Call for Technology Exchange Programme (Project: "VILMA - Advancing Robotic Manipulation: A Handheld Gripper and Vision-Language Dataset").

This dataset presents a comprehensive collection of multimodal recordings capturing real-world household manipulation tasks performed using handheld grippers. It is designed to support research at the intersection of robotics, embodied AI, and human–robot interaction by providing synchronized sensory, visual, and semantic data streams that reflect the complexity of everyday manipulation.

The dataset combines natural language instructions (originally spoken and transcribed to text), egocentric and gripper-mounted video, tracking data, and complementary sensory modalities including depth maps and inter-finger distance measurements. Together, these modalities enable fine-grained analysis of both motion dynamics and task intent, facilitating learning across perception, control, and language grounding.

A key feature of the dataset is the diversity and variability of manipulation tasks. These range from coarse, force-dominant interactions (e.g., opening a refrigerator) to precision-driven actions requiring delicate control (e.g., charging a phone). The dataset further captures a spectrum of coordination patterns, including bimanual tasks as well as left- and right-handed execution styles, offering valuable insight into motor strategies and adaptability.

To reflect realistic deployment conditions, the dataset incorporates multiple layers of difficulty. Tasks are performed in both clean and cluttered environments, with varying levels of object occlusion and the presence of obstacles.
In addition, sequences include human interventions that intentionally disrupt task execution, such as removing or displacing objects, introducing unexpected perturbations that challenge robustness and recovery.

Beyond task variation, the dataset emphasizes diversity in context. It includes a wide range of objects, arrangements, surface types, and environmental settings, spanning different locations, lighting conditions, and scene configurations. Multiple participants contribute to the recordings, introducing natural variability in behavior, execution style, and interaction strategies.

Overall, this dataset provides a realistic and challenging benchmark for studying multimodal perception, manipulation, and decision-making in unstructured environments, with particular emphasis on robustness, adaptability, and human-centered variability.

DATASET ORGANIZATION

vilma_dataset.h5
├── /tasks_info
│   ├── /C01
│   │   ├── task_family : str
│   │   └── /variants
│   │       ├── /V01.1
│   │       │   └── task_instruction : str   # from specific spoken instruction
│   │       ├── /V01.2                       # [same structure as /V01.1]
│   │       └── ...                          # more variants per task family
│   ├── /C02                                 # [same structure as /C01]
│   └── ...                                  # more task families
│
└── /data
    ├── /D_C01
    │   ├── /D_C01.01
    │   │     @participant_id : str          # e.g. P01
    │   │     @task_id        : str          # e.g. C01
    │   │     @variant_id     : str          # e.g. V01.1
    │   │     @location       : str          # e.g. cyens_lab, inria_lab, airbnb1_kitchen, etc.
    │   │   └── /repetitions
    │   │       ├── /R_C01.01.01
    │   │       │   ├── /repetition_info
    │   │       │   │   └── unimanual_or_bimanual : str
    │   │       │   │
    │   │       │   └── /sensors_data
    │   │       │       ├── /head_camera
    │   │       │       │   ├── rgb_video_path : str
    │   │       │       │   └── depth_video_path : str
    │   │       │       └── /grippers
    │   │       │           ├── /right_gripper
    │   │       │           │   ├── /tracking
    │   │       │           │   │   ├── position : (T, 3) float32
    │   │       │           │   │   └── orientation : (T, 3) float32
    │   │       │           │   ├── rgb_video_path : str
    │   │       │           │   ├── depth_video_path : str
    │   │       │           │   └── finger_distance_cm : (N,) float32
    │   │       │           └── /left_gripper   # [same structure as /right_gripper]
    │   │       │
    │   │       ├── /R_C01.01.02
    │   │       └── ...                      # more repetitions
    │   │
    │   ├── /D_C01.02
    │   └── ...                              # more data
    ├── /D_C02
    │   ├── /D_C02.01
    │   └── ...                              # more data
    └── ...                                  # for all task families

The dataset is stored in a hierarchical HDF5 file (vilma_dataset.h5), designed for scalability, clarity, and efficient access. It is divided into two main components: task definitions and recorded data.

1. Task Definitions (/tasks_info)

- Task Families (Cxx): Each task family represents a high-level activity (e.g., “open fridge”, “charge phone”), capturing a common manipulation objective.
- Task Variants (Vxx.x): Each task family contains multiple variants, reflecting differences in instructions, object configurations, or execution styles. Each variant includes a natural language instruction, derived from spoken input.

This separation enables consistent mapping between abstract task definitions and concrete executions.

2. Recorded Data (/data)

This section contains all recorded demonstrations, organized hierarchically:

- Task-Level Grouping (D_Cxx): Data is grouped by task family for efficient retrieval.
- Demonstrations (D_Cxx.xx): Each demonstration corresponds to a specific execution instance and includes metadata:
  - Participant identifier
  - Task and variant identifiers
  - Environment/location (e.g., lab, kitchen, apartment)
- Repetitions (R_Cxx.xx.xx): Each demonstration may include multiple repetitions of the same task, capturing variability in execution. Repetition-level metadata specifies whether the task is unimanual or bimanual (possible values: "unimanual_left", "unimanual_right", "bimanual").
- Sensor Data (/sensors_data): Each repetition contains synchronized multimodal recordings:
  - Head Camera: RGB and depth video paths
  - Grippers (Left/Right):
    - RGB and depth video paths
    - 6-DoF tracking data (position and orientation over time)
    - Finger distance measurements

All RGB videos and depth maps are included in this repository, divided into zip files (D_Cxx.zip). To create the directory of RGB videos and depth maps, extract the zip files into a new folder named data.

STATISTICS

The dataset comprises 464 demonstrations totalling 2h38m24s over 11 tasks (2 unimanual only, 4 bimanual only, 5 having both unimanual and bimanual demonstrations) under varying conditions over 10 locations, recorded from 11 users. More statistics follow; in each table, Uni-R, Uni-L, and Bi count the unimanual right-handed, unimanual left-handed, and bimanual demonstrations.

Task Family                                         Uni-R  Uni-L   Bi  Count  Duration (s)
Fold clothes and stack.                                 0      0   49     49       2159.56
Charge a phone.                                         0      0   32     32       1382.17
Open dishwasher, place or take out item, and close.     0      0   27     27       1036.00
Hang a towel.                                           5      0   53     58        899.89
Spray and wipe a small area.                            0      0   24     24        898.32
Open drawer and stow an item.                           0     30   30     60        798.00
Open fridge, place item, and close the fridge.          1      5   42     48        708.72
Prepare a glass of water.                               0     17   30     47        683.13
Place a cup on coaster.                                 0     44    0     44        416.57
Throw trash in the bin.                                27     23   10     60        370.77
Open washing machine, place clothes, and close.        15      0    0     15        149.82

Location             Uni-R  Uni-L   Bi  Count  Duration (s)
cyens_lab               19     41   92    152       2931.72
inria_lab                0      0   78     78       2821.44
airbnb1                 28     33    4     65        715.59
airbnb3_kitchen          0      0   46     46        847.41
airbnb2_livingroom       0     17   22     39       1059.57
airbnb2_kitchen          0     23    7     30        345.79
cyens_kitchen_small      1      1   22     24        239.63
airbnb3_bathroom         0      0   13     13        193.15
airbnb3_fridge           0      4    5      9         57.58
airbnb3_livingroom       0      0    8      8        291.87

Participant  Uni-R  Uni-L   Bi  Count  Duration (s)
P02             33     71  133    237       3551.35
P01             15     46   42    103       2279.18
P03              0      2   45     47        884.42
P04              0      0   20     20        589.20
P08              0      0   14     14        569.37
P09              0      0   14     14        468.38
P10              0      0    9      9        495.33
P11              0      0    9      9        278.19
P12              0      0    8      8        230.90
P13              0      0    2      2        125.86
P06              0      0    1      1         31.57

CODE AND HARDWARE DOCUMENTATION

The code and hardware documentation in this deposit are a static archive. For the latest updates, issues, and readable assembly guides, please visit the official GitHub repository at https://github.com/CYENS/VILMA

CONTACT

Dr Vassilis Vassiliades (v.vassiliades@cyens.org.cy)
Cognitive Artificial Intelligence and Robotics (CAIR) research group
CYENS Centre of Excellence

ACKNOWLEDGMENT

This work is supported by euROBIN, the European ROBotics and AI Network (Grant agreement 101070596), funded by the European Commission.
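
EXAMPLE: READING ONE REPETITION (ILLUSTRATIVE SKETCH)

The HDF5 layout described under DATASET ORGANIZATION can be navigated with a few lines of Python. The sketch below is not part of the official VILMA code; it is a minimal illustration using the third-party h5py package. The helper names (repetition_path, read_text, load_repetition) are hypothetical. It assumes that identifiers use two-digit, zero-padded indices as in the examples C01, D_C01.01, and R_C01.01.01, and, because the description does not say whether string fields are stored as HDF5 attributes or as scalar datasets, the reader helper tries both.

```python
from pathlib import Path


def repetition_path(task: int, demo: int, rep: int) -> str:
    """Return the HDF5 group path of one repetition.

    Assumption: indices are two-digit and zero-padded, as in the examples
    C01, D_C01.01 and R_C01.01.01 shown in the dataset tree.
    """
    demo_id = f"C{task:02d}.{demo:02d}"
    return f"/data/D_C{task:02d}/D_{demo_id}/repetitions/R_{demo_id}.{rep:02d}"


def read_text(node, name: str) -> str:
    """Read a string stored as an attribute or as a scalar dataset of node."""
    value = node.attrs[name] if name in node.attrs else node[name][()]
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def load_repetition(h5_file: Path, task: int, demo: int, rep: int,
                    side: str = "right") -> dict:
    """Collect the instruction, handedness and one gripper's signals."""
    import h5py  # third-party (pip install h5py); imported lazily

    with h5py.File(h5_file, "r") as f:
        rep_group = f[repetition_path(task, demo, rep)]
        # Two levels up is /data/D_Cxx/D_Cxx.xx, which carries the @ metadata.
        demo_group = rep_group.parent.parent
        task_id = read_text(demo_group, "task_id")
        variant_id = read_text(demo_group, "variant_id")
        variant = f["tasks_info"][task_id]["variants"][variant_id]
        gripper = rep_group["sensors_data/grippers"][f"{side}_gripper"]
        return {
            "instruction": read_text(variant, "task_instruction"),
            "handedness": read_text(rep_group["repetition_info"],
                                    "unimanual_or_bimanual"),
            "position": gripper["tracking/position"][()],             # (T, 3)
            "orientation": gripper["tracking/orientation"][()],       # (T, 3)
            "finger_distance_cm": gripper["finger_distance_cm"][()],  # (N,)
            "rgb_video_path": read_text(gripper, "rgb_video_path"),
        }
```

For instance, load_repetition(Path("vilma_dataset.h5"), 1, 1, 1) would read the group /data/D_C01/D_C01.01/repetitions/R_C01.01.01. The tree lists the tracking arrays with length T and the finger distance with length N, so the sketch returns them separately rather than assuming they are sampled at the same rate. The description does not state what the unused gripper group contains for a unimanual repetition, so it may be worth checking the handedness value before choosing a side.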

Application scenarios: