Multimodal Human Video Dataset
Source: arXiv. Indexed 2025-09-30.
Download link:
https://chain-of-modality.github.io
Description:
This dataset contains videos of humans performing manipulation tasks, paired with muscle activity and audio signals, so that robots can learn task plans and control parameters from them. It is used to evaluate how various vision-language models (VLMs) perform on tasks such as pressing a block, inserting a plug, drumming, and opening a bottle. The dataset covers multiple tasks, each with 10 test videos that vary in objects and camera viewpoints. It targets both multimodal video analysis and real-world robot evaluation.



