Download link:

https://modelscope.cn/datasets/lerobot/xvla-soft-fold

Resource overview:

This dataset was created using [LeRobot](https://github.com/huggingface/lerobot).

## Dataset Description

- **Repository:** [X-VLA](https://thu-air-dream.github.io/X-VLA/)
- **License:** Apache 2.0
- **Paper:** *Zheng et al., 2025, “X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model”* ([arXiv:2510.10274](https://arxiv.org/pdf/2510.10274))

## Dataset Structure

[meta/info.json](meta/info.json):

```json
{
  "codebase_version": "v3.0",
  "robot_type": "franka",
  "total_episodes": 1542,
  "total_frames": 2852512,
  "total_tasks": 1,
  "chunks_size": 1000,
  "data_files_size_in_mb": 100,
  "video_files_size_in_mb": 500,
  "fps": 20,
  "splits": { "train": "0:1542" },
  "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet",
  "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4",
  "features": {
    "observation.images.cam_high": {
      "dtype": "video",
      "shape": [480, 640, 3],
      "names": ["height", "width", "rgb"],
      "info": {
        "video.height": 480,
        "video.width": 640,
        "video.codec": "av1",
        "video.pix_fmt": "yuv420p",
        "video.is_depth_map": false,
        "video.fps": 20,
        "video.channels": 3,
        "has_audio": false
      }
    },
    "observation.images.cam_left_wrist": {
      "dtype": "video",
      "shape": [480, 640, 3],
      "names": ["height", "width", "rgb"],
      "info": {
        "video.height": 480,
        "video.width": 640,
        "video.codec": "av1",
        "video.pix_fmt": "yuv420p",
        "video.is_depth_map": false,
        "video.fps": 20,
        "video.channels": 3,
        "has_audio": false
      }
    },
    "observation.images.cam_right_wrist": {
      "dtype": "video",
      "shape": [480, 640, 3],
      "names": ["height", "width", "rgb"],
      "info": {
        "video.height": 480,
        "video.width": 640,
        "video.codec": "av1",
        "video.pix_fmt": "yuv420p",
        "video.is_depth_map": false,
        "video.fps": 20,
        "video.channels": 3,
        "has_audio": false
      }
    },
    "observation.state": {
      "dtype": "float32",
      "shape": [96],
      "names": [
        "eef_euler_0", "eef_euler_1", "eef_euler_2", "eef_euler_3", "eef_euler_4", "eef_euler_5", "eef_euler_6",
        "eef_euler_7", "eef_euler_8", "eef_euler_9", "eef_euler_10", "eef_euler_11", "eef_euler_12", "eef_euler_13",
        "eef_quat_0", "eef_quat_1", "eef_quat_2", "eef_quat_3", "eef_quat_4", "eef_quat_5", "eef_quat_6", "eef_quat_7",
        "eef_quat_8", "eef_quat_9", "eef_quat_10", "eef_quat_11", "eef_quat_12", "eef_quat_13", "eef_quat_14", "eef_quat_15",
        "eef6d_0", "eef6d_1", "eef6d_2", "eef6d_3", "eef6d_4", "eef6d_5", "eef6d_6", "eef6d_7", "eef6d_8", "eef6d_9",
        "eef6d_10", "eef6d_11", "eef6d_12", "eef6d_13", "eef6d_14", "eef6d_15", "eef6d_16", "eef6d_17", "eef6d_18", "eef6d_19",
        "eef_left_time", "eef_right_time",
        "qpos_0", "qpos_1", "qpos_2", "qpos_3", "qpos_4", "qpos_5", "qpos_6",
        "qpos_7", "qpos_8", "qpos_9", "qpos_10", "qpos_11", "qpos_12", "qpos_13",
        "qvel_0", "qvel_1", "qvel_2", "qvel_3", "qvel_4", "qvel_5", "qvel_6",
        "qvel_7", "qvel_8", "qvel_9", "qvel_10", "qvel_11", "qvel_12", "qvel_13",
        "effort_0", "effort_1", "effort_2", "effort_3", "effort_4", "effort_5", "effort_6",
        "effort_7", "effort_8", "effort_9", "effort_10", "effort_11", "effort_12", "effort_13",
        "qpos_left_time", "qpos_right_time"
      ]
    },
    "action": {
      "dtype": "float32",
      "shape": [14],
      "names": {
        "motors": [
          "joint_action_0", "joint_action_1", "joint_action_2", "joint_action_3", "joint_action_4",
          "joint_action_5", "joint_action_6", "joint_action_7", "joint_action_8", "joint_action_9",
          "joint_action_10", "joint_action_11", "joint_action_12", "joint_action_13"
        ]
      }
    },
    "time_stamp": {
      "dtype": "float32",
      "shape": [1],
      "names": { "values": ["global_timestamp"] }
    },
    "timestamp": { "dtype": "float32", "shape": [1], "names": null },
    "frame_index": { "dtype": "int64", "shape": [1], "names": null },
    "episode_index": { "dtype": "int64", "shape": [1], "names": null },
    "index": { "dtype": "int64", "shape": [1], "names": null },
    "task_index": { "dtype": "int64", "shape": [1], "names": null }
  }
}
```

## Note

The action labels in this dataset have an issue: the left- and right-arm actions are not perfectly synchronized. We recommend re-interpolating them using the provided timestamps. Please refer to our official code for the [guidance](https://github.com/2toinf/X-VLA/blob/ae1d91f7581af39f080b33f71af72ddcac3457e1/datasets/domain_handler/real_world.py#L53).

## Citation

If you find this dataset helpful to your project, please kindly cite us.

**BibTeX:**

```bibtex
@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}
```
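The re-interpolation recommended in the Note can be sketched roughly as below. This is a minimal illustration and not the official X-VLA implementation (see the linked `real_world.py` for that). It makes several assumptions that the dataset card does not confirm: that action dimensions 0-6 belong to the left arm and 7-13 to the right arm, that `qpos_left_time` and `qpos_right_time` in `observation.state` are the timestamps to use, that those timestamps are strictly increasing within an episode, and that linear interpolation is acceptable. The names `resample_arm` and `resync_actions` are hypothetical helpers introduced only for this example.

```python
import numpy as np

# Column layout of observation.state, rebuilt from the names in meta/info.json.
STATE_NAMES = (
    [f"eef_euler_{i}" for i in range(14)]
    + [f"eef_quat_{i}" for i in range(16)]
    + [f"eef6d_{i}" for i in range(20)]
    + ["eef_left_time", "eef_right_time"]
    + [f"qpos_{i}" for i in range(14)]
    + [f"qvel_{i}" for i in range(14)]
    + [f"effort_{i}" for i in range(14)]
    + ["qpos_left_time", "qpos_right_time"]
)
assert len(STATE_NAMES) == 96  # matches "shape": [96] in info.json

# ASSUMPTION: these two state columns are the per-arm timestamps for the actions.
LEFT_T = STATE_NAMES.index("qpos_left_time")
RIGHT_T = STATE_NAMES.index("qpos_right_time")


def resample_arm(values, src_t, dst_t):
    """Linearly interpolate every column of `values` from src_t onto dst_t."""
    values = np.asarray(values, dtype=np.float64)
    columns = [np.interp(dst_t, src_t, values[:, j]) for j in range(values.shape[1])]
    return np.stack(columns, axis=1)


def resync_actions(action, state, left_dims=slice(0, 7), right_dims=slice(7, 14)):
    """Resample both arms of ONE episode onto a shared time grid.

    action: (T, 14) array, state: (T, 96) array.
    ASSUMPTION: the default left/right split of the action vector is a guess.
    np.interp requires each arm's timestamps to be increasing.
    """
    action = np.asarray(action, dtype=np.float64)
    state = np.asarray(state, dtype=np.float64)
    t_left, t_right = state[:, LEFT_T], state[:, RIGHT_T]

    # Shared grid: the interval covered by both arms, same frame count as input.
    t_start = max(t_left[0], t_right[0])
    t_end = min(t_left[-1], t_right[-1])
    grid = np.linspace(t_start, t_end, num=len(action))

    out = np.empty_like(action)
    out[:, left_dims] = resample_arm(action[:, left_dims], t_left, grid)
    out[:, right_dims] = resample_arm(action[:, right_dims], t_right, grid)
    return grid, out
```

Here the shared grid is one possible choice. Using one arm's timestamps as the reference and interpolating only the other arm would keep the original frame alignment for the reference arm; which option matches the official pipeline should be checked against the linked code.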

Application scenarios: