xvla-soft-fold
Source: ModelScope community. Updated: 2026-04-28. Indexed: 2025-12-06.
Download link:
https://modelscope.cn/datasets/lerobot/xvla-soft-fold
Description:
This dataset was created using [LeRobot](https://github.com/huggingface/lerobot).
## Dataset Description
**Repository:** [X-VLA](https://thu-air-dream.github.io/X-VLA/)
**License:** Apache 2.0
**Paper:** *Zheng et al., 2025, “X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model”* ([arXiv:2510.10274](https://arxiv.org/pdf/2510.10274))
## Dataset Structure
[meta/info.json](meta/info.json):
```json
{
"codebase_version": "v3.0",
"robot_type": "franka",
"total_episodes": 1542,
"total_frames": 2852512,
"total_tasks": 1,
"chunks_size": 1000,
"data_files_size_in_mb": 100,
"video_files_size_in_mb": 500,
"fps": 20,
"splits": {
"train": "0:1542"
},
"data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet",
"video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4",
"features": {
"observation.images.cam_high": {
"dtype": "video",
"shape": [
480,
640,
3
],
"names": [
"height",
"width",
"rgb"
],
"info": {
"video.height": 480,
"video.width": 640,
"video.codec": "av1",
"video.pix_fmt": "yuv420p",
"video.is_depth_map": false,
"video.fps": 20,
"video.channels": 3,
"has_audio": false
}
},
"observation.images.cam_left_wrist": {
"dtype": "video",
"shape": [
480,
640,
3
],
"names": [
"height",
"width",
"rgb"
],
"info": {
"video.height": 480,
"video.width": 640,
"video.codec": "av1",
"video.pix_fmt": "yuv420p",
"video.is_depth_map": false,
"video.fps": 20,
"video.channels": 3,
"has_audio": false
}
},
"observation.images.cam_right_wrist": {
"dtype": "video",
"shape": [
480,
640,
3
],
"names": [
"height",
"width",
"rgb"
],
"info": {
"video.height": 480,
"video.width": 640,
"video.codec": "av1",
"video.pix_fmt": "yuv420p",
"video.is_depth_map": false,
"video.fps": 20,
"video.channels": 3,
"has_audio": false
}
},
"observation.state": {
"dtype": "float32",
"shape": [
96
],
"names": [
"eef_euler_0",
"eef_euler_1",
"eef_euler_2",
"eef_euler_3",
"eef_euler_4",
"eef_euler_5",
"eef_euler_6",
"eef_euler_7",
"eef_euler_8",
"eef_euler_9",
"eef_euler_10",
"eef_euler_11",
"eef_euler_12",
"eef_euler_13",
"eef_quat_0",
"eef_quat_1",
"eef_quat_2",
"eef_quat_3",
"eef_quat_4",
"eef_quat_5",
"eef_quat_6",
"eef_quat_7",
"eef_quat_8",
"eef_quat_9",
"eef_quat_10",
"eef_quat_11",
"eef_quat_12",
"eef_quat_13",
"eef_quat_14",
"eef_quat_15",
"eef6d_0",
"eef6d_1",
"eef6d_2",
"eef6d_3",
"eef6d_4",
"eef6d_5",
"eef6d_6",
"eef6d_7",
"eef6d_8",
"eef6d_9",
"eef6d_10",
"eef6d_11",
"eef6d_12",
"eef6d_13",
"eef6d_14",
"eef6d_15",
"eef6d_16",
"eef6d_17",
"eef6d_18",
"eef6d_19",
"eef_left_time",
"eef_right_time",
"qpos_0",
"qpos_1",
"qpos_2",
"qpos_3",
"qpos_4",
"qpos_5",
"qpos_6",
"qpos_7",
"qpos_8",
"qpos_9",
"qpos_10",
"qpos_11",
"qpos_12",
"qpos_13",
"qvel_0",
"qvel_1",
"qvel_2",
"qvel_3",
"qvel_4",
"qvel_5",
"qvel_6",
"qvel_7",
"qvel_8",
"qvel_9",
"qvel_10",
"qvel_11",
"qvel_12",
"qvel_13",
"effort_0",
"effort_1",
"effort_2",
"effort_3",
"effort_4",
"effort_5",
"effort_6",
"effort_7",
"effort_8",
"effort_9",
"effort_10",
"effort_11",
"effort_12",
"effort_13",
"qpos_left_time",
"qpos_right_time"
]
},
"action": {
"dtype": "float32",
"shape": [
14
],
"names": {
"motors": [
"joint_action_0",
"joint_action_1",
"joint_action_2",
"joint_action_3",
"joint_action_4",
"joint_action_5",
"joint_action_6",
"joint_action_7",
"joint_action_8",
"joint_action_9",
"joint_action_10",
"joint_action_11",
"joint_action_12",
"joint_action_13"
]
}
},
"time_stamp": {
"dtype": "float32",
"shape": [
1
],
"names": {
"values": [
"global_timestamp"
]
}
},
"timestamp": {
"dtype": "float32",
"shape": [
1
],
"names": null
},
"frame_index": {
"dtype": "int64",
"shape": [
1
],
"names": null
},
"episode_index": {
"dtype": "int64",
"shape": [
1
],
"names": null
},
"index": {
"dtype": "int64",
"shape": [
1
],
"names": null
},
"task_index": {
"dtype": "int64",
"shape": [
1
],
"names": null
}
}
}
```
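The 96-dimensional `observation.state` vector is a concatenation of named groups, and the `data_path` / `video_path` entries are Python format templates. The sketch below shows one way to turn the listed names into index slices and to expand the templates. The group widths are read off the `names` list above; the group labels `eef_time` and `qpos_time` and the helper functions are hypothetical names introduced here for illustration, not part of LeRobot or X-VLA.

```python
# Group widths read off the "names" list of "observation.state" in meta/info.json.
# "eef_time" and "qpos_time" are hypothetical labels for the (left, right)
# timestamp pairs; the dataset itself only names the individual entries.
STATE_GROUPS = [
    ("eef_euler", 14),
    ("eef_quat", 16),
    ("eef6d", 20),
    ("eef_time", 2),    # eef_left_time, eef_right_time
    ("qpos", 14),
    ("qvel", 14),
    ("effort", 14),
    ("qpos_time", 2),   # qpos_left_time, qpos_right_time
]


def build_state_slices(groups):
    """Map each group label to the slice it occupies in the state vector."""
    slices, start = {}, 0
    for label, width in groups:
        slices[label] = slice(start, start + width)
        start += width
    return slices, start


STATE_SLICES, STATE_DIM = build_state_slices(STATE_GROUPS)

# Path templates copied verbatim from meta/info.json.
DATA_PATH = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
VIDEO_PATH = "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"


def data_file(chunk_index, file_index):
    """Relative path of one parquet data file."""
    return DATA_PATH.format(chunk_index=chunk_index, file_index=file_index)


def video_file(video_key, chunk_index, file_index):
    """Relative path of one video file for a given camera key."""
    return VIDEO_PATH.format(
        video_key=video_key, chunk_index=chunk_index, file_index=file_index
    )
```

With these slices, `state[STATE_SLICES["qpos"]]` selects the 14 joint positions from a single state vector, and the summed widths can be checked against the declared shape of 96.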
## Note
The action labels in this dataset have a known issue: the left-arm and right-arm actions are not perfectly synchronized.
We recommend re-interpolating them onto a common time base using the provided timestamps. See our official code for [guidance](https://github.com/2toinf/X-VLA/blob/ae1d91f7581af39f080b33f71af72ddcac3457e1/datasets/domain_handler/real_world.py#L53).
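As a rough illustration of what re-interpolation can look like, the sketch below linearly resamples each arm's action columns from its own clock onto a shared reference clock. It is not the official X-VLA implementation, which remains the authoritative reference. Several things here are assumptions: that the first 7 action dimensions belong to the left arm and the last 7 to the right arm, that per-arm timestamps such as `qpos_left_time` and `qpos_right_time` are the relevant clocks, and that linear interpolation is adequate. The function names are hypothetical.

```python
from bisect import bisect_right


def interp_linear(t, src_times, src_values):
    """Linearly interpolate at time t, clamping to the end values outside the range.

    src_times must be sorted in increasing order.
    """
    if t <= src_times[0]:
        return src_values[0]
    if t >= src_times[-1]:
        return src_values[-1]
    hi = bisect_right(src_times, t)   # first index with src_times[hi] > t
    lo = hi - 1
    w = (t - src_times[lo]) / (src_times[hi] - src_times[lo])
    return (1.0 - w) * src_values[lo] + w * src_values[hi]


def synchronize_actions(actions, left_times, right_times, ref_times, n_left=7):
    """Resample both arms onto ref_times.

    actions: list of per-frame rows. Columns [0, n_left) are ASSUMED to be the
    left arm and the remaining columns the right arm; verify this against the
    official code before relying on it.
    """
    n_dims = len(actions[0])
    # Transpose once so each action dimension is a time series.
    columns = [[row[j] for row in actions] for j in range(n_dims)]
    synced = []
    for t in ref_times:
        synced.append([
            interp_linear(t, left_times if j < n_left else right_times, columns[j])
            for j in range(n_dims)
        ])
    return synced
```

For example, `synchronize_actions(actions, left_times, right_times, left_times)` would keep the left arm on its own clock and move the right arm onto it. For large episodes, `numpy.interp` applied per column performs the same linear interpolation more efficiently.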
## Citation
If you find this dataset helpful for your project, please cite our work:
**BibTeX:**
```bibtex
@article{zheng2025x,
title = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
author = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui
and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
journal = {arXiv preprint arXiv:2510.10274},
year = {2025}
}
```
Provider:
maas
Created:
2025-12-04



