Language-Guided Long Horizon Manipulation with LLM-based Planning and Visual Perception

Name: Language-Guided Long Horizon Manipulation with LLM-based Planning and Visual Perception
Creator: Yanmin Zhou
License: No description available

Indexed from IEEE on 2026-04-17

Download link:

https://ieee-dataport.org/documents/language-guided-long-horizon-manipulation-llm-based-planning-and-visual-perception

Description:

Language-guided long-horizon manipulation of deformable objects presents significant challenges due to high degrees of freedom, complex dynamics, and the need for accurate vision-language grounding. In this work, we focus on multi-step cloth folding, a representative deformable-object manipulation task that requires both structured long-horizon planning and fine-grained visual perception. To this end, we propose a unified framework that integrates a Large Language Model (LLM)-based planner, a Vision-Language Model (VLM)-based perception system, and a task execution module. Specifically, the LLM-based planner decomposes high-level language instructions into low-level action primitives, bridging the semantic-execution gap, aligning perception with action, and enhancing generalization. The VLM-based perception module employs a SigLIP2-driven architecture with a novel bidirectional cross-attention fusion mechanism and Weight-Decomposed Low-Rank Adaptation (DoRA)-based fine-tuning to achieve language-conditioned fine-grained visual grounding. Experiments in both simulation and real-world settings demonstrate the method's effectiveness. In simulation, it outperforms state-of-the-art (SOTA) baselines, achieving improvements of 2.23%, 1.87%, and 33.3% on seen instructions, unseen instructions, and unseen tasks, respectively. On a real robot, it robustly executes multi-step folding sequences from language instructions across diverse cloth materials and configurations, demonstrating strong generalization in practical scenarios.
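
The description names DoRA-based fine-tuning but does not give the adapter configuration. As a rough illustration only, the sketch below shows the general DoRA reparameterization: a frozen weight W0 receives a low-rank update B·A, each column of the result is normalized, and a learned per-column magnitude m rescales it. All function names, shapes, and values here are hypothetical and are not taken from the authors' code or data.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of a (n x r) and b (r x k)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_norms(w):
    """L2 norm of each column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]

def dora_weight(w0, lora_b, lora_a, magnitude):
    """Merged weight m * (W0 + B A) / ||W0 + B A||_c (column-wise norm).

    w0:        frozen pretrained weight, shape d x k
    lora_b:    trainable low-rank factor, shape d x r
    lora_a:    trainable low-rank factor, shape r x k
    magnitude: trainable per-column magnitude, length k
    """
    delta = matmul(lora_b, lora_a)
    directed = [[w0[i][j] + delta[i][j] for j in range(len(w0[0]))]
                for i in range(len(w0))]
    norms = column_norms(directed)
    return [[magnitude[j] * directed[i][j] / norms[j]
             for j in range(len(w0[0]))] for i in range(len(w0))]

# Hypothetical toy values (d = 2, k = 2, rank r = 1).
w0 = [[3.0, 0.0], [4.0, 2.0]]
lora_a = [[0.5, -0.5]]
lora_b_init = [[0.0], [0.0]]       # B starts at zero, so B A = 0
magnitude = column_norms(w0)       # m starts at the column norms of W0

merged_init = dora_weight(w0, lora_b_init, lora_a, magnitude)

# After a (made-up) update to B, the direction changes but each
# column of the merged weight keeps the norm given by `magnitude`.
lora_b_trained = [[1.0], [0.0]]
merged_trained = dora_weight(w0, lora_b_trained, lora_a, magnitude)
```

With B initialized to zero and m initialized to the column norms of W0, the merged weight equals W0, so fine-tuning starts from the pretrained model. Magnitude and direction are then trained as separate parameters.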

Provided by:

Yanmin Zhou
