Semantic Alignment and Locality-Driven Open-Vocabulary Semantic Segmentation Model
Source: Science Data Bank. Updated: 2025-12-29. Indexed: 2026-04-23.
Download link:
https://www.scidb.cn/detail?dataSetId=67f8eaa41ad94ad8ab510416760252ae
Resource description:
Objective: Embodied-intelligence and humanoid-robot systems must perceive and make decisions in real time in real environments, yet existing vision–language models localize imprecisely and are sensitive to scale and viewpoint when used for pixel-level segmentation. This work proposes a training-free inference paradigm that improves the pixel-level discriminative quality and engineering usability of open-vocabulary semantic segmentation without altering the pretrained representations.

Methods: The TG-CLIP framework uses a frozen CLIP model as its feature backbone and adds two inference-time operators that enhance local and cross-scale semantics. The first, text-guided recalibration, treats text queries as conditioning signals to semantically reconstruct and reproject patch-level representations. The second, multi-view consistency inference, fuses predictions from resampled and mirrored views of the input so that predictions consistent across views are amplified. Both operators run entirely in the forward pass and require no additional annotations or network fine-tuning.

Results: On eight public benchmarks, TG-CLIP achieves an average mIoU of 45.5%, outperforming multiple existing methods and exceeding the second-best method, ProxyCLIP (40.1%), by 4.4 percentage points. Qualitative comparisons further show that TG-CLIP better preserves target details and reduces mis-segmentation and missed detections in complex backgrounds and fine-grained structures, corroborating the quantitative results. Ablation and hyperparameter studies indicate that the gains from the two operators are cumulative: text-guided recalibration is robust to its temperature parameters, while multi-view consistency inference exhibits a quantifiable accuracy–efficiency trade-off with respect to the scale and flipping strategies, and the recommended deployment configuration balances the two well in practice.

Conclusion: While preserving the zero-shot capability of vision–language models, TG-CLIP significantly improves pixel-level semantic consistency and robustness to scale and viewpoint variations through two lightweight inference-time operators, providing an easily deployable and robust engineering solution for training-free open-vocabulary semantic segmentation.
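The multi-view consistency inference described above can be illustrated with a minimal sketch. This is not the TG-CLIP implementation: the description does not say how the views are fused, so plain averaging of per-view class probabilities is an assumption here, as are the names `multi_view_fuse`, `resize_nearest`, and `toy_predict`, the example scale values, and the use of nearest-neighbour resizing. The `predict_fn` argument stands in for any frozen segmenter that maps an image of shape (H, W, 3) to class probabilities of shape (H, W, C).

```python
import numpy as np

def resize_nearest(arr, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, ...) array via index lookup."""
    h, w = arr.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return arr[rows[:, None], cols[None, :]]

def multi_view_fuse(image, predict_fn, scales=(0.75, 1.0, 1.25), use_flip=True):
    """Average class probabilities over rescaled and mirrored views.

    Averaging is an assumed fusion rule, not one taken from the source.
    Returns (fused probabilities of shape (H, W, C), label map of shape (H, W)).
    """
    if not scales:
        raise ValueError("at least one scale is required")
    h, w = image.shape[:2]
    fused, n_views = None, 0
    for s in scales:
        sh, sw = max(1, round(h * s)), max(1, round(w * s))
        view = resize_nearest(image, sh, sw)
        candidates = [(view, False)]
        if use_flip:
            candidates.append((view[:, ::-1], True))  # horizontal mirror
        for v, flipped in candidates:
            probs = predict_fn(v)
            if flipped:
                probs = probs[:, ::-1]                # undo the mirror
            probs = resize_nearest(probs, h, w)       # back to input size
            fused = probs if fused is None else fused + probs
            n_views += 1
    fused = fused / n_views
    return fused, fused.argmax(axis=-1)

def toy_predict(img):
    """Hypothetical two-class predictor: brighter pixels favour class 1."""
    p1 = img.mean(axis=-1)
    return np.stack([1.0 - p1, p1], axis=-1)

# Toy input: dark left half, bright right half.
image = np.zeros((8, 8, 3))
image[:, 4:] = 1.0
probs, labels = multi_view_fuse(image, toy_predict, scales=(0.5, 1.0, 2.0))
```

In this sketch the number of forward passes is `len(scales)`, doubled when `use_flip` is set, which is one way to see where an accuracy–efficiency trade-off over scale and flipping strategies comes from.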
Provided by:
Xinjing Wang; Qingdao University of Science and Technology
Created:
2025-12-29



