A Comparison on Visual Grounding

Name: A Comparison on Visual Grounding
Creator: Open Research Knowledge Graph
Published: 2025-06-16 16:20:08
License: No description available

DataCite Commons: updated 2025-06-16, indexed 2026-05-04

Download link:

https://orkg.org/comparison/R1387568

Description:

Visual grounding, also known as Referring Expression Comprehension (REC), is a fundamental multimodal AI task: localizing a specific object within an image from a descriptive natural-language query. The task is challenging because models must parse complex, sometimes ambiguous language and understand fine-grained visual details and relationships. The table below compares influential academic papers that have shaped the field. It summarizes each paper's core architectural schema (from early two-stage methods to modern Transformer-based models and Large Vision-Language Models, LVLMs) and the specific techniques employed, such as Chain-of-Thought reasoning, Mixture-of-Experts, and self-consistent explanations. The table also highlights each work's key contribution and lists the primary benchmark datasets used for evaluation, including classics like RefCOCO and newer ones, giving an overview of the field's evolution and key methodologies.
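To make the task definition above concrete, here is a minimal sketch of how a single REC prediction might be scored. It assumes the common convention of counting a predicted box as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5; that threshold, and the names `Box`, `iou`, and `is_correct`, are illustrative assumptions and are not taken from the comparison itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Clamp to zero so an empty or inverted box contributes no area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    inter = Box(
        max(a.x1, b.x1), max(a.y1, b.y1),
        min(a.x2, b.x2), min(a.y2, b.y2),
    ).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Box, gt: Box, threshold: float = 0.5) -> bool:
    """Assumed scoring rule: a prediction counts as correct when IoU >= threshold."""
    return iou(pred, gt) >= threshold


if __name__ == "__main__":
    # Hypothetical example: a query such as "the dog on the left" with one annotated box.
    ground_truth = Box(10, 20, 110, 220)
    prediction = Box(15, 25, 115, 215)
    print(is_correct(prediction, ground_truth))
```

Individual papers in the comparison may use other thresholds or metrics, so this should be read only as an illustration of the input (image and query) and output (one box) shape of the task.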

Provider:

Open Research Knowledge Graph

Created:

2025-06-16
