PaDT-MLLM/RefCOCO
Source: Hugging Face. Updated 2025-10-10. Indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/PaDT-MLLM/RefCOCO
Description:
PaDT is an approach for multimodal large language models (MLLMs) that lets the model directly generate both textual and visual outputs. It represents visual targets with Visual Reference Tokens (VRTs), which carry more semantic information and align more closely with actual objects than text-based bounding box coordinates. PaDT has been validated on four major visual perception and understanding tasks, where it achieves state-of-the-art performance. Its success is attributed to native vision-language alignment, dynamic visual binding, a unified token space, a lightweight decoder, and strong multi-task generalization.
Provider:
PaDT-MLLM