PaDT-MLLM/ReferringImageCaptioning

Name: PaDT-MLLM/ReferringImageCaptioning
Creator: PaDT-MLLM
Published: 2025-10-10 04:10:37
License: not specified

Source: Hugging Face. Updated 2025-10-10; indexed 2025-10-25.

Download link:

https://hf-mirror.com/datasets/PaDT-MLLM/ReferringImageCaptioning

Description:

PaDT is a unified paradigm that enables multimodal large language models (MLLMs) to directly generate both textual and visual outputs. At its core are Visual Reference Tokens (VRTs), which let an MLLM represent visual targets directly through visual patches rather than through text-based bounding-box coordinates. PaDT reports state-of-the-art performance on four major visual perception and understanding tasks. Its stated advantages include native vision-language alignment, dynamic visual binding, a unified token space, a lightweight decoder, and strong multi-task generalization.
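To make the contrast between coordinate text and patch-based reference concrete, the following is a toy sketch only. It maps a pixel bounding box to the row-major indices of the image patches it overlaps. The function name `box_to_patch_indices`, the square patch grid, and the patch size are all assumptions made for illustration; none of this is taken from PaDT, and it does not reproduce how VRTs are actually built or decoded.

```python
from typing import List, Tuple


def box_to_patch_indices(
    box: Tuple[float, float, float, float],
    image_size: Tuple[int, int],
    patch_size: int,
) -> List[int]:
    """Return row-major indices of the patches that a pixel box overlaps.

    Hypothetical helper for illustration, not part of PaDT.
    box is (x_min, y_min, x_max, y_max) in pixels; image_size is (width, height).
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    cols = width // patch_size   # patches per row
    rows = height // patch_size  # patches per column

    # Convert pixel coordinates to patch-grid coordinates, clamped to the grid.
    first_col = max(0, int(x_min // patch_size))
    first_row = max(0, int(y_min // patch_size))
    # Subtract a tiny epsilon so a box that ends exactly on a patch border
    # does not pull in the neighbouring patch.
    last_col = min(cols - 1, int((x_max - 1e-9) // patch_size))
    last_row = min(rows - 1, int((y_max - 1e-9) // patch_size))

    return [
        r * cols + c
        for r in range(first_row, last_row + 1)
        for c in range(first_col, last_col + 1)
    ]
```

For a 64x64 image split into 16-pixel patches (a 4x4 grid), the box `(16, 16, 48, 32)` covers two patches in the second row, so the same region can be named either by four coordinate numbers or by a short list of patch indices.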

Provided by:

PaDT-MLLM
