ReferringImageCaptioning
Source: ModelScope community (updated 2025-12-05, indexed 2025-11-03)
Download link: https://modelscope.cn/datasets/PaDT-MLLM/ReferringImageCaptioning
Description:
<div align='center'><h1>Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs</h1></div>
<font size=4><div align='center'>[[🔗 Released Code](https://github.com/Gorilla-Lab-SCUT/PaDT)]
[[🤗 Datasets](https://huggingface.co/collections/PaDT-MLLM/padt-dataset-68e400440ffb8c8f95e5ee20)] [[🤗 Checkpoints](https://huggingface.co/collections/PaDT-MLLM/padt-68e3f5c22e8ecbd6d0d13d43)]</div></font>
<font size=4><div align='center'>[[📄 Tech Report](https://arxiv.org/abs/2510.01954)] [[🤗 Paper](https://huggingface.co/papers/2510.01954)]</div></font>
<div align="center">
<img src="./assets/Pipeline.webp" width="900"/>
<p>Figure A. PaDT pipeline.</p>
</div>
## 🌟 Introduction
We are pleased to introduce **Patch-as-Decodable Token (PaDT)**, a unified paradigm that enables multimodal large language models (MLLMs) to directly generate both textual and visual outputs.
At the core of PaDT are **Visual Reference Tokens (VRTs)**. Unlike conventional MLLMs that represent visual targets using text-based bounding box coordinates (which are often less semantic and poorly aligned with the actual objects, as shown in Figure B), PaDT allows MLLMs to represent visual targets directly through visual patches. These VRTs let the model reason about visual information within the output sequence in a more natural and direct way.
By introducing VRTs, we achieve **semantic reasoning and object-specific visual token prediction** within the MLLM’s autoregressive generation process. A lightweight, plug-and-play PaDT decoder then decodes the predicted visual tokens into **low-level outputs** such as localization results or segmentation masks.
As illustrated in Figure C, we have validated PaDT across four major visual perception and understanding tasks. In all cases, PaDT achieves **state-of-the-art** performance compared to conventional character-by-character coordinate-generation MLLMs.
### Why Does PaDT Succeed?
PaDT's success stems from directly addressing the visual-capability bottlenecks of MLLMs:
1. **Native Vision-Language Alignment**: Instead of “fitting” vision into text space, PaDT directly treats visual patches as decodable tokens, achieving seamless modality alignment.
2. **Dynamic Visual Binding**: A dynamic embedding mechanism tightly binds Visual Reference Tokens (VRTs) to each image, preventing cross-image confusion.
3. **Unified Token Space**: Enables the LLM to handle language and vision uniformly, simplifying training and improving consistency.
4. **Lightweight Decoder**: Decouples dense prediction from the LLM, preserving its semantic reasoning while adding precise spatial output capability.
5. **Strong Multi-Task Generalization**: The PaDT Pro model, jointly trained on REC/RES/OVD/RIC, can switch tasks via prompts and outperforms single-task models.
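To make the idea of dynamic visual binding more concrete, here is a toy sketch in plain Python. It assumes, purely for illustration, that the patch embeddings of the current image are appended after the text vocabulary, so any generated id at or above the text vocabulary size refers to a patch of that image. The names `build_dynamic_table` and `split_output_ids` are hypothetical, and the released implementation may work differently.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def build_dynamic_table(text_embeddings: Sequence[Vector],
                        patch_embeddings: Sequence[Vector]) -> List[Vector]:
    """Append one image's patch embeddings after the text vocabulary.

    Assumption for illustration: the table is rebuilt per image, so ids
    beyond the text vocabulary always point at patches of that image.
    """
    return list(text_embeddings) + list(patch_embeddings)


def split_output_ids(token_ids: Sequence[int],
                     text_vocab_size: int) -> Tuple[List[int], List[int]]:
    """Separate ordinary text ids from patch references in an output sequence."""
    text_ids: List[int] = []
    patch_indices: List[int] = []
    for token_id in token_ids:
        if token_id < text_vocab_size:
            text_ids.append(token_id)
        else:
            # Offset back into the image's own patch list.
            patch_indices.append(token_id - text_vocab_size)
    return text_ids, patch_indices


if __name__ == "__main__":
    text_table = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]  # 4 text tokens
    patches = [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]]                 # 3 image patches
    table = build_dynamic_table(text_table, patches)
    text_ids, patch_ids = split_output_ids([1, 5, 2, 6], len(text_table))
    print(len(table), text_ids, patch_ids)
```

Because the patch rows are rebuilt for every image in this sketch, a patch reference can never point at a different image, which is the intuition behind "preventing cross-image confusion".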
We hope this work will **inspire further exploration** in the community:
- What does true multimodal reasoning look like?
- And is a purely text-based output ever sufficient for visual reasoning?
<div align="center">
<img src="./assets/Motivation.webp" width="900"/>
<p>Figure B. Some observations on conventional character-by-character coordinate-generation MLLMs and our PaDT.</p>
</div>
<div align="center">
<img src="./assets/TaskIntroduction.webp" width="900"/>
<p>Figure C. PaDT works on four visual perception and understanding tasks.</p>
</div>
## Quick Start
Clone this repo, and set up the environment with a few commands.
```bash
git clone https://github.com/Gorilla-Lab-SCUT/PaDT.git
cd PaDT
conda create -n PaDT python=3.11
conda activate PaDT
bash setup.sh
```
The following snippet shows how to run inference with PaDT.
```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info
from PaDT import PaDTForConditionalGeneration, VisonTextProcessingClass, parseVRTintoCompletion

TEST_IMG_PATH = "./eval/imgs/000000368335.jpg"
MODEL_PATH = "PaDT-MLLM/PaDT_Pro_3B"

# load model
model = PaDTForConditionalGeneration.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16, device_map={"": 0})

# load processor
processor = AutoProcessor.from_pretrained(MODEL_PATH)
processor = VisonTextProcessingClass(processor, model.config.vision_config.spatial_merge_size)
processor.prepare(model.model.embed_tokens.weight.shape[0])

# question prompt
PROMPT = "Please describe this image."

# construct conversation
message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": TEST_IMG_PATH
            }, {
                "type": "text",
                "text": PROMPT
            }
        ]
    }
]

text = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(message)
prompt_inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    padding_side="left",
    return_tensors="pt",
    add_special_tokens=False
).to("cuda:0")

# generate
with torch.inference_mode():
    generate_returned_result = model.generate(**prompt_inputs, use_cache=True, max_new_tokens=1024, do_sample=False,
                                              output_hidden_states=True, return_dict_in_generate=True)
    # keep only the newly generated tokens (drop the prompt)
    prompt_length = prompt_inputs["input_ids"].size(1)
    completion_ids = generate_returned_result['sequences'][:, prompt_length:]

    # extract Visual Reference Tokens within the sequence
    completions, feats, labels, vrts, vrts_feats = parseVRTintoCompletion(processor, completion_ids, generate_returned_result['hidden_states'], torch.Tensor([False]))
    print("\ngenerate result:", completions[0])

    # decode low-level visual task results
    low_res_image_embeds = generate_returned_result.past_image_embeds
    high_res_image_embeds = generate_returned_result.past_high_res_image_embeds
    visual_pe = generate_returned_result.past_visual_pe
    decoded_list = model.vl_decode(feats, low_res_image_embeds, high_res_image_embeds, prompt_inputs['image_grid_thw'], visual_pe)
    print(f"\npred_bboxes: {decoded_list['pred_boxes']},\npred_scores: {decoded_list['pred_score'].sigmoid()}\n")
```
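The decoder returns raw scores, which the snippet above passes through a sigmoid before printing. A common follow-up step is to keep only confident predictions. The helper below is a minimal sketch in plain Python, assuming one logit per box and a threshold of 0.5; the name `filter_by_score`, the threshold, and the flat list inputs are illustrative assumptions and not part of the PaDT API.

```python
import math
from typing import List, Sequence, Tuple

Box = Sequence[float]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def filter_by_score(boxes: Sequence[Box], logits: Sequence[float],
                    threshold: float = 0.5) -> List[Tuple[Box, float]]:
    """Return (box, probability) pairs whose probability reaches the threshold."""
    if len(boxes) != len(logits):
        raise ValueError("expected one logit per box")
    kept = []
    for box, logit in zip(boxes, logits):
        prob = sigmoid(logit)
        if prob >= threshold:
            kept.append((box, prob))
    return kept
```

With tensor outputs, the boxes and scores would first need converting to plain lists (for example via `.tolist()`), provided their shapes match the one-logit-per-box assumption.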
## Models
- PaDT_OVD: Trained on COCO2017 training set.
- PaDT_REC: Trained on RefCOCO/+/g training set.
- PaDT_RIC: Trained on Referring Image Captioning training set.
- PaDT_Pro: Trained on the combined set of COCO2017, RefCOCO/+/g and Referring Image Captioning training sets.
| Model | Base VLM | Checkpoint | Task Type |
| - | - | - | - |
| PaDT_OVD_3B | Qwen2.5VL-3B | [PaDT-MLLM/PaDT_OVD_3B](https://huggingface.co/PaDT-MLLM/PaDT_OVD_3B) | Open Vocabulary Detection |
| PaDT_REC_3B | Qwen2.5VL-3B | [PaDT-MLLM/PaDT_REC_3B](https://huggingface.co/PaDT-MLLM/PaDT_REC_3B) | Referring Expression Comprehension/Segmentation |
| PaDT_RIC_3B | Qwen2.5VL-3B | [PaDT-MLLM/PaDT_RIC_3B](https://huggingface.co/PaDT-MLLM/PaDT_RIC_3B) | Referring Image Captioning |
| PaDT_Pro_3B | Qwen2.5VL-3B | [PaDT-MLLM/PaDT_Pro_3B](https://huggingface.co/PaDT-MLLM/PaDT_Pro_3B) | ALL |
| PaDT_OVD_7B | Qwen2.5VL-7B | [PaDT-MLLM/PaDT_OVD_7B](https://huggingface.co/PaDT-MLLM/PaDT_OVD_7B) | Open Vocabulary Detection |
| PaDT_REC_7B | Qwen2.5VL-7B | [PaDT-MLLM/PaDT_REC_7B](https://huggingface.co/PaDT-MLLM/PaDT_REC_7B) | Referring Expression Comprehension/Segmentation |
| PaDT_RIC_7B | Qwen2.5VL-7B | [PaDT-MLLM/PaDT_RIC_7B](https://huggingface.co/PaDT-MLLM/PaDT_RIC_7B) | Referring Image Captioning |
| PaDT_Pro_7B | Qwen2.5VL-7B | [PaDT-MLLM/PaDT_Pro_7B](https://huggingface.co/PaDT-MLLM/PaDT_Pro_7B) | ALL |
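
The checkpoint names in the table follow a regular `PaDT_<task>_<size>` pattern, so a repository ID can be assembled programmatically. The helper below is a small convenience sketch; `checkpoint_id` is a hypothetical name and not part of the released code.

```python
TASKS = {"OVD", "REC", "RIC", "Pro"}
SIZES = {"3B", "7B"}


def checkpoint_id(task: str, size: str) -> str:
    """Build the Hugging Face repository ID for a PaDT checkpoint."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASKS)}")
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {sorted(SIZES)}")
    return f"PaDT-MLLM/PaDT_{task}_{size}"


if __name__ == "__main__":
    print(checkpoint_id("Pro", "3B"))
```

The returned string can be used as `MODEL_PATH` in the Quick Start snippet.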
## Showcase
Here are some randomly selected test examples showcasing PaDT’s excellent performance.
- Referring Expression Comprehension/Segmentation and Open Vocabulary Detection Tasks
<div align="center">
<img src="./assets/REC_OVD.webp" width="900"/>
</div>
- Referring Image Captioning Task
<div align="center">
<img src="./assets/RIC.webp" width="900"/>
</div>
- Token Activation Map Comparison
<div align="center">
<img src="./assets/TAM.webp" width="900"/>
</div>
## Training Instructions
Download Datasets:
- [COCO](https://cocodataset.org/#home)
- RefCOCO/+/g
```bash
wget https://web.archive.org/web/20220413011718/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip
wget https://web.archive.org/web/20220413011656/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip
wget https://web.archive.org/web/20220413012904/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip
```
Unpack these datasets and place them under the following directory:
```
PaDT/
├── dataset/
│   ├── coco/
│   │   ├── annotations/
│   │   ├── train2014/
│   │   ├── train2017/
│   │   ├── val2014/
│   │   └── val2017/
│   └── RefCOCO/
│       ├── refcoco/
│       ├── refcoco+/
│       └── refcocog/
```
Preprocess the datasets:
- 1. Preprocess with our scripts (first update the dataset path configuration in the preprocessing scripts).
```bash
cd src/preprocess
python process_coco.py
python process_refcoco.py
```
- 2. Alternatively, use the preprocessed, training-ready datasets we have released on Hugging Face.
| Dataset | Dataset Path | Task Type |
| - | - | -|
| COCO | [PaDT-MLLM/COCO](https://huggingface.co/datasets/PaDT-MLLM/COCO) | Open Vocabulary Detection |
| RefCOCO | [PaDT-MLLM/RefCOCO](https://huggingface.co/datasets/PaDT-MLLM/RefCOCO) | Referring Expression Comprehension/Segmentation |
| RIC | [PaDT-MLLM/ReferringImageCaptioning](https://huggingface.co/datasets/PaDT-MLLM/ReferringImageCaptioning) | Referring Image Captioning |
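
One way to fetch these preprocessed datasets is the `snapshot_download` function from the `huggingface_hub` package. The sketch below assumes that package is installed; the helper name `download_dataset` and the target layout `dataset/<name>` are illustrative assumptions, so adjust the paths to whatever the training scripts expect.

```python
from pathlib import Path
from typing import Dict

# Repository IDs taken from the table above.
DATASET_REPOS: Dict[str, str] = {
    "COCO": "PaDT-MLLM/COCO",
    "RefCOCO": "PaDT-MLLM/RefCOCO",
    "RIC": "PaDT-MLLM/ReferringImageCaptioning",
}


def download_dataset(name: str, target_root: str = "dataset") -> Path:
    """Download one preprocessed dataset snapshot into target_root/<name>."""
    # Imported lazily so the mapping above is usable without the package installed.
    from huggingface_hub import snapshot_download

    local_dir = Path(target_root) / name  # assumed layout, not prescribed by PaDT
    snapshot_download(
        repo_id=DATASET_REPOS[name],
        repo_type="dataset",
        local_dir=str(local_dir),
    )
    return local_dir
```

Calling `download_dataset("RIC")` would then place the Referring Image Captioning data under `dataset/RIC`.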
The training scripts in `run_scripts` are ready to execute.
For example, to train the PaDT-Pro 3B model on a single node with eight 96 GB GPUs:
```bash
bash ./run_scripts/padt_pro_3b_sft.sh
```
## Evaluation
We provide a simple inference example in `eval/test_demo.py`. More evaluation scripts will be added soon.
## License Agreement
PaDT is licensed under Apache 2.0.
## References
- [Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs](https://huggingface.co/papers/2510.01954)
## Citation
If you find our work useful, please consider citing it.
```
@misc{su2025patchasdecodabletokenunifiedmultimodalvision,
  title={Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs},
  author={Yongyi Su and Haojie Zhang and Shijie Li and Nanqing Liu and Jingyi Liao and Junyi Pan and Yuan Liu and Xiaofen Xing and Chong Sun and Chen Li and Nancy F. Chen and Shuicheng Yan and Xulei Yang and Xun Xu},
  year={2025},
  eprint={2510.01954},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.01954},
}
```
Provider: maas
Created: 2025-10-16



