MultiVL
Source: ModelScope community · Updated 2025-12-26 · Indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/PKU-Alignment/MultiVL
Resource description:
# InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback
[🏠 Homepage](https://pku-intermt.github.io/) | [🤗 InterMT Dataset](https://huggingface.co/datasets/PKU-Alignment/InterMT) | [👍 InterMT-Bench](https://github.com/cby-pku/INTERMT)
## Abstract
As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: ***What essential capabilities are still missing?***
A critical aspect of human learning is continuous interaction with the environment -- not limited to language, but also involving multimodal understanding and generation.
To move closer to human-level intelligence, models must similarly support **multi-turn**, **multimodal interaction**. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges.
In this work, we present **an initial exploration** through *InterMT*, **the first preference dataset for *multi-turn* multimodal interaction**, grounded in real human feedback. Because current MLLMs lack such complex interactive capabilities, we place particular emphasis on human oversight and introduce expert annotations to guide the process. *InterMT* captures human preferences at both global and local levels across nine sub-dimensions and consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs.
To compensate for current models' limited capability in multimodal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances.
To further this goal, we introduce *InterMT-Bench* to assess the ability of MLLMs in assisting judges with multi-turn, multimodal tasks.
We demonstrate the utility of *InterMT* through applications such as judge moderation and further reveal the *multi-turn scaling law* of judge models.
We hope that open-sourcing our data will facilitate further research on aligning current MLLMs and taking them to the next step.

## InterMT
The InterMT dataset includes: (1) carefully crafted *seed questions* for multi-turn, multimodal conversations, and (2) fine-grained human preference annotations at both local and global conversation levels. Inspired by theories from linguistics, human-computer interaction, and cognitive psychology, the seed questions are rigorously selected and refined to enable more faithful simulation of real-world multi-turn understanding and generation tasks.
We collect preference data through score evaluations and pairwise comparisons of multi-modal responses at each conversation turn, based on four sub-dimensions. Global conversation helpfulness is then evaluated via five sub-dimensions. Incorporating natural language feedback further improves annotation quality and alignment with human intent.
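The two annotation levels described above can be pictured as a single record type. The following is a minimal sketch only, not the dataset's actual schema: the class name, field names, and the placeholder sub-dimension names (`local_dim_1`, `global_dim_1`, and so on) are all hypothetical, since this card states only the counts (four local, five global sub-dimensions).

```python
from dataclasses import dataclass
from typing import Dict, Literal, Optional

# Hypothetical placeholder names: the card gives only the number of
# sub-dimensions (4 local + 5 global = 9), not what they are called.
LOCAL_DIMS = tuple(f"local_dim_{i}" for i in range(1, 5))
GLOBAL_DIMS = tuple(f"global_dim_{i}" for i in range(1, 6))


@dataclass
class PairwiseAnnotation:
    """One comparison of two candidate responses, A and B (illustrative only)."""

    level: Literal["local", "global"]
    scores_a: Dict[str, int]           # score evaluation of response A, per sub-dimension
    scores_b: Dict[str, int]           # score evaluation of response B, per sub-dimension
    preferred: Literal["a", "b"]       # outcome of the pairwise comparison
    turn_index: Optional[int] = None   # which turn was judged; local annotations only
    feedback: str = ""                 # natural language feedback from the annotator

    def __post_init__(self) -> None:
        expected = LOCAL_DIMS if self.level == "local" else GLOBAL_DIMS
        for label, scores in (("scores_a", self.scores_a), ("scores_b", self.scores_b)):
            if set(scores) != set(expected):
                raise ValueError(f"{label} must cover exactly {expected}")
        if self.level == "local" and self.turn_index is None:
            raise ValueError("local annotations need a turn_index")
```

Keeping local and global annotations in one type with a `level` tag is just one possible design; two separate record types would work equally well.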
The **Data Card** for InterMT is as follows:
1. InterMT is built from a corpus of 100k image-text examples, comprising 72.1% from open-source vision-language datasets, 22.8% from web data, and 5.1% from human-written content. All prompts are refined following constitutional guidelines to improve multi-turn compatibility, resulting in 15604 unique seed questions.
2. Each seed question is expanded via an agent-based multi-turn QA construction workflow, producing at least 8 multi-turn QA instances per prompt. After pruning and filtering, we obtain 52.6k high-quality multi-turn QA instances, with 41.92% containing five or more turns.
3. The resulting 52.6k QA instances cover 15+ vision-language understanding and generation tasks, such as image editing and visual tutorials. Each instance features interleaved textual and visual content in both inputs and outputs, with an average of 5.33 images per conversation.
4. InterMT features 32,459 human preference annotations, organized as score evaluations and pairwise comparisons at both the local and global levels. Preferences are decomposed into 9 dimensions of helpfulness, accompanied by human-written critiques, refinement suggestions, and rationales.
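As a rough sanity check, the percentages in the Data Card can be turned into approximate counts. This sketch treats "100k" and "52.6k" as exact, which they are not (both are rounded in the card), so the results are estimates; the function and variable names are illustrative.

```python
# Figures copied from the Data Card; "100k" and "52.6k" are rounded there.
CORPUS_SIZE = 100_000
SOURCE_SHARES = {
    "open_source_datasets": 0.721,
    "web_data": 0.228,
    "human_written": 0.051,
}
SEED_QUESTIONS = 15_604
MIN_INSTANCES_PER_SEED = 8
RETAINED_INSTANCES = 52_600
FIVE_PLUS_TURN_SHARE = 0.4192


def approximate_counts() -> dict:
    """Derive approximate absolute counts from the published shares."""
    by_source = {name: round(CORPUS_SIZE * share) for name, share in SOURCE_SHARES.items()}
    # "At least 8 instances per prompt" gives a lower bound on what was generated,
    # and therefore an upper bound on the fraction that survived filtering.
    generated_lower_bound = SEED_QUESTIONS * MIN_INSTANCES_PER_SEED
    return {
        "by_source": by_source,
        "generated_lower_bound": generated_lower_bound,
        "max_retention_rate": RETAINED_INSTANCES / generated_lower_bound,
        "five_plus_turn_instances": round(RETAINED_INSTANCES * FIVE_PLUS_TURN_SHARE),
    }


if __name__ == "__main__":
    for key, value in approximate_counts().items():
        print(key, value)
```

Under these assumptions, at least 124,832 instances were generated before pruning, so at most about 42% of them were retained, and roughly 22k of the retained instances have five or more turns.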
Data files: `local_prefernce.parquet` and `local_images.tar.gz`
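For a first look at `local_prefernce.parquet` and `local_images.tar.gz` after downloading them, something like the sketch below may help. It assumes both files sit at local paths you supply, makes no assumption about column names or archive layout (neither is documented here), and uses hypothetical helper names. Reading Parquet with pandas requires `pyarrow` or `fastparquet`.

```python
import tarfile
from itertools import islice


def list_archive_members(archive_path, limit=5):
    """Return up to `limit` regular-file names from a .tar.gz without extracting it."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        file_names = (member.name for member in archive if member.isfile())
        return list(islice(file_names, limit))


def load_preference_table(parquet_path):
    """Load a Parquet file into a DataFrame (pandas is imported lazily)."""
    import pandas as pd  # third-party; needs a Parquet engine such as pyarrow

    return pd.read_parquet(parquet_path)


if __name__ == "__main__":
    # Paths are assumptions: adjust to wherever the files were downloaded.
    print(list_archive_members("local_images.tar.gz"))
    table = load_preference_table("local_prefernce.parquet")
    print(table.columns.tolist())
    print(len(table))
```

Listing members before extracting is a cheap way to learn how the archive is organised and how its entries might map to rows in the preference table.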
For more details and information, please visit our [website](https://pku-intermt.github.io)
## Citation
Please cite the repo if you find the model or code in this repo useful 😊
```bibtex
@article{chen2025intermt,
title={InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback},
author={Boyuan Chen and Donghai Hong and Jiaming Ji and Jiacheng Zheng and Bowen Dong and Jiayi Zhou and Kaile Wang and Josef Dai and Xuyao Wang and Wenqi Chen and Qirui Zheng and Wenxin Li and Sirui Han and Yike Guo and Yaodong Yang},
year={2025},
institution={Peking University and Hong Kong University of Science and Technology},
url={https://pku-intermt.github.io},
keywords={Multimodal Learning, Multi-Turn Interaction, Human Feedback, Preference Alignment}
}
```
Provider:
maas
Created:
2025-04-22



