InterMT
ModelScope Community · Updated 2026-05-09 · Indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/PKU-Alignment/InterMT
Resource description:
# (NeurIPS 2025 Spotlight) InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback
[🏠 Homepage](https://pku-intermt.github.io/) | [🤗 InterMT Dataset](https://huggingface.co/datasets/PKU-Alignment/InterMT) | [👍 InterMT-Bench](https://github.com/cby-pku/INTERMT) | [📑 Paper](https://arxiv.org/abs/2505.23950)
## Abstract
As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: ***What essential capabilities are still missing?***
A critical aspect of human learning is continuous interaction with the environment -- not limited to language, but also involving multimodal understanding and generation.
To move closer to human-level intelligence, models must similarly support **multi-turn**, **multimodal interaction**. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges.
In this work, we present **an initial exploration** through *InterMT*, **the first preference dataset for *multi-turn* multimodal interaction**, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. *InterMT* captures human preferences at both global and local levels across nine sub-dimensions, and consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs.
To compensate for the lack of capability for multi-modal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances.
To further this goal, we introduce *InterMT-Bench* to assess the ability of MLLMs in assisting judges with multi-turn, multimodal tasks.
We demonstrate the utility of *InterMT* through applications such as judge moderation and further reveal the *multi-turn scaling law* of judge models.
We hope that open-sourcing our data will facilitate further research on taking the alignment of current MLLMs to the next step.

## InterMT
The InterMT dataset includes: (1) carefully crafted *seed questions* for multi-turn, multimodal conversations, and (2) fine-grained human preference annotations at both local and global conversation levels. Inspired by theories from linguistics, human-computer interaction, and cognitive psychology, the seed questions are rigorously selected and refined to enable more faithful simulation of real-world multi-turn understanding and generation tasks.
We collect preference data through score evaluations and pairwise comparisons of multi-modal responses at each conversation turn, based on four sub-dimensions. Global conversation helpfulness is then evaluated via five sub-dimensions. Incorporating natural language feedback further improves annotation quality and alignment with human intent.
The **Data Card** for InterMT is as follows:
1. InterMT is built from a corpus of 100k image-text examples, comprising 72.1% from open-source vision-language datasets, 22.8% from web data, and 5.1% from human-written content. All prompts are refined following constitutional guidelines to improve multi-turn compatibility, resulting in 15,604 unique seed questions.
2. Each seed question is expanded via an agent-based multi-turn QA construction workflow, producing at least 8 multi-turn QA instances per prompt. After pruning and filtering, we obtain 52.6k high-quality multi-turn QA instances, with 41.92% containing five or more turns.
3. The resulting 52.6k QA instances cover 15+ vision-language understanding and generation tasks, such as image editing and visual tutorials. Each instance features interleaved textual and visual content in both inputs and outputs, with an average of 5.33 images per conversation.
4. InterMT features 32,459 human preference annotations, organized as score evaluations and pairwise comparisons at both the local and global levels. Preferences are decomposed into 9 dimensions of helpfulness, accompanied by human-written critiques, refinement suggestions, and rationales.
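The figures in the data card can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation: it treats "52.6k" as exactly 52,600 (an assumption, since the card gives a rounded value), and all variable names are illustrative rather than part of the dataset.

```python
# Figures quoted in the data card. 52.6k is a rounded value,
# so every derived number below is approximate.
corpus_size = 100_000
source_shares = {
    "open-source datasets": 0.721,
    "web data": 0.228,
    "human-written": 0.051,
}
seed_questions = 15_604
qa_instances = 52_600            # "52.6k", assumed exact for this sketch
min_generated_per_seed = 8       # "at least 8" instances per seed question
share_five_plus_turns = 0.4192
avg_images_per_conversation = 5.33

# Approximate number of corpus examples per source.
source_counts = {name: round(corpus_size * share)
                 for name, share in source_shares.items()}

# Lower bound on instances generated before pruning and filtering.
min_generated = seed_questions * min_generated_per_seed

# Upper bound on the fraction that survived filtering.
max_retention = qa_instances / min_generated

# Approximate number of conversations with five or more turns.
five_plus_turns = round(qa_instances * share_five_plus_turns)

# Approximate total number of images across all conversations.
approx_images = round(qa_instances * avg_images_per_conversation)

print(source_counts)
print(min_generated, f"{max_retention:.1%}", five_plus_turns, approx_images)
```

Under these assumptions, at least 124,832 instances were generated before filtering, so at most roughly 42% of them were retained, and about 22k of the retained conversations have five or more turns.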
### Note
Local preference refers to the preference ranking among different responses within a specific turn of a multi-turn dialogue. It models local human preferences by comparing leaf nodes on the dialogue tree.
Global preference, in contrast, refers to the preference between two entire multi-turn dialogues (or sub-dialogues). It models holistic human preferences for the conversation by comparing the relative order of multiple subsequent sub-dialogues stemming from a common parent node.
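One way to picture the two preference types is as operations on a dialogue tree. The following is a minimal sketch under one reading of the definitions above, not the authors' implementation: the `Turn` class and the helper functions are hypothetical names, local comparisons are taken between candidate responses that share a parent, and global comparisons are taken between whole continuations that branch from that parent.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Turn:
    """One node of the dialogue tree (hypothetical structure)."""
    text: str
    children: list["Turn"] = field(default_factory=list)

    def add(self, text: str) -> "Turn":
        child = Turn(text)
        self.children.append(child)
        return child


def local_pairs(parent: Turn) -> list[tuple[str, str]]:
    """Pairs of alternative responses to the same turn (siblings)."""
    return [(a.text, b.text) for a, b in combinations(parent.children, 2)]


def paths(node: Turn) -> list[list[str]]:
    """Every continuation from `node` down to a leaf."""
    if not node.children:
        return [[node.text]]
    return [[node.text] + rest
            for child in node.children
            for rest in paths(child)]


def global_pairs(parent: Turn) -> list[tuple[list[str], list[str]]]:
    """Pairs of whole sub-dialogues stemming from a common parent."""
    subdialogues = [p for child in parent.children for p in paths(child)]
    return list(combinations(subdialogues, 2))


# A toy tree: one question, two candidate answers, then follow-up turns.
root = Turn("Q1")
a = root.add("A1a")
b = root.add("A1b")
a.add("A2a")
a.add("A2b")
b.add("A2c")

print(local_pairs(root))
print(len(global_pairs(root)))
```

In this toy tree, a local annotation would rank `A1a` against `A1b`, while a global annotation would rank entire continuations such as `A1a → A2a` against `A1b → A2c`.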
`local_prefernce.parquet` is paired with `local_images.tar.gz`.
`global_prefernce.parquet` is paired with `global_images.tar.gz`.
Because of storage constraints, the global preference dataset is hosted in a [separate repository](https://www.modelscope.cn/datasets/alignmentresearch/InterMT-Global-Preference).
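A minimal sketch of reading one split after downloading it, assuming the parquet file and its image archive sit in the same local directory. The file names are used exactly as spelled in this note. The helper names, the directory layout, and the output folder are assumptions, and the column schema of the parquet files is not described here, so the sketch does not rely on it.

```python
import tarfile
from pathlib import Path

# File names exactly as given in the note above (spelling as published).
PAIRS = {
    "local": ("local_prefernce.parquet", "local_images.tar.gz"),
    "global": ("global_prefernce.parquet", "global_images.tar.gz"),
}


def extract_images(archive: Path, out_dir: Path) -> list[str]:
    """Extract an image archive, refusing members that escape out_dir."""
    out_dir = out_dir.resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        members = tar.getmembers()
        for member in members:
            target = (out_dir / member.name).resolve()
            if target != out_dir and out_dir not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(out_dir)
        return [m.name for m in members if m.isfile()]


def load_split(data_dir: Path, split: str):
    """Return (DataFrame, extracted image names) for 'local' or 'global'.

    Hypothetical helper: assumes both files were downloaded into data_dir.
    """
    import pandas as pd  # requires pandas plus a parquet engine such as pyarrow

    parquet_name, archive_name = PAIRS[split]
    frame = pd.read_parquet(data_dir / parquet_name)
    images = extract_images(data_dir / archive_name,
                            data_dir / f"{split}_images")
    return frame, images
```

Calling `load_split(Path("InterMT"), "local")` would then return the annotation table together with the list of extracted image files, where `InterMT` is a placeholder for wherever the files were downloaded.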
For more details and information, please visit our [website](https://pku-intermt.github.io).
## Citation
Please cite the repo if you find the model or code in this repo useful 😊
```bibtex
@article{chen2025intermt,
title={InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback},
author={Boyuan Chen and Donghai Hong and Jiaming Ji and Jiacheng Zheng and Bowen Dong and Jiayi Zhou and Kaile Wang and Josef Dai and Xuyao Wang and Wenqi Chen and Qirui Zheng and Wenxin Li and Sirui Han and Yike Guo and Yaodong Yang},
year={2025},
institution={Peking University and Hong Kong University of Science and Technology},
url={https://pku-intermt.github.io},
keywords={Multimodal Learning, Multi-Turn Interaction, Human Feedback, Preference Alignment}
}
```
Provider: maas
Created: 2025-05-16



