MLLM-AS-A-JUDGE
Source: arXiv | Updated: 2024-02-07 | Indexed: 2024-06-21
Download link:
https://github.com/Dongping-Chen/MLLM-as-a-Judge
Description:
MLLM-AS-A-JUDGE is a comprehensive dataset developed by a research team at Lehigh University to evaluate how well multimodal large language models (MLLMs) act as judges in the vision-language domain. It contains 3,300 image-instruction pairs covering diverse tasks such as image captioning, mathematical reasoning, text reading, and infographic understanding. The dataset was built by carefully selecting 10 datasets covering distinct tasks, generating responses with four mainstream MLLMs, and having human evaluators rigorously annotate those responses. Its main application is studying how to bring MLLM judgments closer to human preferences, particularly in three settings: scoring evaluation, pairwise comparison, and batch ranking.
Provided by:
Lehigh University
Created:
2024-02-07



