MMBench_dev

ModelScope Community, updated 2025-12-05, indexed 2025-08-02
Download link:
https://modelscope.cn/datasets/HuggingFaceM4/MMBench_dev
Resource description:
# Dataset Card for "MMBench_dev"

## Dataset Description

* **Homepage**: https://opencompass.org.cn/mmbench
* **Repository**: https://github.com/internLM/OpenCompass/
* **Paper**: https://arxiv.org/abs/2307.06281
* **Leaderboard**: https://opencompass.org.cn/leaderboard-multimodal
* **Point of Contact**: opencompass@pjlab.org.cn

### Dataset Summary

In recent years, numerous vision-language (VL) models have been developed, such as MiniGPT-4 and LLaVA. These models show promising performance on previously challenging tasks. However, evaluating them effectively has become a primary challenge hindering further progress in large VL models. Traditional benchmarks such as VQAv2 and COCO Caption are widely used to provide quantitative evaluations for VL models but suffer from several shortcomings:

* **Dataset Construction**: Traditional benchmarks tend to evaluate models by their performance on broad tasks, such as image captioning and visual question answering. These tasks do not fully capture the fine-grained abilities that a model possesses, potentially impeding future optimization efforts.
* **Evaluation Metrics**: Existing evaluation metrics lack robustness. For example, VQAv2 targets a single word or phrase, while many current VL models generate sentences as outputs. Although these sentences may correctly answer the corresponding questions, the existing metric would assign a Fail score because they do not exactly match the given answer. Moreover, recently proposed subjective evaluation metrics, such as the one used in mPLUG-Owl, offer comprehensive evaluation of VL models. However, these metrics struggle to scale because of the significant amount of human labor they require. Such evaluations are also highly biased and difficult to reproduce.
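The exact-match problem described above can be illustrated with a small sketch. The `exact_match` function and the example strings are hypothetical and are a simplified stand-in for strict string scoring, not VQAv2's official evaluation script.

```python
def exact_match(prediction: str, answer: str) -> bool:
    """Strict scoring: the prediction must equal the reference answer
    (ignoring case and surrounding whitespace)."""
    return prediction.strip().lower() == answer.strip().lower()


reference = "umbrella"                                # hypothetical reference answer
short_pred = "Umbrella"                               # single-word prediction
sentence_pred = "The woman is holding an umbrella."   # sentence-style prediction

print(exact_match(short_pred, reference))      # True
print(exact_match(sentence_pred, reference))   # False, although the answer is correct
```

The sentence-style prediction contains the right answer but is still scored as a failure, which is the robustness gap the summary refers to.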
To address these limitations, we propose a novel approach: define a set of fine-grained abilities and collect relevant questions for each ability. We also introduce evaluation strategies that make the assessment of model predictions more robust. This new benchmark, called MMBench, has the following features:

* **Data Collection**: To date, we have gathered approximately 3000 questions spanning 20 ability dimensions. Each question is multiple-choice with a single correct answer.
* **Evaluation**: For a more reliable evaluation, we employ ChatGPT to match a model's prediction with the choices of a question, and then output the corresponding label (A, B, C, D) as the final prediction.

### Languages

All questions are presented in single-choice format, with 2 to 4 options each. All questions, options, and answers are in English.

## Dataset Structure

### Data Instances

An overview of an instance in MMBench is shown below:

```text
{
    "index": 241,
    "question": "Identify the question that Madelyn and Tucker's experiment can best answer.",
    "hint": "The passage below describes an experiment. Read the passage and then follow the instructions below.\n\nMadelyn applied a thin layer of wax to the underside of her snowboard and rode the board straight down a hill. Then, she removed the wax and rode the snowboard straight down the hill again. She repeated the rides four more times, alternating whether she rode with a thin layer of wax on the board or not. Her friend Tucker timed each ride. Madelyn and Tucker calculated the average time it took to slide straight down the hill on the snowboard with wax compared to the average time on the snowboard without wax.\nFigure: snowboarding down a hill.",
    "A": "Does Madelyn's snowboard slide down a hill in less time when it has a thin layer of wax or a thick layer of wax?",
    "B": "Does Madelyn's snowboard slide down a hill in less time when it has a layer of wax or when it does not have a layer of wax?",
    "image": xxxxxx,
    "category": "identity_reasoning",
    "l2-category": "attribute_reasoning",
    "split": "dev",
    "source": "scienceqa",
}
```

### Data Fields

* `index`: the index of the instance in the dataset.
* `question`: the question of the instance.
* `hint` (optional): the hint of the instance.
* `A`: the first option of the instance.
* `B`: the second option of the instance.
* `C` (optional): the third option of the instance.
* `D` (optional): the fourth option of the instance.
* `image`: the raw image of the instance.
* `category`: the leaf category of the instance.
* `l2-category`: the L-2 category of the instance.
* `split`: the split the instance belongs to.
* `source`: the source dataset the instance comes from.

### Data Splits

Currently, MMBench contains 2974 instances in total, split into **dev** and **test** sets at a 4:6 ratio.

## Additional Information

### Citation Information

```
@article{MMBench,
    author  = {Yuan Liu and Haodong Duan and Yuanhan Zhang and Bo Li and Songyang Zhang and Wangbo Zhao and Yike Yuan and Jiaqi Wang and Conghui He and Ziwei Liu and Kai Chen and Dahua Lin},
    journal = {arXiv:2307.06281},
    title   = {MMBench: Is Your Multi-modal Model an All-around Player?},
    year    = {2023},
}
```
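As a rough illustration of how the fields listed under Data Fields might be used, the sketch below assembles a text prompt from an instance and maps a free-form prediction back to an option label. Everything here is an assumption made for illustration: the helper names (`get_options`, `build_prompt`, `extract_label`), the prompt wording, and the sample instance are hypothetical. In particular, `extract_label` is a simple rule-based stand-in, not the ChatGPT-based matching that MMBench describes.

```python
import re
from typing import Optional

OPTION_KEYS = ("A", "B", "C", "D")


def get_options(instance: dict) -> dict:
    """Return the options present in an instance; C and D are optional."""
    return {k: instance[k] for k in OPTION_KEYS if instance.get(k)}


def build_prompt(instance: dict) -> str:
    """Join the optional hint, the question, and the lettered options."""
    parts = []
    if instance.get("hint"):
        parts.append(f"Hint: {instance['hint']}")
    parts.append(f"Question: {instance['question']}")
    for label, text in get_options(instance).items():
        parts.append(f"{label}. {text}")
    parts.append("Answer with the letter of the correct option.")
    return "\n".join(parts)


def extract_label(prediction: str, instance: dict) -> Optional[str]:
    """Heuristically map a free-form prediction to an option label.

    Rule-based stand-in for the ChatGPT matching step; returns None
    when no unambiguous match is found.
    """
    options = get_options(instance)
    text = prediction.strip()
    # Case 1: a bare label such as "B", "B.", "B:" or "(B)" at the start.
    m = re.match(r"^\(?([A-D])(?:[).:]|$)", text)
    if m and m.group(1) in options:
        return m.group(1)
    # Case 2: exactly one option's text appears inside the prediction.
    hits = [k for k, v in options.items() if v.lower() in text.lower()]
    return hits[0] if len(hits) == 1 else None


# Hypothetical two-option instance (no hint, no C or D).
sample = {"question": "Which option names a fruit?", "A": "apple", "B": "granite"}

print(build_prompt(sample))
print(extract_label("The answer is apple.", sample))  # A
print(extract_label("(B)", sample))                   # B
print(extract_label("C", sample))                     # None: C is not an option here
```

Returning `None` for ambiguous or unmatched predictions keeps the heuristic conservative; such cases are where a stronger matcher, like the ChatGPT step described in the card, would be needed.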

Provider: maas
Created: 2025-08-01