MEGA-Bench

Name: MEGA-Bench
Creator: maas
Published: 2026-01-08 18:01:30
License: No description available

ModelScope community | Updated 2026-01-08 | Indexed 2025-02-08

Download link:

https://modelscope.cn/datasets/TIGER-Lab/MEGA-Bench


Resource description:

# MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks [ICLR 2025]

[**🌐 Homepage**](https://tiger-ai-lab.github.io/MEGA-Bench/) | [**🏆 Leaderboard**](https://huggingface.co/spaces/TIGER-Lab/MEGA-Bench) | [**🤗 Dataset**](https://huggingface.co/datasets/TIGER-Lab/MEGA-Bench) | [**🤗 Paper**](https://huggingface.co/papers/2410.10563) | [**🔎 Visualization**](https://github.com/TIGER-AI-Lab/MEGA-Bench/tree/main/task_examples) | [**📖 arXiv**](https://arxiv.org/abs/2410.10563) | [**GitHub**](https://github.com/TIGER-AI-Lab/MEGA-Bench)

## 🔔 News

- [2025-01]: Paper accepted by ICLR 2025.
- [2024-10-18]: Initial release of the evaluation code on [our GitHub repo](https://github.com/TIGER-AI-Lab/MEGA-Bench).
- [2024-10-14]: Paper released on arXiv.

## ❗❗ Data Information

- The HF dataset stores only the file paths of the images/videos. Please download the zipped data [here](https://huggingface.co/datasets/TIGER-Lab/MEGA-Bench/resolve/main/data.zip?download=true).

- We chose not to include images directly in the Parquet files because the Hugging Face Datasets viewer cannot display rows beyond a size limit, which causes visualization failures on some of our tasks. We will provide a visualization page for all tasks to make task navigation and inspection more straightforward. Stay tuned!

- The full MEGA-Bench contains two subsets, as described in our paper:
  1) **Core**: the Core task set (440 tasks), evaluated with a set of highly customized metrics;
  2) **Open**: the Open-ended task set (65 tasks), evaluated by a multimodal LLM with customized per-task evaluation prompts.

  In the default setting of MEGA-Bench, each query is accompanied by a one-shot example that demonstrates the task logic and output format, which is why each row has the ``example_text`` and ``example_media`` columns.

- We also provide two single-image subsets to evaluate models without multi-image support. The evaluation results of these two subsets will be included in the next version of the paper.
  1) **Core Single-image**: the single-image tasks in the standard Core subset (273 tasks).
  2) **Open Single-image**: the single-image tasks in the standard Open-ended subset (42 tasks).

  For the single-image subsets, we do not provide the image for the one-shot example, so the example only demonstrates the desired output format.

- The raw image/video data are collected from various types of resources: self-created, Web screenshots, existing benchmarks/datasets, etc. Please see the full records of data sources in Table 17 of [our paper](https://arxiv.org/abs/2410.10563).

The following code prints the prompt structure and media paths of a task:

```python
from datasets import load_dataset

core_data = load_dataset("TIGER-Lab/MEGA-Bench", "core")


def format_prompt(example):
    # Concatenate the task description, the one-shot example, and the query.
    prompt = ""
    if example['task_description']:
        prompt += f"{example['task_description']}\n"
    if example['example_text']:
        prompt += f"{example['example_text']}\n"
    if example['query_text']:
        prompt += f"{example['query_text']}\n"
    return prompt


def media_path(example):
    media_path = ""
    all_media = []
    # Each media column may hold a single path or a list of paths.
    if example['global_media']:
        if isinstance(example['global_media'], list):
            all_media.extend(example['global_media'])
        else:
            all_media.append(example['global_media'])
    if example['example_media']:
        if isinstance(example['example_media'], list):
            all_media.extend(example['example_media'])
        else:
            all_media.append(example['example_media'])
    if example['query_media']:
        if isinstance(example['query_media'], list):
            all_media.extend(example['query_media'])
        else:
            all_media.append(example['query_media'])

    filtered_media = []
    for item in all_media:
        if item:
            # Strip surrounding quotes and brackets from each path string.
            clean_item = item.strip("'[]")
            if clean_item:
                filtered_media.append(clean_item)
    return media_path + "\n".join(filtered_media)


# Print the prompt and the media paths of one test sample
print(format_prompt(core_data['test'][1]))
print(media_path(core_data['test'][1]))
```

The output of the code is as follows:

```
Identify the brand logo presented in the query image. Also provide the country of origin (i.e., where the company was founded) of the brand.
Remove all spaces and hyphens from the brand name. If the image does not contain a logo, answer NA for both fields.

Demonstration example(s) of the task:
Example 1:
<image>
Example Response:
[PLEASE OUTPUT YOUR REASONING]
Answer: {'brand name': 'RedBull', 'country of origin': 'Austria'}

Answer the new question below. The last part of your response should be of the following format: "Answer: <YOUR ANSWER>" (without angle brackets) where YOUR ANSWER is your answer, following the same task logic and output format of the demonstration example(s). For your answer, do not output additional contents that violate the specified format. Think step by step before answering.
<image>
```

The \<image\> or \<video\> placeholders in the prompt correspond to the following media paths:

```
./data/Knowledge/World_Knowledge/Logo_and_Sign/brand_logo_recognition_and_elaboration/1358914336.jpg
./data/Knowledge/World_Knowledge/Logo_and_Sign/brand_logo_recognition_and_elaboration/2145962231.jpg
```

## Introduction

We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation.

![MEGA-Bench Taxonomy Tree](resources/mega-bench-taxonomy_tree.png)

🎯 **Key features of MEGA-Bench:**

- **505 realistic tasks** encompassing over 8,000 samples from 16 expert annotators
- **Wide range of output formats** including numbers, phrases, code, LaTeX, coordinates, JSON, free-form, etc.
- **Over 40 metrics** developed to evaluate these diverse tasks
- Fine-grained capability report across **multiple dimensions** (e.g., application, input type, output format, skill)
- **Interactive visualization** of model capabilities

Unlike existing benchmarks that unify problems into standard multiple-choice questions, MEGA-Bench embraces the diversity of real-world tasks and their output formats. This allows for a more comprehensive evaluation of vision-language models across various dimensions.

## Evaluation

Follow the instructions on [our GitHub repo](https://github.com/TIGER-AI-Lab/MEGA-Bench) to run the evaluation.

## Contact

For any questions or concerns, please contact:

- Jiacheng Chen: jcchen.work@gmail.com
- Wenhu Chen: wenhuchen@uwaterloo.ca

## Citation

If you find this work useful for your research, please consider citing our paper:

```bibtex
@article{chen2024mega-bench,
  title={MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks},
  author={Chen, Jiacheng and Liang, Tianhao and Siu, Sherman and Wang, Zhengqing and Wang, Kai and Wang, Yubo and Ni, Yuansheng and Zhu, Wang and Jiang, Ziyan and Lyu, Bohan and Jiang, Dongfu and He, Xuan and Liu, Yuan and Hu, Hexiang and Yue, Xiang and Chen, Wenhu},
  journal={arXiv preprint arXiv:2410.10563},
  year={2024},
}
```
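
As a follow-up to the Data Information section: the printed prompt contains `<image>` / `<video>` placeholders, and the media columns hold relative paths under `./data`. The sketch below pairs the two and resolves each path against the directory where `data.zip` was extracted. It assumes that placeholders and paths correspond one-to-one in listing order, which the README suggests but does not state. `pair_placeholders` and `data_root` are hypothetical names introduced for illustration and are not part of the MEGA-Bench code.

```python
import re
from pathlib import Path

# Placeholder tokens as they appear in the printed prompt.
PLACEHOLDER = re.compile(r"<(image|video)>")


def pair_placeholders(prompt, media_paths, data_root="."):
    """Pair each <image>/<video> placeholder with a media path, in order.

    Assumption (not confirmed by the README): placeholders and paths line up
    one-to-one in the order in which they are listed.
    """
    kinds = PLACEHOLDER.findall(prompt)
    if len(kinds) != len(media_paths):
        raise ValueError(
            f"{len(kinds)} placeholders but {len(media_paths)} media paths"
        )
    root = Path(data_root)
    # Resolve every relative path against the extraction directory.
    return [(kind, root / path) for kind, path in zip(kinds, media_paths)]


# Shortened version of the example prompt and its two media paths.
prompt = "Example 1: <image> Example Response: ... Answer the new question below. <image>"
paths = [
    "./data/Knowledge/World_Knowledge/Logo_and_Sign/brand_logo_recognition_and_elaboration/1358914336.jpg",
    "./data/Knowledge/World_Knowledge/Logo_and_Sign/brand_logo_recognition_and_elaboration/2145962231.jpg",
]
for kind, path in pair_placeholders(prompt, paths, data_root="."):
    print(kind, path.name)
```

Raising on a count mismatch is a deliberate choice here: for the single-image subsets the one-shot example image is not provided, so a mismatch is worth surfacing rather than silently truncating.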

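
The example prompt in the README asks models to end their response with `Answer: <YOUR ANSWER>`. The sketch below shows one way such a response could be post-processed. It is not the official evaluation code, which lives in the GitHub repo and uses its own parsing and metrics. `extract_answer` is a hypothetical helper, and treating the answer as a Python literal is an assumption based only on the dictionary shown in the example response.

```python
import ast


def extract_answer(response):
    """Return the content after the last 'Answer:' marker, or None if absent.

    The answer is parsed as a Python literal when possible (e.g. the dict in
    the README example); otherwise the raw string is returned.
    """
    marker = "Answer:"
    idx = response.rfind(marker)  # use the last occurrence in the response
    if idx == -1:
        return None
    text = response[idx + len(marker):].strip()
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Free-form answers such as plain phrases are kept as text.
        return text


response = (
    "The logo shows two bulls in front of a sun.\n"
    "Answer: {'brand name': 'RedBull', 'country of origin': 'Austria'}"
)
answer = extract_answer(response)
print(answer["brand name"], answer["country of origin"])
```

Searching for the last marker rather than the first keeps any "Answer:" text that appears inside the model's reasoning from being mistaken for the final answer.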

Provider:

maas

Created:

2025-02-04
