MMEB-V2
Source: ModelScope community. Updated 2026-05-22; indexed 2025-05-10.
Download link:
https://modelscope.cn/datasets/TIGER-Lab/MMEB-V2
Resource description:
# MMEB-V2 (Massive Multimodal Embedding Benchmark)
[**Website**](https://tiger-ai-lab.github.io/VLM2Vec/) | [**Github**](https://github.com/TIGER-AI-Lab/VLM2Vec) | [**🏆Leaderboard**](https://huggingface.co/spaces/TIGER-Lab/MMEB) | [**📖MMEB-V2/VLM2Vec-V2 Paper**](https://arxiv.org/abs/2507.04590) | [**📖MMEB-V1/VLM2Vec-V1 Paper**](https://arxiv.org/abs/2410.05160)
## Introduction
Building on our original [**MMEB**](https://arxiv.org/abs/2410.05160), **MMEB-V2** expands the evaluation scope with five new tasks. Four are video-based (Video Retrieval, Moment Retrieval, Video Classification, and Video Question Answering), and one targets visual documents (Visual Document Retrieval). Together they allow multimodal embedding models to be evaluated across static, temporal, and structured visual data.
**This Hugging Face repository contains the image and video frames used in MMEB-V2, which need to be downloaded in advance.**
## Guide to All MMEB-V2 Data
**Please review this section carefully for all MMEB-V2–related data.**
- **Image/Video Frames** – Available in this repository.
- **Test File** – Loaded during evaluation from Hugging Face automatically. A comprehensive list of HF paths can be found [here](https://github.com/TIGER-AI-Lab/VLM2Vec/blob/main/src/data/dataset_hf_path.py).
- **Raw Video Files** – In most cases, the video frames are all you need for MMEB evaluation. However, we also provide the raw video files [here](https://huggingface.co/datasets/TIGER-Lab/MMEB_Raw_Video) in case they are needed for specific use cases. Since these files are very large, please download and use them only if necessary.
## 🚀 What's New
- **\[2025.07\]** Release [tech report](https://arxiv.org/abs/2507.04590).
- **\[2025.05\]** Initial release of MMEB-V2/VLM2Vec-V2.
## Dataset Overview
We present an overview of the MMEB-V2 dataset below:
<img width="900" alt="abs" src="overview.png">
## Dataset Structure
The directory structure of this Hugging Face repository is shown below.
For video tasks, we provide sampled frames in this repo. For image tasks, we provide the raw images.
Files from each meta-task are packed into a single archive, resulting in six archives. For example, ``video_cls.tar.gz`` contains the sampled frames for the video classification task.
```
→ video-tasks/
├── frames/
│   ├── video_cls.tar.gz
│   ├── video_qa.tar.gz
│   ├── video_ret.tar.gz
│   └── video_mret.tar.gz
→ image-tasks/
├── mmeb_v1.tar.gz
└── visdoc.tar.gz
```
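For reference, the six archive paths in the tree above can be turned into direct download URLs suitable for ``wget``. The following is a minimal sketch, not an official script: it assumes the repository id is ``TIGER-Lab/MMEB-V2``, that the files live on the ``main`` branch, and that the standard Hugging Face ``resolve`` URL layout applies. The names ``ARCHIVES``, ``REPO_ID``, ``REVISION``, and ``archive_urls`` are hypothetical.

```python
# Archive paths exactly as listed in the repository tree above.
ARCHIVES = [
    "video-tasks/frames/video_cls.tar.gz",
    "video-tasks/frames/video_qa.tar.gz",
    "video-tasks/frames/video_ret.tar.gz",
    "video-tasks/frames/video_mret.tar.gz",
    "image-tasks/mmeb_v1.tar.gz",
    "image-tasks/visdoc.tar.gz",
]

# Assumptions: repository id and branch name; adjust them if they differ.
REPO_ID = "TIGER-Lab/MMEB-V2"
REVISION = "main"


def archive_urls(repo_id: str = REPO_ID, revision: str = REVISION) -> list[str]:
    """Build one download URL per archive using the Hub's resolve/ URL layout."""
    base = f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}"
    return [f"{base}/{path}" for path in ARCHIVES]


# Print the URLs, one per line, so they can be saved to a file for wget -i.
for url in archive_urls():
    print(url)
```

Downloading only the archives for the meta-tasks you plan to evaluate keeps the transfer small, since each archive is independent of the others.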
After downloading and extracting these archives locally, you can organize them as shown below. (You may choose to use ``Git LFS`` or ``wget`` for downloading.)
Then, simply specify the correct file path in the configuration file used by your code.
```
→ MMEB
├── video-tasks/
│   └── frames/
│       ├── video_cls/
│       │   ├── UCF101/
│       │   │   └── video_1/              # video ID
│       │   │       ├── frame1.png        # frame from video_1
│       │   │       ├── frame2.png
│       │   │       └── ...
│       │   ├── HMDB51/
│       │   ├── Breakfast/
│       │   └── ...                       # other datasets from video classification category
│       ├── video_qa/
│       │   └── ...                       # video QA datasets
│       ├── video_ret/
│       │   └── ...                       # video retrieval datasets
│       └── video_mret/
│           └── ...                       # moment retrieval datasets
├── image-tasks/
│   ├── mmeb_v1/
│   │   ├── OK-VQA/
│   │   │   ├── image1.png
│   │   │   ├── image2.png
│   │   │   └── ...
│   │   ├── ImageNet-1K/
│   │   └── ...                           # other datasets from MMEB-V1 category
│   └── visdoc/
│       └── ...                           # visual document retrieval datasets
```
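The extraction step that produces the layout above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the project's own tooling: it assumes each archive unpacks into a folder named after the archive (for example ``video_cls/``) and that frames follow the ``frameN.png`` naming shown in the tree. The helper names ``extract_archives`` and ``sorted_frames`` are hypothetical.

```python
import re
import tarfile
from pathlib import Path


def extract_archives(download_root: Path, target_root: Path) -> list[Path]:
    """Extract every *.tar.gz under download_root into the mirrored
    sub-folder of target_root, and return the destination folders."""
    destinations = []
    for archive in sorted(download_root.rglob("*.tar.gz")):
        # Mirror the repository layout, e.g. video-tasks/frames/.
        dest = target_root / archive.parent.relative_to(download_root)
        dest.mkdir(parents=True, exist_ok=True)
        # Only extract archives obtained from a source you trust.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
        destinations.append(dest)
    return destinations


def sorted_frames(video_dir: Path) -> list[Path]:
    """Return the PNG frames of one video in numeric order, so that
    frame2.png sorts before frame10.png."""
    def frame_index(path: Path) -> int:
        # Assumption: the file stem ends with the frame number.
        match = re.search(r"(\d+)$", path.stem)
        return int(match.group(1)) if match else -1

    return sorted(video_dir.glob("*.png"), key=frame_index)
```

A typical call would be ``extract_archives(Path("downloads"), Path("MMEB"))``, followed by ``sorted_frames(Path("MMEB/video-tasks/frames/video_cls/UCF101/video_1"))``. Sorting by the numeric suffix rather than by filename matters because a plain lexicographic sort would place ``frame10.png`` before ``frame2.png`` and scramble the temporal order.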
Provider:
maas
Created:
2025-05-08
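Dataset introduction

Background overview
MMEB-V2 is a large-scale multimodal embedding benchmark dataset for evaluating multimodal models on static images, temporal video, and structured visual documents. It extends the previous version with five new tasks, including video retrieval and video classification, and provides video frames and image files. The total size is about 157.33 GB. It was developed by TIGER-Lab and is released under the Apache-2.0 license.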



