M-BEIR
Hosted on ModelScope (魔搭社区). Updated 2026-05-10; indexed 2024-06-01.
Download link: https://modelscope.cn/datasets/TIGER-Lab/M-BEIR
### **UniIR: Training and Benchmarking Universal Multimodal Information Retrievers** (ECCV 2024)
[**🌐 Homepage**](https://tiger-ai-lab.github.io/UniIR/) | [**🤗 Model(UniIR Checkpoints)**](https://huggingface.co/TIGER-Lab/UniIR) | [**🤗 Paper**](https://huggingface.co/papers/2311.17136) | [**📖 arXiv**](https://arxiv.org/pdf/2311.17136.pdf) | [**GitHub**](https://github.com/TIGER-AI-Lab/UniIR)
<a href="#install-git-lfs" style="color: red;">How to download the M-BEIR Dataset</a>
## 🔔News
- **🔥[2023-12-21]: Our M-BEIR Benchmark is now available for use.**
## **Dataset Summary**
**M-BEIR**, the **M**ultimodal **BE**nchmark for **I**nstructed **R**etrieval, is a comprehensive large-scale retrieval benchmark designed to train and evaluate unified multimodal retrieval models (**UniIR models**).
The M-BEIR benchmark comprises eight multimodal retrieval tasks and ten datasets from a variety of domains and sources.
Each task is accompanied by human-authored instructions, encompassing 1.5 million queries and a pool of 5.6 million retrieval candidates in total.
## **Dataset Structure Overview**
The M-BEIR dataset is structured into five primary components: Query Data, Candidate Pool, Instructions, Qrels, and Images.
### Query Data
Below is the directory structure for the query data:
```
query/
│
├── train/
│   ├── mbeir_cirr_train.jsonl
│   ├── mbeir_edis_train.jsonl
│   ...
├── union_train/
│   └── mbeir_union_up_train.jsonl
├── val/
│   ├── mbeir_visualnews_task0_val.jsonl
│   ├── mbeir_visualnews_task3_val.jsonl
│   ...
└── test/
    ├── mbeir_visualnews_task0_test.jsonl
    ├── mbeir_visualnews_task3_test.jsonl
    ...
```
`train`: Contains all the training data from 8 different datasets formatted in the M-BEIR style.
`mbeir_union_up_train.jsonl`: This file is the default training data for in-batch contrastive training specifically designed for UniIR models.
It aggregates all the data from the `train` directory; relatively small datasets have been upsampled to balance the training process.
`val`: Contains separate files for validation queries, organized by task.
`test`: Contains separate files for test queries, organized by task.
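To make the idea of upsampling smaller datasets concrete, here is a minimal Python sketch. The policy it uses (repeat each dataset until it matches the largest one) is an assumption chosen for illustration; the actual factors used to build `mbeir_union_up_train.jsonl` are not stated here, and the function name `upsample_union` is hypothetical.

```python
import math
import random

def upsample_union(datasets, seed=0):
    """Merge datasets, repeating smaller ones until each matches the largest.

    `datasets` maps a dataset name to a list of query records.
    This is an illustrative policy, not the one used to build M-BEIR.
    """
    rng = random.Random(seed)
    target = max(len(items) for items in datasets.values())
    union = []
    for name, items in datasets.items():
        if not items:
            continue  # nothing to repeat
        repeats = math.ceil(target / len(items))
        union.extend((items * repeats)[:target])
    rng.shuffle(union)  # mix datasets so batches are not grouped by source
    return union

# Toy data: dataset "b" is three times smaller than dataset "a"
toy = {"a": [1, 2, 3, 4, 5, 6], "b": ["x", "y"]}
union = upsample_union(toy)
print(len(union), union.count("x"), union.count("y"))
```

In this toy run both datasets contribute six items each, so each record of the smaller dataset appears three times.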
Every M-BEIR query instance has at least one positive candidate and may have no negative candidates.
Each line in a Query Data file represents a unique query. The structure of each query JSON object is as follows:
```json
{
  "qid": "A unique identifier formatted as {dataset_id}:{query_id}",
  "query_txt": "The text component of the query",
  "query_img_path": "The file path to the associated query image",
  "query_modality": "The modality type of the query (text, image or text,image)",
  "query_src_content": "Additional content from the original dataset, presented as a string by json.dumps()",
  "pos_cand_list": [
    {
      "did": "A unique identifier formatted as {dataset_id}:{doc_id}"
    }
    // ... more positive candidates
  ],
  "neg_cand_list": [
    {
      "did": "A unique identifier formatted as {dataset_id}:{doc_id}"
    }
    // ... more negative candidates
  ]
}
```
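As a rough illustration of how one line of a query file could be read, the Python sketch below parses a record, splits `qid` into its two parts, and decodes `query_src_content` a second time because it is stored as a JSON string. The helper name `parse_query_line` and the example record are made up; the sketch also accepts candidate lists that hold bare `did` strings, which is an assumption beyond the structure shown above.

```python
import json

def parse_query_line(line):
    """Parse one JSONL line of an M-BEIR query file into a flat dict."""
    query = json.loads(line)
    # qid is formatted as {dataset_id}:{query_id}
    dataset_id, query_id = query["qid"].split(":", 1)
    # query_src_content was produced by json.dumps(), so decode it again
    src = query.get("query_src_content")

    def dids(cands):
        # Accept {"did": ...} objects or bare did strings (assumption)
        return [c["did"] if isinstance(c, dict) else c for c in (cands or [])]

    return {
        "qid": query["qid"],
        "dataset_id": dataset_id,
        "query_id": query_id,
        "modality": query.get("query_modality"),
        "text": query.get("query_txt"),
        "image_path": query.get("query_img_path"),
        "src": json.loads(src) if src else None,
        "pos_dids": dids(query.get("pos_cand_list")),
        "neg_dids": dids(query.get("neg_cand_list")),
    }

# Made-up record that follows the documented fields
example_line = json.dumps({
    "qid": "9:1",
    "query_txt": "a red bicycle",
    "query_img_path": None,
    "query_modality": "text",
    "query_src_content": json.dumps({"note": "made-up"}),
    "pos_cand_list": [{"did": "9:42"}],
    "neg_cand_list": [],
})
parsed = parse_query_line(example_line)
print(parsed["dataset_id"], parsed["pos_dids"], parsed["neg_dids"])
```

For a real file, the same function can be applied to each line read from, for example, `query/val/mbeir_visualnews_task0_val.jsonl`.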
### Candidate Pool
The Candidate Pool contains potential matching documents for the queries.
#### M-BEIR_5.6M
Within the global directory, the default retrieval setting requires models to retrieve positive candidates from a heterogeneous pool encompassing various modalities and domains.
The M-BEIR's global candidate pool, comprising 5.6 million candidates, includes the retrieval corpus from all tasks and datasets.
#### M-BEIR_local
Within the local directory, we provide dataset-task-specific pools as M-BEIR_local. Each dataset-task-specific pool contains homogeneous candidates that originate from the original dataset.
Below is the directory structure for the candidate pool:
```
cand_pool/
│
├── global/
│   ├── mbeir_union_val_cand_pool.jsonl
│   └── mbeir_union_test_cand_pool.jsonl
│
└── local/
    ├── mbeir_visualnews_task0_cand_pool.jsonl
    ├── mbeir_visualnews_task3_cand_pool.jsonl
    ...
```
The structure of each candidate JSON object in a `cand_pool` file is as follows:
```json
{
  "did": "A unique identifier for the document, formatted as {dataset_id}:{doc_id}",
  "txt": "The text content of the candidate document",
  "img_path": "The file path to the candidate document's image",
  "modality": "The modality type of the candidate (e.g., text, image or text,image)",
  "src_content": "Additional content from the original dataset, presented as a string by json.dumps()"
}
```
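Because queries refer to candidates only by `did`, a lookup table keyed on `did` is a natural way to resolve them. The following Python sketch builds such a table from JSONL lines; the function name `load_cand_pool` and the two example candidates are hypothetical. Holding the full 5.6M global pool in memory this way may be costly, so treat it as a sketch for local pools or subsets.

```python
import io
import json

def load_cand_pool(lines):
    """Build a did -> candidate dict from an iterable of JSONL lines."""
    pool = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cand = json.loads(line)
        pool[cand["did"]] = cand
    return pool

# Made-up candidates standing in for an open cand_pool file
fake_file = io.StringIO(
    json.dumps({"did": "9:42", "txt": "a red bicycle", "img_path": None,
                "modality": "text", "src_content": "{}"}) + "\n" +
    json.dumps({"did": "9:43", "txt": None, "img_path": "mbeir_images/example.jpg",
                "modality": "image", "src_content": "{}"}) + "\n"
)
pool = load_cand_pool(fake_file)

# Resolve the positive candidates of a query by their did
pos_dids = ["9:42"]
positives = [pool[did] for did in pos_dids if did in pool]
print(len(pool), positives[0]["modality"])
```

With a real pool, `fake_file` would be replaced by a file opened from `cand_pool/local/` or `cand_pool/global/`.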
### Instructions
`query_instructions.tsv` contains the human-authored instructions used within the UniIR framework. Each task is accompanied by four human-authored instructions. For detailed usage, please refer to [**GitHub Repo**](https://github.com/TIGER-AI-Lab/UniIR).
### Qrels
Within the `qrels` directory, you will find qrels for both the validation and test sets. These files serve the purpose of evaluating UniIR models. For detailed information, please refer to [**GitHub Repo**](https://github.com/TIGER-AI-Lab/UniIR).
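To show how relevance judgments could feed an evaluation, here is a small Python sketch of a hit-based Recall@k. It assumes the qrels have already been loaded into a dict mapping each `qid` to a set of relevant `did`s, and that a retrieval run maps each `qid` to a ranked list of `did`s; the on-disk qrels format is not described here, and the evaluation pipeline in the GitHub repo remains the reference.

```python
def recall_at_k(ranked_dids, relevant_dids, k):
    """Return 1.0 if any relevant did is in the top-k results, else 0.0."""
    return 1.0 if any(d in relevant_dids for d in ranked_dids[:k]) else 0.0

def mean_recall_at_k(run, qrels, k):
    """Average hit-based Recall@k over every query listed in qrels."""
    scores = [recall_at_k(run.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Toy judgments and a toy retrieval run (all ids are made up)
qrels = {"1:1": {"1:10"}, "1:2": {"1:20"}}
run = {
    "1:1": ["1:11", "1:10", "1:12"],
    "1:2": ["1:21", "1:22", "1:20"],
}
for k in (1, 2, 3):
    print(k, mean_recall_at_k(run, qrels, k))
```

Queries missing from the run count as misses, which keeps the score comparable across systems that fail on different queries.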
## **How to Use**
### Downloading the M-BEIR Dataset
<a name="install-git-lfs"></a>
#### Step 1: Install Git Large File Storage (LFS)
Before you begin, ensure that **Git LFS** is installed on your system. Git LFS is essential for handling large data files. If you do not have Git LFS installed, follow these steps:
Download and install Git LFS from the official website.
After installation, run the following command in your terminal to initialize Git LFS:
```
git lfs install
```
#### Step 2: Clone the M-BEIR Dataset Repository
Once Git LFS is set up, you can clone the M-BEIR repository. Open your terminal and execute the following command:
```
git clone https://huggingface.co/datasets/TIGER-Lab/M-BEIR
```
Please note that the M-BEIR dataset is quite large, and downloading it can take several hours, depending on your internet connection.
During this time, your terminal may not show much activity. The terminal might appear stuck, but if there's no error message, the download process is still ongoing.
### Decompressing M-BEIR Images
After downloading, you will need to decompress the image files. Follow these steps in your terminal:
```bash
# Navigate to the M-BEIR directory
cd path/to/M-BEIR
# Combine the split tar.gz files into one
sh -c 'cat mbeir_images.tar.gz.part-00 mbeir_images.tar.gz.part-01 mbeir_images.tar.gz.part-02 mbeir_images.tar.gz.part-03 > mbeir_images.tar.gz'
# Extract the images from the tar.gz file
tar -xzf mbeir_images.tar.gz
```
Now, you are ready to use the M-BEIR benchmark.
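On systems without `cat` and `tar`, the same two steps can be sketched in Python with the standard library. The function names below are hypothetical, and the default file names simply mirror the ones used in the shell commands above; the parts are joined in sorted name order.

```python
import shutil
import tarfile
from pathlib import Path

def combine_parts(directory, prefix="mbeir_images.tar.gz.part-",
                  output="mbeir_images.tar.gz"):
    """Concatenate the split archive parts, in name order, into one file."""
    directory = Path(directory)
    parts = sorted(directory.glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {directory}")
    out_path = directory / output
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams, so parts never sit fully in memory
    return out_path

def extract_archive(archive_path, dest):
    """Extract a .tar.gz archive into dest, creating dest if needed."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return dest

# Usage (assumes the part files are in the current directory):
# archive = combine_parts(".")
# extract_archive(archive, ".")
```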
### Dataloader and Evaluation Pipeline
We offer a dedicated dataloader and evaluation pipeline for the M-BEIR benchmark. Please refer to [**GitHub Repo**](https://github.com/TIGER-AI-Lab/UniIR) for detailed information.
## **Citation**
Please cite our paper if you use our data, model or code.
```
@article{wei2023uniir,
title={UniIR: Training and Benchmarking Universal Multimodal Information Retrievers},
author={Wei, Cong and Chen, Yang and Chen, Haonan and Hu, Hexiang and Zhang, Ge and Fu, Jie and Ritter, Alan and Chen, Wenhu},
journal={arXiv preprint arXiv:2311.17136},
year={2023}
}
```
Provider: maas
Created: 2024-05-29