mmE5-MMEB-hardneg
ModelScope community; updated 2026-04-28; indexed 2025-02-15
Download link:
https://modelscope.cn/datasets/intfloat/mmE5-MMEB-hardneg
Resource description:
# mmE5 Labeled Data
This dataset contains the datasets used for the supervised fine-tuning of mmE5 ([mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data](https://arxiv.org/abs/2502.08468)):
- **MMEB** (with hard negatives)
- **InfoSeek** (from M-BEIR)
- **TAT-DQA**
- **ArxivQA**
[GitHub](https://github.com/haon-chen/mmE5)
## Image Preparation
First, you should prepare the images used for training:
### Image Downloads
- **Download All Images Used in mmE5**:
You can use the script provided in our [source code](https://github.com/haon-chen/mmE5) to download all images used in mmE5.
```bash
git clone https://github.com/haon-chen/mmE5.git
cd mmE5
bash scripts/prepare_images.sh
```
### Image Organization
```
images/
├── mbeir_images/
│   └── oven_images/
│       └── ... .jpg (InfoSeek)
├── ArxivQA/
│   └── images/
│       └── ... .jpg (ArxivQA)
├── TAT-DQA/
│   └── ... .png (TAT-DQA)
├── A-OKVQA/
│   └── Train/
│       └── ... .jpg (A-OKVQA)
└── ... (MMEB Training images)
```
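After downloading, it can help to confirm that the layout above is in place before starting training. The following is a minimal sketch, not part of the mmE5 scripts: the helper name `missing_image_dirs` and the `EXPECTED_SUBDIRS` list are illustrative assumptions, and only the directories explicitly named in the tree are checked (the remaining MMEB folders are elided there and are not guessed).

```python
from pathlib import Path

# Sub-directories named in the layout above. A-OKVQA is the only MMEB
# folder spelled out in the tree; the other MMEB folders are not listed.
EXPECTED_SUBDIRS = [
    "mbeir_images/oven_images",  # InfoSeek
    "ArxivQA/images",            # ArxivQA
    "TAT-DQA",                   # TAT-DQA
    "A-OKVQA/Train",             # A-OKVQA (part of MMEB)
]


def missing_image_dirs(image_root):
    """Return the expected sub-directories that do not exist under image_root."""
    root = Path(image_root)
    return [sub for sub in EXPECTED_SUBDIRS if not (root / sub).is_dir()]


if __name__ == "__main__":
    missing = missing_image_dirs("images")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All listed image directories are present.")
```

The check only looks at directory existence; it does not verify that every image referenced by the dataset has been downloaded.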
You can refer to the image paths in each subset to view the image organization.
You can also customize your image paths by altering the `image_path` fields.
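As an illustration of customizing the paths, the sketch below re-roots every field whose name contains `image_path` under a different directory. It rests on several assumptions that should be checked against the actual subsets: that the stored paths are relative, that a field may hold either one path or a list of paths, and that an empty string means "no image". The function name `remap_image_paths` and the field names in the example record are hypothetical.

```python
import os


def remap_image_paths(record, new_root):
    """Return a copy of record with every *image_path* field placed under new_root.

    Fields are matched by the substring "image_path". Values may be a single
    relative path or a list of relative paths; empty values are left unchanged.
    """
    def reroot(path):
        return os.path.join(new_root, path) if path else path

    updated = dict(record)
    for key, value in record.items():
        if "image_path" not in key:
            continue
        if isinstance(value, str):
            updated[key] = reroot(value)
        elif isinstance(value, (list, tuple)):
            updated[key] = [reroot(p) for p in value]
    return updated


# Hypothetical record; the real field names may differ per subset.
example = {
    "qry": "What is shown in the image?",
    "qry_image_path": "A-OKVQA/Train/example.jpg",
    "neg_image_path": ["", "A-OKVQA/Train/other.jpg"],
}
print(remap_image_paths(example, "/data/mmE5/images"))
```

If the subsets are loaded with the Hugging Face `datasets` library, a function of this shape could be applied per record through `Dataset.map`.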
## Citation
If you use this dataset in your research, please cite the associated paper.
```
@article{chen2025mmE5,
title={mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data},
author={Chen, Haonan and Wang, Liang and Yang, Nan and Zhu, Yutao and Zhao, Ziliang and Wei, Furu and Dou, Zhicheng},
journal={arXiv preprint arXiv:2502.08468},
year={2025}
}
```
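# mmE5 Labeled Dataset
This dataset contains the sub-datasets used for the supervised fine-tuning of mmE5 ([mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data](https://arxiv.org/abs/2502.08468)):
- **MMEB** (with hard negatives)
- **InfoSeek** (from M-BEIR)
- **TAT-DQA**
- **ArxivQA**
[GitHub repository](https://github.com/haon-chen/mmE5)
## Image Preparation
First, prepare the images required for training:
### Image Downloads
- **Download links**: the images for each dataset are available at the following links:
  - [**MMEB**](https://huggingface.co/datasets/TIGER-Lab/MMEB-train)
  - [**InfoSeek**](https://huggingface.co/datasets/TIGER-Lab/M-BEIR)
  - [**ArxivQA**](https://huggingface.co/datasets/MMInstruction/ArxivQA)
  - [**TAT-DQA**](https://huggingface.co/datasets/vidore/tatdqa_train/tree/main)
For TAT-DQA, first save the images to the shared image directory so that they follow the same layout: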
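```python
import os

from datasets import load_dataset

# Load the TAT-DQA training split from the Hugging Face Hub.
dataset = load_dataset(
    "vidore/tatdqa_train",
    split="train"
)

# Write every sample image into the shared image directory.
image_out_dir = "images/TAT-DQA"
os.makedirs(image_out_dir, exist_ok=True)

for i, sample in enumerate(dataset):
    save_path = os.path.join(image_out_dir, f"tatdqa_{i}.png")
    # Skip files that already exist so the script can be re-run safely.
    if os.path.exists(save_path):
        continue
    image = sample["image"]
    image.save(save_path, format="PNG")
```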
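Because the loop above skips files that already exist, it can be re-run after an interruption. A small check like the following can report which images are still missing; this is a sketch only, the helper name `missing_tatdqa_indices` is an assumption, and `num_samples` would typically be `len(dataset)`.

```python
import os


def missing_tatdqa_indices(image_dir, num_samples):
    """Return the sample indices whose tatdqa_{i}.png file is absent from image_dir."""
    return [
        i for i in range(num_samples)
        if not os.path.exists(os.path.join(image_dir, f"tatdqa_{i}.png"))
    ]
```

An empty list means every expected file name is present; it does not confirm that the files are valid images.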
### Image Organization
```
images/
├── mbeir_images/
│   └── oven_images/
│       └── ... .jpg (InfoSeek images)
├── ArxivQA/
│   └── images/
│       └── ... .jpg (ArxivQA images)
├── TAT-DQA/
│   └── ... .png (TAT-DQA images)
├── A-OKVQA/
│   └── Train/
│       └── ... .jpg (A-OKVQA images)
└── ... (MMEB training images)
```
You can refer to the image paths in each subset to see how the images are organized.
You can also customize the image paths by modifying the `image_path` fields.
Provider:
maas
Created:
2025-02-14
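Background and Challenges
Background Overview
This dataset is used for the supervised fine-tuning of mmE5, a multimodal multilingual embedding model. It comprises four labeled subsets: MMEB, InfoSeek, TAT-DQA, and ArxivQA. The associated images must be prepared as described above before use.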



