mmE5-MMEB-hardneg

Name: mmE5-MMEB-hardneg
Creator: maas
Published: 2026-04-28 16:23:03
License: Not specified

ModelScope Community. Updated: 2026-04-28. Indexed: 2025-02-15.

Download link:

https://modelscope.cn/datasets/intfloat/mmE5-MMEB-hardneg


Resource overview:

# mmE5 Labeled Data

This dataset contains datasets used for the supervised finetuning of mmE5 ([mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data](https://arxiv.org/abs/2502.08468)):

- **MMEB** (with hard negative)
- **InfoSeek** (from M-BEIR)
- **TAT-DQA**
- **ArxivQA**

[Github](https://github.com/haon-chen/mmE5)

## Image Preparation

First, you should prepare the images used for training:

### Image Downloads

- **Download All Images Used in mmE5**: You can use the script provided in our [source code](https://github.com/haon-chen/mmE5) to download all images used in mmE5.

```bash
git clone https://github.com/haon-chen/mmE5.git
cd mmE5
bash scripts/prepare_images.sh
```

### Image Organization

```
images/
├── mbeir_images/
│   └── oven_images/
│       └── ... .jpg (InfoSeek)
├── ArxivQA/
│   └── images/
│       └── ... .jpg (ArxivQA)
├── TAT-DQA/
│   └── ... .png (TAT-DQA)
├── A-OKVQA/
│   └── Train/
│       └── ... .jpg (A-OKVQA)
└── ... (MMEB Training images)
```

You can refer to the image paths in each subset to view the image organization.

You can also customize your image paths by altering the image_path fields.

## Citation

If you use this dataset in your research, please cite the associated paper.

```
@article{chen2025mmE5,
  title={mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data},
  author={Chen, Haonan and Wang, Liang and Yang, Nan and Zhu, Yutao and Zhao, Ziliang and Wei, Furu and Dou, Zhicheng},
  journal={arXiv preprint arXiv:2502.08468},
  year={2025}
}
```

## Image Downloads by Dataset

The images for each dataset are also available from the following sources:

- [**MMEB**](https://huggingface.co/datasets/TIGER-Lab/MMEB-train)
- [**InfoSeek**](https://huggingface.co/datasets/TIGER-Lab/M-BEIR)
- [**ArxivQA**](https://huggingface.co/datasets/MMInstruction/ArxivQA)
- [**TAT-DQA**](https://huggingface.co/datasets/vidore/tatdqa_train/tree/main)

For TAT-DQA, first save the images to a common image directory so they can be referenced consistently:

```python
import os

from datasets import load_dataset

# Load the TAT-DQA training split from the Hugging Face Hub.
dataset = load_dataset("vidore/tatdqa_train", split="train")

image_out_dir = "images/TAT-DQA"
os.makedirs(image_out_dir, exist_ok=True)

for i, sample in enumerate(dataset):
    save_path = os.path.join(image_out_dir, f"tatdqa_{i}.png")
    # Skip images saved by an earlier run so the loop can be resumed.
    if os.path.exists(save_path):
        continue
    image = sample["image"]
    image.save(save_path, format="PNG")
```

The saved files follow the `images/` layout shown above (`images/TAT-DQA/... .png`). You can refer to the image paths in each subset to view the image organization, or customize the paths by altering the `image_path` fields.
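
Once the images are in place, the directory layout described above can be checked before training. The following is a minimal sketch, not part of the mmE5 repository: the helper name `count_images` and the set of accepted file extensions are assumptions made for illustration, while the sub-directory names are taken from the layout in the README. It only covers the four directories the README names explicitly, not the remaining MMEB training folders.

```python
from pathlib import Path

# Sub-directories named in the README's layout, relative to the image root.
EXPECTED_SUBDIRS = {
    "InfoSeek": "mbeir_images/oven_images",
    "ArxivQA": "ArxivQA/images",
    "TAT-DQA": "TAT-DQA",
    "A-OKVQA": "A-OKVQA/Train",
}

# Assumption: only these extensions count as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(image_root):
    """Return {subset: number of image files found}; missing directories count as 0."""
    root = Path(image_root)
    counts = {}
    for subset, rel_dir in EXPECTED_SUBDIRS.items():
        directory = root / rel_dir
        if not directory.is_dir():
            counts[subset] = 0
            continue
        # Walk the directory recursively and count files with an image extension.
        counts[subset] = sum(
            1
            for path in directory.rglob("*")
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


if __name__ == "__main__":
    for subset, n in count_images("images").items():
        status = "ok" if n > 0 else "MISSING"
        print(f"{subset:10s} {n:8d} {status}")
```

Running the script from the directory that contains `images/` prints one line per subset; a count of zero suggests that the corresponding download step has not completed or that the images were placed under a different path.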

Provider:

maas

Created:

2025-02-14
