mmE5-MMEB-hardneg

Source: ModelScope community. Updated: 2026-04-28. Indexed: 2025-02-15.
Download link:
https://modelscope.cn/datasets/intfloat/mmE5-MMEB-hardneg
Resource description:
# mmE5 Labeled Data

This dataset contains the labeled datasets used for the supervised finetuning of mmE5 ([mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data](https://arxiv.org/abs/2502.08468)):

- **MMEB** (with hard negatives)
- **InfoSeek** (from M-BEIR)
- **TAT-DQA**
- **ArxivQA**

[GitHub](https://github.com/haon-chen/mmE5)

## Image Preparation

First, prepare the images used for training.

### Image Downloads

- **Download All Images Used in mmE5**:

You can use the script provided in our [source code](https://github.com/haon-chen/mmE5) to download all images used in mmE5.

```bash
git clone https://github.com/haon-chen/mmE5.git
cd mmE5
bash scripts/prepare_images.sh
```

### Image Organization

```
images/
├── mbeir_images/
│   └── oven_images/
│       └── ... .jpg (InfoSeek)
├── ArxivQA/
│   └── images/
│       └── ... .jpg (ArxivQA)
├── TAT-DQA/
│   └── ... .png (TAT-DQA)
├── A-OKVQA/
│   └── Train/
│       └── ... .jpg (A-OKVQA)
└── ... (MMEB Training images)
```

You can refer to the image paths in each subset to view the image organization.

You can also customize your image paths by altering the image_path fields.

## Citation

If you use this dataset in your research, please cite the associated paper.

```
@article{chen2025mmE5,
  title={mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data},
  author={Chen, Haonan and Wang, Liang and Yang, Nan and Zhu, Yutao and Zhao, Ziliang and Wei, Furu and Dou, Zhicheng},
  journal={arXiv preprint arXiv:2502.08468},
  year={2025}
}
```
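The layout under Image Organization can be checked before training starts. The following is a minimal sketch, not part of the mmE5 repository: the helper name `missing_image_dirs` and the list `EXPECTED_SUBDIRS` are hypothetical, and the list covers only the sub-directories named explicitly in the tree above (the remaining MMEB training folders are not enumerated there).

```python
from pathlib import Path

# Sub-directories named explicitly in the Image Organization tree.
# Other MMEB training folders are not listed in the README, so they are not checked.
EXPECTED_SUBDIRS = [
    "mbeir_images/oven_images",  # InfoSeek
    "ArxivQA/images",            # ArxivQA
    "TAT-DQA",                   # TAT-DQA
    "A-OKVQA/Train",             # A-OKVQA (one of the MMEB training sets)
]


def missing_image_dirs(root="images"):
    """Return the expected sub-directories that are absent or empty under root."""
    root = Path(root)
    missing = []
    for sub in EXPECTED_SUBDIRS:
        path = root / sub
        # A directory counts as present only if it exists and holds at least one entry.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(sub)
    return missing


if __name__ == "__main__":
    for sub in missing_image_dirs("images"):
        print(f"missing or empty: images/{sub}")
```

Running the script from the directory that contains `images/` prints one line per folder that still needs to be downloaded, and prints nothing when all four folders are populated.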

# mmE5 Labeled Dataset

This dataset contains the sub-datasets used for the supervised fine-tuning of mmE5 ([mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data](https://arxiv.org/abs/2502.08468)):

- **MMEB** (with hard negatives)
- **InfoSeek** (from M-BEIR)
- **TAT-DQA**
- **ArxivQA**

[GitHub repository](https://github.com/haon-chen/mmE5)

## Image Preparation

First, prepare the image data needed for training.

### Image Downloads

- **Download links**: The images for each dataset are available from the following links:
  - [**MMEB**](https://huggingface.co/datasets/TIGER-Lab/MMEB-train)
  - [**InfoSeek**](https://huggingface.co/datasets/TIGER-Lab/M-BEIR)
  - [**ArxivQA**](https://huggingface.co/datasets/MMInstruction/ArxivQA)
  - [**TAT-DQA**](https://huggingface.co/datasets/vidore/tatdqa_train/tree/main)

For TAT-DQA, first save the images to a common image directory so that they can be referenced consistently:

```python
import os

from datasets import load_dataset

# Load the TAT-DQA training split from the Hugging Face Hub.
dataset = load_dataset(
    "vidore/tatdqa_train",
    split="train"
)

image_out_dir = "images/TAT-DQA"
os.makedirs(image_out_dir, exist_ok=True)

for i, sample in enumerate(dataset):
    save_path = os.path.join(image_out_dir, f"tatdqa_{i}.png")
    # Skip images that were already saved by an earlier run.
    if os.path.exists(save_path):
        continue
    image = sample["image"]
    image.save(save_path, format="PNG")
```

### Image Organization

```
images/
├── mbeir_images/
│   └── oven_images/
│       └── ... .jpg (InfoSeek images)
├── ArxivQA/
│   └── images/
│       └── ... .jpg (ArxivQA images)
├── TAT-DQA/
│   └── ... .png (TAT-DQA images)
├── A-OKVQA/
│   └── Train/
│       └── ... .jpg (A-OKVQA images)
└── ... (MMEB training images)
```

You can refer to the image paths in each subset to see how the images are organized.

You can also customize the image paths by changing the `image_path` fields.

## Citation

If you use this dataset in your research, please cite the associated paper:

```
@article{chen2025mmE5,
  title={mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data},
  author={Chen, Haonan and Wang, Liang and Yang, Nan and Zhu, Yutao and Zhao, Ziliang and Wei, Furu and Dou, Zhicheng},
  journal={arXiv preprint arXiv:2502.08468},
  year={2025}
}
```
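The note about customizing the `image_path` fields could be applied with a small prefix rewrite. The sketch below is illustrative only: the helper `remap_image_paths`, the example record, and its field names (`qry_image_path`, `neg_image_path`) are assumptions, not a confirmed schema. It assumes that path fields have keys ending in `image_path` and hold either a string or a list of strings; check the actual column names in each subset before using it.

```python
def remap_image_paths(record, old_prefix, new_prefix):
    """Return a copy of record with old_prefix replaced by new_prefix in path fields.

    Only keys ending in "image_path" are touched (an assumption about the schema).
    """
    def swap(value):
        # Rewrite a single path string if it starts with the old prefix.
        if isinstance(value, str) and value.startswith(old_prefix):
            return new_prefix + value[len(old_prefix):]
        # Fields such as hard negatives may hold a list of paths.
        if isinstance(value, list):
            return [swap(v) for v in value]
        return value

    return {
        key: swap(value) if key.endswith("image_path") else value
        for key, value in record.items()
    }


# Hypothetical record; real field names and values may differ.
example = {
    "qry": "What is shown?",
    "qry_image_path": "images/A-OKVQA/Train/example.jpg",
    "neg_image_path": ["images/A-OKVQA/Train/other.jpg", ""],
}
print(remap_image_paths(example, "images/", "/data/mme5/images/"))
```

If the subset is loaded with the Hugging Face `datasets` library, the same function can presumably be passed to `Dataset.map` through a lambda that supplies the two prefixes. The original record is left unmodified, so the remapping can be repeated with a different target directory.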
Provider:
maas
Created:
2025-02-14
Collected summary
Dataset introduction
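Background and challenges
Background overview
This dataset is used for the supervised fine-tuning of mmE5, a multimodal multilingual embedding model, and comprises four labeled subsets: MMEB, InfoSeek, TAT-DQA, and ArxivQA. The associated images must be prepared as specified before use.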
The above content was collected and summarized by 遇见数据集.