laughatwill/TIGER-Lab_MMEB-train
Source: Hugging Face. Updated 2026-01-30; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/laughatwill/TIGER-Lab_MMEB-train
---
language:
- en
license: apache-2.0
size_categories:
- 1M<n<10M
pretty_name: MMEB-train-lance
tags:
- embedding
- lance
- multimodal
configs:
- config_name: A-OKVQA
  data_files:
  - split: train
    path: data/A-OKVQA/train.lance/**
  - split: original
    path: data/A-OKVQA/original.lance/**
  - split: diverse_instruction
    path: data/A-OKVQA/diverse.lance/**
- config_name: ChartQA
  data_files:
  - split: train
    path: data/ChartQA/train.lance/**
  - split: original
    path: data/ChartQA/original.lance/**
  - split: diverse_instruction
    path: data/ChartQA/diverse.lance/**
- config_name: CIRR
  data_files:
  - split: train
    path: data/CIRR/train.lance/**
  - split: original
    path: data/CIRR/original.lance/**
  - split: diverse_instruction
    path: data/CIRR/diverse.lance/**
- config_name: DocVQA
  data_files:
  - split: train
    path: data/DocVQA/train.lance/**
  - split: original
    path: data/DocVQA/original.lance/**
  - split: diverse_instruction
    path: data/DocVQA/diverse.lance/**
- config_name: HatefulMemes
  data_files:
  - split: train
    path: data/HatefulMemes/train.lance/**
  - split: original
    path: data/HatefulMemes/original.lance/**
  - split: diverse_instruction
    path: data/HatefulMemes/diverse.lance/**
- config_name: ImageNet_1K
  data_files:
  - split: train
    path: data/ImageNet_1K/train.lance/**
  - split: original
    path: data/ImageNet_1K/original.lance/**
  - split: diverse_instruction
    path: data/ImageNet_1K/diverse.lance/**
- config_name: InfographicsVQA
  data_files:
  - split: train
    path: data/InfographicsVQA/train.lance/**
  - split: original
    path: data/InfographicsVQA/original.lance/**
  - split: diverse_instruction
    path: data/InfographicsVQA/diverse.lance/**
- config_name: MSCOCO
  data_files:
  - split: train
    path: data/MSCOCO/train.lance/**
  - split: original
    path: data/MSCOCO/original.lance/**
  - split: diverse_instruction
    path: data/MSCOCO/diverse.lance/**
- config_name: MSCOCO_i2t
  data_files:
  - split: train
    path: data/MSCOCO_i2t/train.lance/**
  - split: original
    path: data/MSCOCO_i2t/original.lance/**
  - split: diverse_instruction
    path: data/MSCOCO_i2t/diverse.lance/**
- config_name: MSCOCO_t2i
  data_files:
  - split: train
    path: data/MSCOCO_t2i/train.lance/**
  - split: original
    path: data/MSCOCO_t2i/original.lance/**
  - split: diverse_instruction
    path: data/MSCOCO_t2i/diverse.lance/**
- config_name: N24News
  data_files:
  - split: train
    path: data/N24News/train.lance/**
  - split: original
    path: data/N24News/original.lance/**
  - split: diverse_instruction
    path: data/N24News/diverse.lance/**
- config_name: NIGHTS
  data_files:
  - split: train
    path: data/NIGHTS/train.lance/**
  - split: original
    path: data/NIGHTS/original.lance/**
  - split: diverse_instruction
    path: data/NIGHTS/diverse.lance/**
- config_name: OK-VQA
  data_files:
  - split: train
    path: data/OK-VQA/train.lance/**
  - split: original
    path: data/OK-VQA/original.lance/**
  - split: diverse_instruction
    path: data/OK-VQA/diverse.lance/**
- config_name: SUN397
  data_files:
  - split: train
    path: data/SUN397/train.lance/**
  - split: original
    path: data/SUN397/original.lance/**
  - split: diverse_instruction
    path: data/SUN397/diverse.lance/**
- config_name: VOC2007
  data_files:
  - split: train
    path: data/VOC2007/train.lance/**
  - split: original
    path: data/VOC2007/original.lance/**
  - split: diverse_instruction
    path: data/VOC2007/diverse.lance/**
- config_name: VisDial
  data_files:
  - split: train
    path: data/VisDial/train.lance/**
  - split: original
    path: data/VisDial/original.lance/**
  - split: diverse_instruction
    path: data/VisDial/diverse.lance/**
- config_name: Visual7W
  data_files:
  - split: train
    path: data/Visual7W/train.lance/**
  - split: original
    path: data/Visual7W/original.lance/**
  - split: diverse_instruction
    path: data/Visual7W/diverse.lance/**
- config_name: VisualNews_i2t
  data_files:
  - split: train
    path: data/VisualNews_i2t/train.lance/**
  - split: original
    path: data/VisualNews_i2t/original.lance/**
  - split: diverse_instruction
    path: data/VisualNews_i2t/diverse.lance/**
- config_name: VisualNews_t2i
  data_files:
  - split: train
    path: data/VisualNews_t2i/train.lance/**
  - split: original
    path: data/VisualNews_t2i/original.lance/**
  - split: diverse_instruction
    path: data/VisualNews_t2i/diverse.lance/**
- config_name: WebQA
  data_files:
  - split: train
    path: data/WebQA/train.lance/**
  - split: original
    path: data/WebQA/original.lance/**
  - split: diverse_instruction
    path: data/WebQA/diverse.lance/**
- config_name: images
  data_files: data/images/**
---
# MMEB Training Dataset (Lance Format)
This is a **Lance-format** version of the [TIGER-Lab/MMEB-train](https://huggingface.co/datasets/TIGER-Lab/MMEB-train) dataset, optimized for efficient storage and fast random access.
The original dataset is used for training VLM2Vec models in the paper [VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks](https://arxiv.org/abs/2410.05160) (ICLR 2025).
## Directory Structure
```
TIGER-Lab_MMEB-train/
└── data/
    ├── A-OKVQA/
    │   ├── train.lance
    │   ├── original.lance
    │   └── diverse.lance
    ├── MSCOCO/
    │   └── ...
    └── images/
        ├── A-OKVQA.lance
        ├── MSCOCO.lance
        └── ...
```
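The layout above can be turned into paths programmatically. The following is a minimal sketch, not an official loader: the helper names (`metadata_path`, `images_path`, `open_lance`) are hypothetical, and it assumes every dataset follows the same naming as the `A-OKVQA` and `MSCOCO` entries shown in the tree. The `diverse_instruction` split maps to `diverse.lance`, as listed in the card's configs.

```python
from pathlib import PurePosixPath

# Split name -> file stem on disk, taken from the card's `configs` section.
SPLIT_TO_STEM = {
    "train": "train",
    "original": "original",
    "diverse_instruction": "diverse",
}


def metadata_path(root: str, dataset: str, split: str) -> str:
    """Return the Lance directory for one dataset's metadata split."""
    stem = SPLIT_TO_STEM[split]  # raises KeyError for an unknown split
    return str(PurePosixPath(root) / "data" / dataset / f"{stem}.lance")


def images_path(root: str, dataset: str) -> str:
    """Return the Lance directory holding one dataset's images.

    Assumption: each dataset has its own images/{dataset}.lance.
    """
    return str(PurePosixPath(root) / "data" / "images" / f"{dataset}.lance")


def open_lance(path: str):
    """Open a Lance dataset.

    Assumption: the `pylance` package is installed (imported lazily so the
    path helpers above work without it).
    """
    import lance

    return lance.dataset(path)


print(metadata_path("TIGER-Lab_MMEB-train", "A-OKVQA", "diverse_instruction"))
print(images_path("TIGER-Lab_MMEB-train", "MSCOCO"))
```

Keeping the split-to-file mapping in one dictionary avoids hard-coding the `diverse_instruction` / `diverse.lance` mismatch in several places.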
## Schema
### Metadata (`{dataset}/{variant}.lance`)
| Field | Type | Description |
|-------|------|-------------|
| `qry` | string | Query text (may contain `<\|image_1\|>` placeholder) |
| `qry_image_id` | string | Query image path (empty if text-only) |
| `pos_text` | string | Positive sample text |
| `pos_image_id` | string | Positive sample image path |
| `neg_text` | string | Negative sample text (optional) |
| `neg_image_id` | string | Negative sample image path (optional) |
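One way to work with these six fields in Python is a small record type. This is an illustrative sketch only: `MetadataRow` and `from_mapping` are hypothetical names, the sample values are made up, and the card does not say whether optional fields are stored as empty strings or nulls, so the sketch treats both as empty.

```python
from dataclasses import dataclass, fields

IMAGE_TOKEN = "<|image_1|>"  # placeholder that may appear in `qry`


@dataclass
class MetadataRow:
    qry: str
    qry_image_id: str = ""
    pos_text: str = ""
    pos_image_id: str = ""
    neg_text: str = ""
    neg_image_id: str = ""

    @property
    def query_has_image(self) -> bool:
        """True when the query side references an image."""
        return bool(self.qry_image_id)

    def referenced_image_ids(self) -> list:
        """All non-empty image ids (query, positive, negative) in that order."""
        ids = [self.qry_image_id, self.pos_image_id, self.neg_image_id]
        return [i for i in ids if i]


def from_mapping(row: dict) -> MetadataRow:
    """Build a row from a dict; missing or None fields become ''."""
    return MetadataRow(**{f.name: (row.get(f.name) or "") for f in fields(MetadataRow)})


# Made-up example row, for illustration only.
example = from_mapping({
    "qry": f"{IMAGE_TOKEN} What is shown in the picture?",
    "qry_image_id": "images/example/0001.jpg",
    "pos_text": "a cat",
    "neg_text": None,
})
print(example.query_has_image, example.referenced_image_ids())
```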
### Images (`images/{dataset}.lance`)
| Field | Type | Description |
|-------|------|-------------|
| `image_id` | string | Image path identifier |
| `data` | binary | Image binary data (JPEG) |
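Since metadata rows carry only image ids, the image bytes have to be looked up in the images table by `image_id`. The sketch below shows that join on plain dictionaries. The function names are hypothetical, and the JPEG check relies only on the standard JPEG start-of-image marker (`FF D8 FF`).

```python
JPEG_MAGIC = b"\xff\xd8\xff"  # standard JPEG start-of-image marker


def build_image_index(image_rows):
    """Map image_id -> raw bytes from an iterable of {'image_id', 'data'} dicts."""
    return {row["image_id"]: row["data"] for row in image_rows}


def looks_like_jpeg(data: bytes) -> bool:
    """Cheap sanity check on the first three bytes; does not decode the image."""
    return data[:3] == JPEG_MAGIC


def resolve_image(index, image_id):
    """Return the bytes for image_id, or None when the id is empty (text-only side)."""
    if not image_id:
        return None
    return index[image_id]


# Made-up rows standing in for records read from images/{dataset}.lance.
rows = [{"image_id": "images/example/0001.jpg", "data": JPEG_MAGIC + b"\xe0..."}]
index = build_image_index(rows)
payload = resolve_image(index, "images/example/0001.jpg")
print(looks_like_jpeg(payload), resolve_image(index, ""))
```

With the `lance` package, rows in this shape could come from `lance.dataset(path).to_table().to_pylist()`, though that loads every image into memory; for the larger datasets a batched or indexed read is likely a better fit.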
## Dataset Statistics
| Dataset | Samples | Images |
|---------|---------|--------|
| A-OKVQA | 17,056 | 17,056 |
| ChartQA | 28,299 | 28,299 |
| CIRR | 26,116 | 16,640 |
| DocVQA | 39,463 | 39,463 |
| HatefulMemes | 8,500 | 8,500 |
| ImageNet_1K | 100,000 | 100,000 |
| InfographicsVQA | 23,946 | 4,406 |
| MSCOCO | 100,000 | 59,969 |
| MSCOCO_i2t | 113,287 | 113,287 |
| MSCOCO_t2i | 100,000 | 70,414 |
| N24News | 48,988 | 48,988 |
| NIGHTS | 15,941 | 31,882 |
| OK-VQA | 9,009 | 9,009 |
| SUN397 | 19,850 | 19,850 |
| VisDial | 123,287 | 123,287 |
| Visual7W | 69,817 | 14,366 |
| VisualNews_i2t | 100,000 | 100,000 |
| VisualNews_t2i | 99,903 | 99,903 |
| VOC2007 | 7,844 | 7,844 |
| WebQA | 17,166 | 12,873 |
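Each dataset has three variants, exposed as the splits `train`, `original`, and `diverse_instruction` (stored as `train.lance`, `original.lance`, and `diverse.lance`). All three have the same sample count and differ only in their instruction templates.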
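The per-dataset figures can be aggregated to see the overall scale. The sketch below copies the counts from the table and sums them. It assumes, based on the directory layout, that images are stored once per dataset rather than once per variant.

```python
# (samples, images) per dataset, copied from the statistics table.
STATS = {
    "A-OKVQA": (17_056, 17_056),
    "ChartQA": (28_299, 28_299),
    "CIRR": (26_116, 16_640),
    "DocVQA": (39_463, 39_463),
    "HatefulMemes": (8_500, 8_500),
    "ImageNet_1K": (100_000, 100_000),
    "InfographicsVQA": (23_946, 4_406),
    "MSCOCO": (100_000, 59_969),
    "MSCOCO_i2t": (113_287, 113_287),
    "MSCOCO_t2i": (100_000, 70_414),
    "N24News": (48_988, 48_988),
    "NIGHTS": (15_941, 31_882),
    "OK-VQA": (9_009, 9_009),
    "SUN397": (19_850, 19_850),
    "VisDial": (123_287, 123_287),
    "Visual7W": (69_817, 14_366),
    "VisualNews_i2t": (100_000, 100_000),
    "VisualNews_t2i": (99_903, 99_903),
    "VOC2007": (7_844, 7_844),
    "WebQA": (17_166, 12_873),
}
VARIANTS = ("train", "original", "diverse_instruction")

samples_per_variant = sum(samples for samples, _ in STATS.values())
total_rows = samples_per_variant * len(VARIANTS)  # same count in every variant
total_images = sum(images for _, images in STATS.values())


def samples_per_image(name: str) -> float:
    """Average number of samples that share one image in a dataset."""
    samples, images = STATS[name]
    return samples / images


print(samples_per_variant, total_rows, total_images)
print(round(samples_per_image("InfographicsVQA"), 2), samples_per_image("NIGHTS"))
```

With the figures above, each variant holds 1,068,472 samples, so the three variants together give 3,205,416 metadata rows, consistent with the `1M<n<10M` size category. A ratio above 1 (for example InfographicsVQA or Visual7W) means several samples reuse the same image, while NIGHTS lists two images per sample.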
## Original Dataset
This dataset is derived from [TIGER-Lab/MMEB-train](https://huggingface.co/datasets/TIGER-Lab/MMEB-train). For evaluation, please refer to [TIGER-Lab/MMEB-eval](https://huggingface.co/datasets/TIGER-Lab/MMEB-eval).
## Citation
```bibtex
@article{jiang2024vlm2vec,
  title={VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks},
  author={Jiang, Ziyan and Meng, Rui and Yang, Xinyi and Yavuz, Semih and Zhou, Yingbo and Chen, Wenhu},
  journal={arXiv preprint arXiv:2410.05160},
  year={2024}
}
```
## License
Apache-2.0 (same as the original dataset)
Provider: laughatwill



