
laughatwill/TIGER-Lab_MMEB-train

Source: Hugging Face. Updated 2026-01-30; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/laughatwill/TIGER-Lab_MMEB-train
Description:
---
language:
- en
license: apache-2.0
size_categories:
- 1M<n<10M
pretty_name: MMEB-train-lance
tags:
- embedding
- lance
- multimodal
configs:
- config_name: A-OKVQA
  data_files:
  - split: train
    path: data/A-OKVQA/train.lance/**
  - split: original
    path: data/A-OKVQA/original.lance/**
  - split: diverse_instruction
    path: data/A-OKVQA/diverse.lance/**
- config_name: ChartQA
  data_files:
  - split: train
    path: data/ChartQA/train.lance/**
  - split: original
    path: data/ChartQA/original.lance/**
  - split: diverse_instruction
    path: data/ChartQA/diverse.lance/**
- config_name: CIRR
  data_files:
  - split: train
    path: data/CIRR/train.lance/**
  - split: original
    path: data/CIRR/original.lance/**
  - split: diverse_instruction
    path: data/CIRR/diverse.lance/**
- config_name: DocVQA
  data_files:
  - split: train
    path: data/DocVQA/train.lance/**
  - split: original
    path: data/DocVQA/original.lance/**
  - split: diverse_instruction
    path: data/DocVQA/diverse.lance/**
- config_name: HatefulMemes
  data_files:
  - split: train
    path: data/HatefulMemes/train.lance/**
  - split: original
    path: data/HatefulMemes/original.lance/**
  - split: diverse_instruction
    path: data/HatefulMemes/diverse.lance/**
- config_name: ImageNet_1K
  data_files:
  - split: train
    path: data/ImageNet_1K/train.lance/**
  - split: original
    path: data/ImageNet_1K/original.lance/**
  - split: diverse_instruction
    path: data/ImageNet_1K/diverse.lance/**
- config_name: InfographicsVQA
  data_files:
  - split: train
    path: data/InfographicsVQA/train.lance/**
  - split: original
    path: data/InfographicsVQA/original.lance/**
  - split: diverse_instruction
    path: data/InfographicsVQA/diverse.lance/**
- config_name: MSCOCO
  data_files:
  - split: train
    path: data/MSCOCO/train.lance/**
  - split: original
    path: data/MSCOCO/original.lance/**
  - split: diverse_instruction
    path: data/MSCOCO/diverse.lance/**
- config_name: MSCOCO_i2t
  data_files:
  - split: train
    path: data/MSCOCO_i2t/train.lance/**
  - split: original
    path: data/MSCOCO_i2t/original.lance/**
  - split: diverse_instruction
    path: data/MSCOCO_i2t/diverse.lance/**
- config_name: MSCOCO_t2i
  data_files:
  - split: train
    path: data/MSCOCO_t2i/train.lance/**
  - split: original
    path: data/MSCOCO_t2i/original.lance/**
  - split: diverse_instruction
    path: data/MSCOCO_t2i/diverse.lance/**
- config_name: N24News
  data_files:
  - split: train
    path: data/N24News/train.lance/**
  - split: original
    path: data/N24News/original.lance/**
  - split: diverse_instruction
    path: data/N24News/diverse.lance/**
- config_name: NIGHTS
  data_files:
  - split: train
    path: data/NIGHTS/train.lance/**
  - split: original
    path: data/NIGHTS/original.lance/**
  - split: diverse_instruction
    path: data/NIGHTS/diverse.lance/**
- config_name: OK-VQA
  data_files:
  - split: train
    path: data/OK-VQA/train.lance/**
  - split: original
    path: data/OK-VQA/original.lance/**
  - split: diverse_instruction
    path: data/OK-VQA/diverse.lance/**
- config_name: SUN397
  data_files:
  - split: train
    path: data/SUN397/train.lance/**
  - split: original
    path: data/SUN397/original.lance/**
  - split: diverse_instruction
    path: data/SUN397/diverse.lance/**
- config_name: VOC2007
  data_files:
  - split: train
    path: data/VOC2007/train.lance/**
  - split: original
    path: data/VOC2007/original.lance/**
  - split: diverse_instruction
    path: data/VOC2007/diverse.lance/**
- config_name: VisDial
  data_files:
  - split: train
    path: data/VisDial/train.lance/**
  - split: original
    path: data/VisDial/original.lance/**
  - split: diverse_instruction
    path: data/VisDial/diverse.lance/**
- config_name: Visual7W
  data_files:
  - split: train
    path: data/Visual7W/train.lance/**
  - split: original
    path: data/Visual7W/original.lance/**
  - split: diverse_instruction
    path: data/Visual7W/diverse.lance/**
- config_name: VisualNews_i2t
  data_files:
  - split: train
    path: data/VisualNews_i2t/train.lance/**
  - split: original
    path: data/VisualNews_i2t/original.lance/**
  - split: diverse_instruction
    path: data/VisualNews_i2t/diverse.lance/**
- config_name: VisualNews_t2i
  data_files:
  - split: train
    path: data/VisualNews_t2i/train.lance/**
  - split: original
    path: data/VisualNews_t2i/original.lance/**
  - split: diverse_instruction
    path: data/VisualNews_t2i/diverse.lance/**
- config_name: WebQA
  data_files:
  - split: train
    path: data/WebQA/train.lance/**
  - split: original
    path: data/WebQA/original.lance/**
  - split: diverse_instruction
    path: data/WebQA/diverse.lance/**
- config_name: images
  data_files: data/images/**
---

# MMEB Training Dataset (Lance Format)

This is a **Lance-format** version of the [TIGER-Lab/MMEB-train](https://huggingface.co/datasets/TIGER-Lab/MMEB-train) dataset, optimized for efficient storage and fast random access.

The original dataset is used for training VLM2Vec models in the paper [VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks](https://arxiv.org/abs/2410.05160) (ICLR 2025).

## Directory Structure

```
TIGER-Lab_MMEB-train/
└── data/
    ├── A-OKVQA/
    │   ├── train.lance
    │   ├── original.lance
    │   └── diverse.lance
    ├── MSCOCO/
    │   └── ...
    └── images/
        ├── A-OKVQA.lance
        ├── MSCOCO.lance
        └── ...
```

## Schema

### Metadata (`{dataset}/{variant}.lance`)

| Field | Type | Description |
|-------|------|-------------|
| `qry` | string | Query text (may contain `<\|image_1\|>` placeholder) |
| `qry_image_id` | string | Query image path (empty if text-only) |
| `pos_text` | string | Positive sample text |
| `pos_image_id` | string | Positive sample image path |
| `neg_text` | string | Negative sample text (optional) |
| `neg_image_id` | string | Negative sample image path (optional) |

### Images (`images/{dataset}.lance`)

| Field | Type | Description |
|-------|------|-------------|
| `image_id` | string | Image path identifier |
| `data` | binary | Image binary data (JPEG) |

## Dataset Statistics

| Dataset | Samples | Images |
|---------|---------|--------|
| A-OKVQA | 17,056 | 17,056 |
| ChartQA | 28,299 | 28,299 |
| CIRR | 26,116 | 16,640 |
| DocVQA | 39,463 | 39,463 |
| HatefulMemes | 8,500 | 8,500 |
| ImageNet_1K | 100,000 | 100,000 |
| InfographicsVQA | 23,946 | 4,406 |
| MSCOCO | 100,000 | 59,969 |
| MSCOCO_i2t | 113,287 | 113,287 |
| MSCOCO_t2i | 100,000 | 70,414 |
| N24News | 48,988 | 48,988 |
| NIGHTS | 15,941 | 31,882 |
| OK-VQA | 9,009 | 9,009 |
| SUN397 | 19,850 | 19,850 |
| VisDial | 123,287 | 123,287 |
| Visual7W | 69,817 | 14,366 |
| VisualNews_i2t | 100,000 | 100,000 |
| VisualNews_t2i | 99,903 | 99,903 |
| VOC2007 | 7,844 | 7,844 |
| WebQA | 17,166 | 12,873 |

Each dataset has 3 variants: `train`, `original`, and `diverse_instruction` (same sample count, different instruction templates).

## Original Dataset

This dataset is derived from [TIGER-Lab/MMEB-train](https://huggingface.co/datasets/TIGER-Lab/MMEB-train). For evaluation, please refer to [TIGER-Lab/MMEB-eval](https://huggingface.co/datasets/TIGER-Lab/MMEB-eval).
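
## Example: Resolving Image IDs

The schema and directory layout above suggest a two-step read: load a row from a metadata table, then look up each non-empty `*_image_id` in the matching image table. The following is a minimal Python sketch of that join, not an official loader. The helper names (`table_paths`, `resolve_sample`, `lance_image_fetcher`) are hypothetical, the `pylance` package is assumed to be installed for the last helper, and it is an assumption that every image ID of a dataset resolves in `images/{dataset}.lance` and that single quotes in a filter string are escaped SQL-style by doubling them.

```python
from typing import Callable, Dict, Optional

IMAGE_TOKEN = "<|image_1|>"  # placeholder named in the metadata schema

# The split name and the file name differ for one variant (see the configs block).
VARIANT_FILES = {
    "train": "train.lance",
    "original": "original.lance",
    "diverse_instruction": "diverse.lance",
}


def table_paths(root: str, dataset: str, split: str) -> Dict[str, str]:
    """Build local paths for one metadata table and its image table."""
    return {
        "metadata": f"{root}/data/{dataset}/{VARIANT_FILES[split]}",
        "images": f"{root}/data/images/{dataset}.lance",
    }


def resolve_sample(row: dict, fetch_image: Callable[[str], Optional[bytes]]) -> dict:
    """Attach image bytes for every non-empty *_image_id field of one row."""
    out = dict(row)
    for side in ("qry", "pos", "neg"):
        image_id = row.get(f"{side}_image_id") or ""
        # Empty IDs mark text-only fields, so no lookup is attempted for them.
        out[f"{side}_image"] = fetch_image(image_id) if image_id else None
    out["qry_has_image_token"] = IMAGE_TOKEN in (row.get("qry") or "")
    return out


def lance_image_fetcher(images_path: str) -> Callable[[str], Optional[bytes]]:
    """Return a lookup backed by a Lance image table (needs `pip install pylance`)."""
    import lance  # imported lazily so the helpers above work without it

    images = lance.dataset(images_path)

    def fetch(image_id: str) -> Optional[bytes]:
        escaped = image_id.replace("'", "''")  # assumption: SQL-style quote escaping
        table = images.to_table(columns=["data"], filter=f"image_id = '{escaped}'")
        rows = table.to_pylist()
        return rows[0]["data"] if rows else None

    return fetch
```

With a local copy of the repository, `lance_image_fetcher(table_paths(root, "A-OKVQA", "train")["images"])` would supply the `fetch_image` argument; any other callable that maps an image ID to bytes (for example `dict.get` on an in-memory cache) works the same way, which keeps `resolve_sample` independent of the storage backend.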

## Citation

```bibtex
@article{jiang2024vlm2vec,
  title={VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks},
  author={Jiang, Ziyan and Meng, Rui and Yang, Xinyi and Yavuz, Semih and Zhou, Yingbo and Chen, Wenhu},
  journal={arXiv preprint arXiv:2410.05160},
  year={2024}
}
```

## License

Apache-2.0 (same as the original dataset)

Provider:
laughatwill