Download link:

https://modelscope.cn/datasets/OpenGVLab/V2PE-Data

Resource description:

# V2PE-Data

[\[📂 GitHub\]](https://github.com/OpenGVLab/V2PE) [\[🆕 Blog\]](https://zzdhybthu.github.io/V2PE.github.io/) [\[📜 Paper\]](https://arxiv.org/abs/2412.09616) [\[🤗 HF Models\]](https://huggingface.co/OpenGVLab/V2PE)

![image.png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/ewbZmWctNv-uLFlnMCGK9.png)

## Summary

We introduce two augmented long-context multimodal datasets: **Long Visual Question Answering** and **Long Multimodal Retrieval**. These datasets aim to enhance VLMs' long-context training and establish a systematic evaluation framework, thereby addressing long-context understanding challenges that extend beyond the scope of existing training data.

![image.png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/93ts7Q204GAX-Lu6tLnY8.png)

- **Long Visual Question Answering (Long-VQA):** The Long-VQA dataset evaluates the capabilities of VLMs in understanding and reasoning over long multimodal sequences within general visual question-answering tasks. We extended 17 widely adopted datasets (e.g., DocVQA, GQA, SQA), expanding their content from short sequences to sequences of up to 32K tokens. The tasks involve answering questions that require commonsense reasoning, factual knowledge, and interpretation of visual information from charts, documents, and real-world texts. Long-VQA contains 533K samples: 392K for training (up to 32K tokens) and 141K for validation (up to 64K tokens), the latter used to evaluate generalization to longer contexts.

![image.png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/gkfXER4GLtFGYpjQ0gu7G.png)

- **Long Multimodal Retrieval (Long-MR):** We developed Long-MR by inserting a target image or textual segment into sequences of interleaved images and texts. Long-MR evaluates VLMs' ability to retrieve specific targets from ultra-long multimodal sequences, requiring models to locate the inserted "needle" and answer associated questions. Following the data construction process of MM-NIAH, we generated two subsets of Long-MR: Long-MR-32K (488K samples, sequences up to 32K tokens) and Long-MR-256K (50K samples, sequences up to 256K tokens). To assess the limits of VLMs' long-context capabilities, we further extended the official MM-NIAH evaluation benchmark by generating test samples with sequence lengths ranging from 64K to 1M tokens, resulting in the MM-NIAH-1M benchmark. This extension pushes the testing capacity beyond the original MM-NIAH, which was limited to sequences of up to 64K tokens.

![image.png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/mEpfOPY0gue_BHDDNCOMH.png)

Please refer to our [paper](https://arxiv.org/abs/2412.09616) for more details.

## Evaluation Results of [Released Model](https://huggingface.co/OpenGVLab/V2PE)

**General MLLM Benchmarks**

| Model | #Param | ChartQA | DocVQA | AI2D | InfoVQA | SQA | POPE | MMMU-val | MMBench-EN | SEED-I | Avg |
|------------------------|------|------|------|------|------|------|------|------|------|------|------|
| InternVL2-2B | 2.0B | 71.7 | 86.9 | 74.1 | 58.9 | 94.1 | 85.2 | 36.3 | 73.4 | 70.9 | 72.4 |
| DeepSeek-VL-1.3B | 2.0B | 47.4 | - | 51.5 | - | 68.4 | 85.9 | 33.8 | 66.4 | 66.0 | - |
| Qwen2-VL-2B | 2.0B | 73.5 | 90.1 | 74.7 | 65.5 | - | - | 41.1 | 74.9 | - | - |
| Aquila-VL-2B | 2.2B | 32.0 | 85.0 | 75.1 | 58.3 | 95.1 | 83.1 | 46.9 | 79.0 | 73.9 | 69.8 |
| MiniCPM-V-2 | 2.8B | 55.6 | 71.9 | 62.9 | - | 80.7 | 86.3 | 38.2 | 64.1 | 67.1 | - |
| Vintern-3B-beta | 3.7B | 68.3 | - | 69.1 | - | 75.0 | 87.4 | 46.7 | 70.6 | 70.0 | - |
| Llama 3.2 11B | 11B | 83.4 | 88.4 | 91.1 | - | - | - | 50.7 | 68.0 | - | - |
| Qwen2-VL-72B | 73B | 88.3 | 96.5 | 88.1 | 84.5 | 91.2 | 87.2 | 64.5 | 86.9 | 77.9 | 85.0 |
| GPT-4o | - | 85.7 | 92.8 | 84.7 | - | 90.1 | 97.2 | 69.1 | 82.1 | 76.7 | - |
| **InternVL2-V2PE-32K** | 2.0B | **76.4** | **83.9** | **73.2** | **55.9** | **94.9** | **88.8** | **36.6** | **73.5** | **71.2** | **72.5** |

**Long-Context MLLM Benchmarks**

| Model | #Param | MM-NIAH/Image | MM-NIAH/Text | MM-NIAH/Avg | Milebench/T | Milebench/S | Milebench/NI | Milebench/Avg | VideoMME | MVBench |
|------------------------|------|------|------|------|------|------|------|------|------|------|
| InternVL2-2B | 2.0B | 23.0 | 18.9 | 21.0 | 58.2 | 54.5 | 37.0 | 49.9 | - | - |
| Phi-3-Vision | 2.7B | - | - | - | 46.9 | 50.0 | - | - | - | - |
| OmChat | 3.9B | - | - | - | 51.4 | 52.0 | - | - | 45.9 | 50.2 |
| LongLLaVA | 9B | - | - | - | 47.3 | 46.8 | - | - | 43.7 | 49.1 |
| LongLLaVA | 13B | - | - | - | 52.7 | 52.1 | - | - | 51.6 | 54.6 |
| VILA | 13B | 14.5 | 40.5 | 27.5 | - | - | - | - | - | - |
| Gemini-1.5 | - | 28.5 | 82.1 | 55.2 | 50.2 | 58.3 | 97.9 | **68.8** | **69.6** | - |
| GPT-4V | - | - | 84.1 | - | 45.6 | 58.9 | **99.4** | 68.0 | 59.9 | 43.5 |
| GPT-4o | - | - | - | - | 56.2 | **63.5** | - | - | 64.7 | - |
| Claude3-Opus | - | - | - | - | 37.4 | 48.1 | 85.3 | 56.9 | 59.7 | - |
| **InternVL2-V2PE-32K** | 2.0B | **78.1** | **85.7** | **81.8** | **65.5** | 56.4 | 97.2 | 72.5 | 50.7 | **65.6** |

## Usage

Please refer to our [GitHub Repo](https://github.com/OpenGVLab/V2PE?tab=readme-ov-file#prepare-training-datasets).
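As a rough illustration of the Long-MR construction described in the Summary (inserting a target into a sequence of interleaved images and texts), here is a minimal Python sketch. It is not the authors' pipeline or the MM-NIAH code: the `Segment` type, the `insert_needle` and `build_sample` helpers, the `depth` parameter, and the sample dictionary layout are all hypothetical names chosen for this example, and real samples are sized in tokens rather than in segments.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Segment:
    """One element of an interleaved sequence (hypothetical structure)."""
    kind: str      # "text" or "image"
    content: str   # raw text, or a placeholder image path


def insert_needle(
    haystack: List[Segment], needle: Segment, depth: float
) -> Tuple[List[Segment], int]:
    """Insert `needle` at a relative depth in [0, 1] of the haystack.

    Returns a new sequence (the input list is left untouched) and the
    index at which the needle was placed.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must lie in [0, 1]")
    index = int(depth * len(haystack))  # 0.0 -> front, 1.0 -> end
    return haystack[:index] + [needle] + haystack[index:], index


def build_sample(
    haystack: List[Segment],
    needle: Segment,
    question: str,
    answer: str,
    rng: random.Random,
) -> Dict[str, object]:
    """Place the needle at a random depth and attach the question/answer pair."""
    context, index = insert_needle(haystack, needle, rng.random())
    return {
        "context": context,
        "question": question,
        "answer": answer,
        "needle_index": index,  # kept only so the placement can be inspected
    }
```

A usage example under the same assumptions: build a haystack such as `[Segment("text", "..."), Segment("image", "img0.jpg"), ...]`, define a text needle like `Segment("text", "The secret code is 7.")`, and call `build_sample(haystack, needle, "What is the secret code?", "7", random.Random(0))`. Passing a seeded `random.Random` makes the needle placement reproducible.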
## Citation

If you find this work helpful in your research, please consider citing:

```bibtex
@misc{ge2024v2peimprovingmultimodallongcontext,
      title={V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding},
      author={Junqi Ge and Ziyi Chen and Jintao Lin and Jinguo Zhu and Xihui Liu and Jifeng Dai and Xizhou Zhu},
      year={2024},
      eprint={2412.09616},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.09616},
}
```


Application scenarios: