VTCBench

Name: VTCBench
Creator: maas
Published: 2026-01-05 20:57:41
License: No description available

ModelScope community · Updated 2026-01-05 · Indexed 2026-01-10

Download link:

https://modelscope.cn/datasets/MLLM-CL/VTCBench


Resource description:

<p align="center">
<a href="https://arxiv.org/abs/2512.15649">arXiv: 2512.15649</a> |
<a href="https://huggingface.co/datasets/MLLM-CL/VTCBench">Hugging Face</a> |
<a href="https://modelscope.cn/datasets/MLLM-CL/VTCBench">ModelScope</a> |
<a href="https://creativecommons.org/licenses/by-nc/4.0/">License: CC BY-NC 4.0</a> |
<a href="./CITATION.cff">Citation</a> |
<a href="https://github.com/Moenupa/VTCBench">github.com/Moenupa/VTCBench</a> |
<a href="https://github.com/bjzhb666/VLMEvalKit">github.com/bjzhb666/VLMEvalKit</a>
</p>

# Dataset Card for VTCBench

[**Vision-Text Compression Benchmark** (VTCBench)][homepage] revisits Needle-In-A-Haystack (NIAH) from a VLM's perspective by converting long context into rendered images. This benchmark tests a VLM's ability to OCR, retrieve, aggregate, infer, and memorize long context presented as images. Specifically, it includes 3 tasks:

- *Retrieval*: Vision-NIAH VQA task for information retrieval and aggregation.
- *Reasoning*: Vision-NIAH VQA task for associative reasoning with general knowledge.
- *Memory*: VQA task for memorizing and understanding long cohesive dialogues.

[homepage]: https://moenupa.github.io/VTCBench

## Dataset Details

This repo contains the **wild version** of VTCBench: a diverse, image-ready, static VLM benchmark featuring multiple fonts, font sizes, and line spacings, ready for direct evaluation without any dataset generation.

Please refer to our [GitHub][ourgithub] for the full VTCBench with controllable text-to-image rendering and an evaluation pipeline.

[ourgithub]: https://github.com/moenupa/VTCBench

## Uses

### Direct Use

Direct evaluation.

```python
from datasets import load_dataset

# problem: str
# images: list[dict[str,bytes]], e.g., `[{"bytes": b'xxxxxx'}]`
hf_dataset = load_dataset("MLLM-CL/VTCBench", columns=["problem", "answers", "images"])

# generate pred: str (`llm` is a placeholder for your own VLM client)
output = llm.generate(...)

# evaluate against ground-truth on a `should-contain-all-gts` basis
# answers: list[str]
metric = contains_all(output, answers)
```

A simple metric example looks like:

```python
# check if pred contains **ALL** of the gts
def contains_all(pred: str, gts: list[str]) -> float:
    hits = sum(each_gt in pred for each_gt in gts)
    total = len(gts)
    return hits / total
```

### Out-of-Scope Use

Regenerate data.
We maintain metadata in columns whose names start with `_`. Specifically:

- `_context: str` is the text equivalent of the `images` column, i.e., the raw context before it is rendered into images; some entries may be HTML.
- `_render_args: str` (a dict-dumped string) controls the rendering operator, i.e., text-to-image. For example, its `pagesize: tuple[int, int]` field sets the image size (`pagesize=(512,512)` for `512x512`px images), and its `css: str` field adjusts font size and spacing (`css="*{font-size:12px;}"` yields 12px text).
- `_source: str` (a dict-dumped string) is row-level metadata describing, for example, what the needle and haystack are, which in turn controls how `_context` is generated.

You may regenerate the images, or the images-question-answers triplet entirely. You may refer to [how we generate images][ourgithub].

## Dataset Creation

### Curation Rationale

NIAH benchmarks like [RULER][gitruler] and [NoLiMa][gitnolima] provide flexibility, and therefore randomness, in the dataset: each sample is a permutation of random needles and random haystacks. Vision-NIAH adds another layer of random rendering parameters on top of NIAH, which makes benchmarking and reproduction difficult. We hope to mitigate the randomness caused by the dataset by curating a **small-scale standard static VQA** benchmark, **VTCBench-Wild**, uniformly sampled from all the permutations stated above, to represent the whole VTCBench as much as possible.

### Source Data

We generate VTCBench from classic NIAH datasets or long-term memory datasets.

|   VTCBench    |       Dataset       |    Metric     |      Needle      |   Haystack    | Evaluated by  |            License             |
| :-----------: | :-----------------: | :-----------: | :--------------: | :-----------: | :-----------: | :----------------------------: |
| VTC-Retrieval |  [RULER][gitruler]  |  `contains`   | word/uuid/number |     essay     | Completion/QA |   [Apache-2.0][gitrulerLCS]    |
| VTC-Reasoning | [NoLiMa][gitnolima] | `containsAll` | character/event  |     book      |      QA       | [Adobe Research][gitnolimaLCS] |
|  VTC-Memory   | [LoCoMo][gitlocomo] |   `ROUGE-L`   |       _NA_       | conversations |      QA       |  [CC BY-NC 4.0][gitlocomoLCS]  |

[gitruler]: https://github.com/NVIDIA/RULER
[gitrulerLCS]: https://github.com/NVIDIA/RULER/blob/main/LICENSE
[gitnolima]: https://github.com/Adobe-Research/NoLiMa
[gitnolimaLCS]: https://github.com/Adobe-Research/NoLiMa/blob/main/LICENSE
[hfnolima]: https://huggingface.co/datasets/amodaresi/NoLiMa
[gitlocomo]: https://github.com/snap-research/locomo
[gitlocomoLCS]: https://github.com/snap-research/locomo/blob/main/LICENSE.txt

#### Data Collection and Processing

Consider a data generation pipeline like this:

- `stage1`: seeds (random needle, random haystack)
- `stage2`: text context-with-question
- `stage3`: images-with-question

Transformations:

- `operator1: stage1-->stage2`: random (needle, haystack) selection and placeholder filling.
- `operator2: stage2-->stage3`: text-to-image (i.e., rendering by render_args).

Since [RULER][gitruler] generates needles dynamically, we eliminate its randomness by manually pre-generating (and therefore pre-determining) our own text-form version in [our RULER repo](https://huggingface.co/datasets/MLLM-CL/RULER) that conforms to the [NoLiMa][hfnolima] format. The other two have no randomness before stage1.
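
The two operators above can be pictured as plain functions. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function bodies, the `{haystack} {needle}` template syntax, the needle dictionary layout, and the toy seeds are all hypothetical, and the rendering step is stubbed out because the real text-to-image code lives in the full pipeline on GitHub.

```python
import random


def operator1(needles: list[dict], haystacks: list[str], template: str,
              rng: random.Random) -> dict:
    """stage1 --> stage2: pick a (needle, haystack) pair and fill the template."""
    needle = rng.choice(needles)
    haystack = rng.choice(haystacks)
    return {
        "_context": template.format(haystack=haystack, needle=needle["text"]),
        "problem": needle["question"],
        "answers": needle["answers"],
    }


def operator2(sample: dict, render_args: dict) -> dict:
    """stage2 --> stage3: text-to-image. Stub only: no image is rendered here."""
    return {**sample, "_render_args": repr(render_args), "images": []}


# Toy seeds, invented for this example (hypothetical field names).
needles = [
    {"text": "The magic number for Ada is 7.",
     "question": "What is Ada's magic number?", "answers": ["7"]},
    {"text": "The magic number for Bob is 3.",
     "question": "What is Bob's magic number?", "answers": ["3"]},
]
haystacks = ["Essay one.", "Essay two."]
template = "{haystack} {needle}"

# Re-using the same seed reproduces the same sample, which is the point of
# freezing stage1 before any rendering happens.
first = operator1(needles, haystacks, template, random.Random(0))
second = operator1(needles, haystacks, template, random.Random(0))
stage3 = operator2(first, {"pagesize": (512, 512), "css": "*{font-size:12px;}"})
```

Passing an explicit `random.Random` instance rather than using the module-level generator keeps each draw reproducible and independent of any other code that touches the global random state.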

After freezing results from stage1, we uniformly sample operators after permuting operator1 (2 DOF, needle and haystack) and operator2 (3 DOF, including font, font size, and line spacing), resulting in:

- Retrieval: 800 examples
- Reasoning: 800 examples
- Memory: 600 examples

## Bias, Risks, and Limitations

1. The `problem` does not include any instruction prompt. You may refer to the original NIAH's implementation or our [evaluation framework](https://github.com/Moenupa/VTCBench/blob/7c6ca236bc5f9078db48bd63f89c1013f9270a26/examples/run_wild.py#L17-L39).
2. VTCBench-Wild is merely a subset of all rendering formats. We include permutations in 3 aspects `fonts={"Helvetica", "Times New Roman", "Courier New"}, font-size=[10,20], line-spacing={1,1.2,1.5}`, from which we sample a total of ~5k samples to form VTCBench-Wild. There is a much greater number of permutations in reality, but we accept this limitation and prioritize cost-effectiveness.

## Citation

```
@misc{zhao2025vtcbench,
  title={{VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?}},
  author={Hongbo Zhao and Meng Wang and Fei Zhu and Wenzhuo Liu and Bolin Ni and Fanhu Zeng and Gaofeng Meng and Zhaoxiang Zhang},
  year={2025},
  eprint={2512.15649},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.15649},
}
```
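
As a supplementary illustration of the rendering grid listed under Bias, Risks, and Limitations, the sketch below enumerates the font, font-size, and line-spacing combinations and draws a reproducible uniform sample from them. It rests on several assumptions that the card does not confirm: `font-size=[10,20]` is read as an inclusive integer range in pixels, sampling is done with replacement, and the helper names (`render_grid`, `sample_uniform`, `to_css`) as well as the exact CSS string are hypothetical rather than part of the VTCBench code.

```python
import itertools
import random

FONTS = ["Helvetica", "Times New Roman", "Courier New"]
FONT_SIZES = list(range(10, 21))  # assumption: [10,20] is an inclusive integer range
LINE_SPACINGS = [1, 1.2, 1.5]


def render_grid() -> list[dict]:
    """Enumerate every (font, font size, line spacing) combination."""
    return [
        {"font": font, "font_size": size, "line_spacing": spacing}
        for font, size, spacing in itertools.product(FONTS, FONT_SIZES, LINE_SPACINGS)
    ]


def sample_uniform(grid: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw k configurations uniformly at random (with replacement), reproducibly."""
    rng = random.Random(seed)
    return [rng.choice(grid) for _ in range(k)]


def to_css(cfg: dict) -> str:
    """Turn one configuration into a CSS string in the style of the `css` field."""
    return (
        f"*{{font-family:'{cfg['font']}';"
        f"font-size:{cfg['font_size']}px;"
        f"line-height:{cfg['line_spacing']};}}"
    )


grid = render_grid()
picked = sample_uniform(grid, k=5, seed=1)
css_strings = [to_css(cfg) for cfg in picked]
```

Under these assumptions the grid has 3 × 11 × 3 = 99 rendering configurations; fixing the seed is what makes a sampled subset static across runs.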


Provider:

maas

Created:

2025-12-18
