Alibaba-NLP/xvbench

Name: Alibaba-NLP/xvbench
Creator: Alibaba-NLP
Published: 2026-04-27 08:15:59
License: CC BY 4.0 (per the dataset card below)

Source: Hugging Face. Updated 2026-04-27; indexed 2026-05-10.

Download link:

https://hf-mirror.com/datasets/Alibaba-NLP/xvbench

Description:

---
license: cc-by-4.0
task_categories:
- question-answering
language:
- en
tags:
- multimodal
- video
- howto100m
- retrieval-augmented-generation
- visual-question-answering
- cross-video-understanding
size_categories:
- n<1K
configs:
- config_name: default
  data_files:
  - split: test
    path: xvbench.jsonl
---

# XVBench

<a href="https://arxiv.org/pdf/2602.12735v1" target="_blank"><img src=https://img.shields.io/badge/arXiv-paper_VimRAG-red></a>
<a href="https://huggingface.co/collections/Alibaba-NLP/vrag" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-VRAG_Collection-blue></a>

XVBench is a benchmark for evaluating multimodal retrieval-augmented generation systems on cross-video understanding. It is introduced alongside the paper *VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph*.

The questions in XVBench are built from videos in [HowTo100M](https://www.di.ens.fr/willow/research/howto100m/), a large-scale corpus of narrated instructional videos. The benchmark focuses on questions that require models or agents to retrieve and reason over video evidence from this corpus. Unlike single-video QA, XVBench tests whether a system can find the relevant visual clips, preserve fine-grained visual details, and answer questions that depend on information distributed across video segments.

## Dataset Summary

- **Task:** open-ended question answering over a large video corpus
- **Language:** English
- **Source video corpus:** [HowTo100M](https://www.di.ens.fr/willow/research/howto100m/)
- **License:** CC BY 4.0

Each example contains a question, a ground-truth answer, the source video identifier, and one or more reference video clips that support the answer.

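As a rough illustration of the record layout described above, the following sketch parses JSONL lines into typed records and checks that each one carries the documented fields (`qid`, `query`, `gt`, `video_name`, `reference`, as listed under Dataset Structure). The `XVBenchExample` dataclass and the `parse_examples` helper are hypothetical names introduced here, and the sample record is made up for demonstration; it is not taken from the benchmark.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, List

# Field names as documented in the Dataset Structure table.
REQUIRED_FIELDS = ("qid", "query", "gt", "video_name", "reference")


@dataclass
class XVBenchExample:
    """Hypothetical container for one line of xvbench.jsonl."""
    qid: str
    query: str
    gt: str
    video_name: str
    reference: List[str]


def parse_examples(lines: Iterable[str]) -> Iterator[XVBenchExample]:
    """Yield one example per non-empty JSONL line, failing on missing fields."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        yield XVBenchExample(**{name: record[name] for name in REQUIRED_FIELDS})


# Made-up record for illustration only; real values come from xvbench.jsonl.
sample_line = json.dumps({
    "qid": "q1",
    "query": "example question",
    "gt": "example answer",
    "video_name": "vid",
    "reference": ["vid_____1.mp4"],
})
examples = list(parse_examples([sample_line]))
```

With a local copy of the annotation file, the same helper can be applied to an open file handle, for example `with open("xvbench.jsonl", encoding="utf-8") as f: examples = list(parse_examples(f))`.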
## Download XVBench

You can load the annotation file with the `datasets` library or clone the Hugging Face dataset repository:

```bash
git lfs install
git clone https://huggingface.co/datasets/Alibaba-NLP/XVBench
```

## Download HowTo100M Video Corpus

XVBench provides question-answer annotations and reference clip filenames. To use the benchmark with the original video evidence, download the corresponding source videos from the [HowTo100M official website](https://www.di.ens.fr/willow/research/howto100m/) and follow the HowTo100M usage instructions. The `video_name` and `reference` fields in `xvbench.jsonl` identify the source video and its supporting clips.

## Video Preprocess

We provide `split_video.sh`, a simplified video preprocessing script that follows the same fixed-duration splitting strategy as the VRAG video corpus pipeline. It converts non-MP4 videos to temporary MP4 files when needed, splits each source video into 60-second clips by default, and writes filenames in the same format as the `reference` field: `video_id_____1.mp4`, `video_id_____2.mp4`, and so on.

```bash
./split_video.sh -i /path/to/howto100m/videos -o ./video_chunks -d 60
```

## Dataset Structure

The dataset is provided as `xvbench.jsonl`. Each line is a JSON object with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| `qid` | string | Unique question identifier. |
| `query` | string | Natural-language question. |
| `gt` | string | Ground-truth answer. |
| `video_name` | string | Identifier of the source video. |
| `reference` | list[string] | Supporting video clip filename(s) for the question. |

## Citation

If you use this dataset, please cite the accompanying paper:

```bibtex
@article{wang2026vimrag,
  title={VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph},
  author={Wang, Qiuchen and Wang, Shihang and Zeng, Yu and Zhang, Qiang and Zhang, Fanrui and Guo, Zhuoning and Zhang, Bosi and Huang, Wenxuan and Chen, Lin and Chen, Zehui and others},
  journal={arXiv preprint arXiv:2602.12735},
  year={2026}
}
```
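For readers who want to map a `reference` clip filename back to a position in its source video, the naming convention from the Video Preprocess section (`video_id_____1.mp4`, 60-second clips by default) can be sketched as follows. This is a sketch under stated assumptions, not the authors' tooling: it assumes the separator is exactly five underscores, that clip numbering starts at 1, and that clips are contiguous and of fixed duration. `ClipRef`, `parse_clip_name`, and `clip_window` are hypothetical names introduced here.

```python
from typing import NamedTuple, Tuple

# ASSUMPTION: five underscores separate the video id from the clip index,
# as in the documented pattern `video_id_____1.mp4`.
CLIP_SEPARATOR = "_____"


class ClipRef(NamedTuple):
    video_id: str
    index: int


def parse_clip_name(filename: str) -> ClipRef:
    """Split a clip filename into (video_id, clip index)."""
    stem, dot, _extension = filename.rpartition(".")
    if not dot:
        raise ValueError(f"no file extension in {filename!r}")
    # rpartition uses the rightmost separator, so ids that themselves
    # contain underscores are kept intact.
    video_id, separator, index_text = stem.rpartition(CLIP_SEPARATOR)
    if not separator or not video_id or not index_text.isdecimal():
        raise ValueError(f"unexpected clip filename: {filename!r}")
    return ClipRef(video_id, int(index_text))


def clip_window(index: int, clip_seconds: int = 60) -> Tuple[int, int]:
    """Approximate (start, end) in seconds covered by a clip.

    ASSUMPTION: indices are 1-based and clips are contiguous, so clip k
    covers [(k - 1) * clip_seconds, k * clip_seconds). The final clip of a
    video may be shorter than the returned window.
    """
    if index < 1:
        raise ValueError("clip index is assumed to start at 1")
    start = (index - 1) * clip_seconds
    return start, start + clip_seconds


clip = parse_clip_name("video_id_____2.mp4")
window = clip_window(clip.index)
```

If the clips were produced with a different `-d` value, pass that duration as `clip_seconds` so the window matches the actual split.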
