UVRB

Name: UVRB
Creator: maas
Published: 2026-05-16 10:43:17
License: No description provided

Source: ModelScope community. Updated: 2026-05-16. Indexed: 2025-11-08.

Download link:

https://modelscope.cn/datasets/iic/UVRB

Description:

# 🌐 Universal Video Retrieval Benchmark (UVRB)

> **The first comprehensive benchmark for universal video retrieval**
> Evaluate your model across **16 datasets**, **3 query types**, and **6 capability dimensions**: not just accuracy, but *why* it succeeds or fails.

UVRB is a comprehensive evaluation suite designed to **diagnose and quantify** a video embedding model's true generalization ability, beyond narrow text-to-video tasks. It exposes critical gaps in spatial reasoning, temporal dynamics, compositional understanding, and long-context retrieval that traditional benchmarks (e.g., MSRVTT) completely miss.

---

## 📊 Benchmark Structure

UVRB evaluates **9 core abilities** across **16 datasets**:

### 🔹 By Query Type

- **TXT**: Text-to-Video (e.g., MSRVTT, CRB-T)
- **CMP**: Composed Query (Text + Image/Video → Video) (e.g., MS-TI, MS-TV)
- **VIS**: Visual Query (Image/Clip → Video) (e.g., MSRVTT-I2V, LoVR-C2V)

### 🔹 By Data Domain

- **CG**: Coarse-grained (high-level semantics)
- **FG**: Fine-grained
  - **S**: Spatial (object appearance & layout)
  - **T**: Temporal (event dynamics & sequence)
  - **PR**: Partially Relevant (keywords, themes, abstract cues)
- **LC**: Long-context (videos > 10 mins, captions > 1K words)

---

## 📥 Dataset Overview

### Statistics of Datasets in UVRB

All videos use **8 uniformly sampled frames**.

- **# Query**: number of queries
- **# Corpus**: number of corpus items
- **Dur (s)**: average video duration in seconds
- **# Word**: average text length in words (`–` means no text)

| Dataset | # Query | # Corpus | Dur (s) | # Word |
|--------|--------:|---------:|--------:|-------:|
| **Textual Video Retrieval (Coarse-grained)** | | | | |
| MSRVTT | 1,000 | 1,000 | 15.0 | 9.4 |
| DiDeMo | 1,004 | 1,004 | 53.9 | 29.1 |
| CaReBench-General (CRB-G) | 1,000 | 1,000 | 14.4 | 232.2 |
| **Textual Video Retrieval (Fine-grained)** | | | | |
|   *(a) Spatial* | | | | |
| CaReBench-Spatial (CRB-S) | 1,000 | 1,000 | 14.4 | 115.0 |
| VDC-Object (VDC-O) | 1,027 | 1,027 | 30.1 | 91.4 |
|   *(b) Temporal* | | | | |
| CaReBench-Temporal (CRB-T) | 1,000 | 1,000 | 14.4 | 103.2 |
| CameraBench (CMRB) | 728 | 1,071 | 5.7 | 24.8 |
|   *(c) Partially Relevant* | | | | |
| DREAM-1K-Event (DREAM-E) | 6,251 | 1,000 | 8.8 | 6.5 |
| LoVR-Theme2Clip (LoVR-TH) | 8,854 | 8,854 | 16.9 | 48.1 |
| PE-Video-Keyword (PEV-K) | 14,427 | 15,000 | 16.9 | 45.5 |
| **Textual Video Retrieval (Long-context)** | | | | |
| LoVR-Text2Video (LoVR-V) | 100 | 467 | 1,560.3 | 17,364.5 |
| VDC-Detail (VDC-D) | 1,000 | 1,027 | 30.1 | 508.0 |
| **Composed Video Retrieval** | | | | |
| MomentSeeker-Text-Image (MS-TI) | 400 | 10 | 13.5 | 68.5 |
| MomentSeeker-Text-Video (MS-TV) | 400 | 10 | 13.5 | 68.5 |
| **Visual Video Retrieval** | | | | |
| MSRVTT-ImageVideo (MSRVTT-I2V) | 1,000 | 1,000 | 15.0 | – |
| LoVR-Clip-to-Video (LoVR-C2V) | 467 | 467 | 1,560.3 | – |

> ✅ All datasets use **8 uniformly sampled frames**
> ✅ No audio, speech, or metadata: pure vision-language evaluation

---

## 🛠️ How to Use

Each dataset folder contains two or three sub-folders:

- **jsonl**: the original dataset files in `jsonl` format
  - `corpus.jsonl`: the corpus items
  - `queries.jsonl`: the query items
  - `instances.jsonl`: the matching relationships between queries and corpus items
- **videos**: the video files of the corpus candidates (plus the query clips for LoVR-C2V)
- **images** (only for text-image-to-video and image-to-video tasks): the image files of the query items

---

## 📚 Citation

```bibtex
@misc{guo2025gve,
  title={Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum},
  author={Zhuoning Guo and Mingxin Li and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Xiaowen Chu},
  year={2025},
  eprint={2510.27571},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.27571},
}
```
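As a rough illustration of the folder layout described under "How to Use", the Python sketch below reads the three `jsonl` files of one dataset folder and computes 8 frame indices for a video. It is not the authors' loader. The function names `read_jsonl`, `load_dataset_folder`, and `uniform_frame_indices` are hypothetical, the record fields are left untouched because the README does not document their schema, and the segment-midpoint rule is only one common way to sample frames uniformly; the exact rule UVRB uses is not stated here.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_dataset_folder(root):
    """Load corpus, queries, and instances from <root>/jsonl/.

    Assumes the layout described in the README. Records are returned
    as-is, since their field names are not documented there.
    """
    jsonl_dir = Path(root) / "jsonl"
    return {
        name: read_jsonl(jsonl_dir / f"{name}.jsonl")
        for name in ("corpus", "queries", "instances")
    }


def uniform_frame_indices(num_frames, num_samples=8):
    """Return frame indices at the midpoints of num_samples equal segments.

    Assumption: midpoint sampling is an illustrative choice, not
    necessarily the sampling rule used to build UVRB.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
        for i in range(num_samples)
    ]
```

Calling `load_dataset_folder` on a downloaded dataset directory returns a dict with the keys `corpus`, `queries`, and `instances`; inspecting the first record of each list is a reasonable way to discover the actual field names before writing an evaluation loop.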


Provider: maas

Created: 2025-11-01
