
UVRB

ModelScope community | Updated 2026-05-16 | Indexed 2025-11-08
Download link:
https://modelscope.cn/datasets/iic/UVRB
Resource description:
# 🌐 Universal Video Retrieval Benchmark (UVRB)

> **The first comprehensive benchmark for universal video retrieval**
>
> Evaluate your model across **16 datasets**, **3 query types**, and **6 capability dimensions**: not just accuracy, but *why* it succeeds or fails.

UVRB is a comprehensive evaluation suite designed to **diagnose and quantify** a video embedding model's true generalization ability, beyond narrow text-to-video tasks. It exposes critical gaps in spatial reasoning, temporal dynamics, compositional understanding, and long-context retrieval that traditional benchmarks (e.g., MSRVTT) completely miss.

---

## 📊 Benchmark Structure

UVRB evaluates **9 core abilities** across **16 datasets**:

### 🔹 By Query Type

- **TXT**: Text-to-Video (e.g., MSRVTT, CRB-T)
- **CMP**: Composed Query (Text + Image/Video → Video) (e.g., MS-TI, MS-TV)
- **VIS**: Visual Query (Image/Clip → Video) (e.g., MSRVTT-I2V, LoVR-C2V)

### 🔹 By Data Domain

- **CG**: Coarse-grained (high-level semantics)
- **FG**: Fine-grained
- **S**: Spatial (object appearance & layout)
- **T**: Temporal (event dynamics & sequence)
- **PR**: Partially Relevant (keywords, themes, abstract cues)
- **LC**: Long-context (videos > 10 mins, captions > 1K words)

---

## 📥 Dataset Overview

### Statistics of Datasets in UVRB

All videos use **8 uniformly sampled frames**.
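The README states that every video is represented by 8 uniformly sampled frames but does not say how the frame positions are chosen. The sketch below shows one common convention (the midpoint of each of 8 equal-length segments) purely as an illustration. The function name `uniform_frame_indices` and the midpoint rule are assumptions, not the benchmark's documented procedure.

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 8) -> list[int]:
    """Return `num_samples` frame indices spread evenly over a video.

    Assumption: the video is split into `num_samples` equal segments and the
    midpoint of each segment is taken. UVRB may use a different rule.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    segment = total_frames / num_samples
    # Clamp to the last valid index so very short videos never overflow.
    return [min(total_frames - 1, int(segment * (i + 0.5))) for i in range(num_samples)]


if __name__ == "__main__":
    # An 80-frame clip yields one index from the middle of each 10-frame segment.
    print(uniform_frame_indices(80))
```

For videos with fewer frames than samples, this rule repeats indices instead of failing, which keeps the input length fixed at 8.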
- **# Query**: number of queries
- **# Corpus**: number of corpus items
- **Dur (s)**: average video duration in seconds
- **# Word**: average text length in words (`-` means no text)

| Dataset | # Query | # Corpus | Dur (s) | # Word |
|--------|--------:|---------:|--------:|-------:|
| **Textual Video Retrieval (Coarse-grained)** | | | | |
| MSRVTT | 1,000 | 1,000 | 15.0 | 9.4 |
| DiDeMo | 1,004 | 1,004 | 53.9 | 29.1 |
| CaReBench-General (CRB-G) | 1,000 | 1,000 | 14.4 | 232.2 |
| **Textual Video Retrieval (Fine-grained)** | | | | |
| *(a) Spatial* | | | | |
| CaReBench-Spatial (CRB-S) | 1,000 | 1,000 | 14.4 | 115.0 |
| VDC-Object (VDC-O) | 1,027 | 1,027 | 30.1 | 91.4 |
| *(b) Temporal* | | | | |
| CaReBench-Temporal (CRB-T) | 1,000 | 1,000 | 14.4 | 103.2 |
| CameraBench (CMRB) | 728 | 1,071 | 5.7 | 24.8 |
| *(c) Partially Relevant* | | | | |
| DREAM-1K-Event (DREAM-E) | 6,251 | 1,000 | 8.8 | 6.5 |
| LoVR-Theme2Clip (LoVR-TH) | 8,854 | 8,854 | 16.9 | 48.1 |
| PE-Video-Keyword (PEV-K) | 14,427 | 15,000 | 16.9 | 45.5 |
| **Textual Video Retrieval (Long-context)** | | | | |
| LoVR-Text2Video (LoVR-V) | 100 | 467 | 1,560.3 | 17,364.5 |
| VDC-Detail (VDC-D) | 1,000 | 1,027 | 30.1 | 508.0 |
| **Composed Video Retrieval** | | | | |
| MomentSeeker-Text-Image (MS-TI) | 400 | 10 | 13.5 | 68.5 |
| MomentSeeker-Text-Video (MS-TV) | 400 | 10 | 13.5 | 68.5 |
| **Visual Video Retrieval** | | | | |
| MSRVTT-ImageVideo (MSRVTT-I2V) | 1,000 | 1,000 | 15.0 | - |
| LoVR-Clip-to-Video (LoVR-C2V) | 467 | 467 | 1,560.3 | - |

> ✅ All datasets use **8 uniformly sampled frames**
>
> ✅ No audio, speech, or metadata: pure vision-language evaluation

---

## 🛠️ How to Use

Each dataset folder contains two or three sub-folders:

- **jsonl**: the original dataset files in `jsonl` format
  - `corpus.jsonl`: the corpus items
  - `queries.jsonl`: the query items
  - `instances.jsonl`: the matching relationships between queries and corpus items
- **videos**: the video files of the corpus candidates (and the query clips for LoVR-C2V)
- **images** (only for text-image-to-video and image-to-video tasks): the image files of the query items

---

## 📚 Citation

```bibtex
@misc{guo2025gve,
  title={Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum},
  author={Zhuoning Guo and Mingxin Li and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Xiaowen Chu},
  year={2025},
  eprint={2510.27571},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.27571},
}
```
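The folder layout described under "How to Use" (a `jsonl` sub-folder holding `corpus.jsonl`, `queries.jsonl`, and `instances.jsonl`) can be read with a few lines of standard-library Python. This is a minimal sketch, not an official loader. The helper names are hypothetical, and the field names `query_id` and `corpus_id` are assumptions, because the README does not document the record schema. Inspect a record first and adjust the keys accordingly.

```python
import json
from collections import defaultdict
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_dataset(dataset_dir):
    """Load the three files the README lists under `<dataset>/jsonl/`."""
    jsonl_dir = Path(dataset_dir) / "jsonl"
    names = ("corpus", "queries", "instances")
    return {name: read_jsonl(jsonl_dir / f"{name}.jsonl") for name in names}


def build_relevance(instances, query_key="query_id", corpus_key="corpus_id"):
    """Map each query to the set of corpus items it matches.

    Assumption: each instance record holds one query identifier and one
    corpus identifier under the given keys. These key names are placeholders.
    """
    relevant = defaultdict(set)
    for row in instances:
        relevant[row[query_key]].add(row[corpus_key])
    return dict(relevant)
```

A typical call would be `data = load_dataset("path/to/one/dataset")` followed by `print(data["instances"][0])` to check the actual field names before building the relevance map.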

Provider:
maas
Created:
2025-11-01