UVRB
Source: ModelScope community (魔搭社区) · Updated 2026-05-16 · Indexed 2025-11-08
Download link: https://modelscope.cn/datasets/iic/UVRB
Description:
# 🌐 Universal Video Retrieval Benchmark (UVRB)
> **The first comprehensive benchmark for universal video retrieval**
> Evaluate your model across **16 datasets**, **3 query types**, and **6 capability dimensions** to learn not just how accurate it is, but *why* it succeeds or fails.
UVRB is a comprehensive evaluation suite designed to **diagnose and quantify** a video embedding model's generalization ability beyond narrow text-to-video tasks. It exposes gaps in spatial reasoning, temporal dynamics, compositional understanding, and long-context retrieval that traditional benchmarks (e.g., MSRVTT) do not measure.
---
## 📊 Benchmark Structure
UVRB evaluates **9 core abilities** (3 query types and 6 data domains) across **16 datasets**:
### 🔹 By Query Type
- **TXT**: Text-to-Video (e.g., MSRVTT, CRB-T)
- **CMP**: Composed Query (Text + Image/Video → Video) (e.g., MS-TI, MS-TV)
- **VIS**: Visual Query (Image/Clip → Video) (e.g., MSRVTT-I2V, LoVR-C2V)
### 🔹 By Data Domain
- **CG**: Coarse-grained (high-level semantics)
- **FG**: Fine-grained
  - **S**: Spatial (object appearance & layout)
  - **T**: Temporal (event dynamics & sequence)
  - **PR**: Partially Relevant (keywords, themes, abstract cues)
- **LC**: Long-context (videos > 10 mins, captions > 1K words)
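
The taxonomy above can be written down as a small lookup table, which is handy when grouping per-dataset scores by ability. The sketch below is illustrative only: the dataset-to-tag assignment is transcribed from the group headings of the statistics table in the Dataset Overview section, and the names `UVRB_TASKS` and `group_by_ability` are hypothetical, not part of any official UVRB tooling. Composed and visual datasets are given no domain tag here because the card does not list one for them.

```python
from collections import defaultdict

# Dataset abbreviation -> (query type, data-domain tags).
# Transcribed from the statistics table; hypothetical helper structure.
UVRB_TASKS = {
    "MSRVTT": ("TXT", ("CG",)),
    "DiDeMo": ("TXT", ("CG",)),
    "CRB-G": ("TXT", ("CG",)),
    "CRB-S": ("TXT", ("FG", "S")),
    "VDC-O": ("TXT", ("FG", "S")),
    "CRB-T": ("TXT", ("FG", "T")),
    "CMRB": ("TXT", ("FG", "T")),
    "DREAM-E": ("TXT", ("FG", "PR")),
    "LoVR-TH": ("TXT", ("FG", "PR")),
    "PEV-K": ("TXT", ("FG", "PR")),
    "LoVR-V": ("TXT", ("LC",)),
    "VDC-D": ("TXT", ("LC",)),
    "MS-TI": ("CMP", ()),       # no domain tag listed in the card
    "MS-TV": ("CMP", ()),
    "MSRVTT-I2V": ("VIS", ()),
    "LoVR-C2V": ("VIS", ()),
}


def group_by_ability(tasks):
    """Map each ability tag (query type or data domain) to its datasets."""
    groups = defaultdict(list)
    for name, (query_type, domains) in tasks.items():
        groups[query_type].append(name)
        for domain in domains:
            groups[domain].append(name)
    return dict(groups)


groups = group_by_ability(UVRB_TASKS)
```

With this mapping, `groups` has one key per ability (three query types plus six data domains), and each value lists the datasets that contribute to it.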
---
## 📥 Dataset Overview
### Statistics of Datasets in UVRB
All videos use **8 uniformly sampled frames**.
- **# Query**: number of queries
- **# Corpus**: number of corpus items
- **Dur (s)**: average video duration in seconds
- **# Word**: average text length in words (`–` means no text)
| Dataset | # Query | # Corpus | Dur (s) | # Word |
|--------|--------:|---------:|--------:|-------:|
| **Textual Video Retrieval (Coarse-grained)** | | | | |
| MSRVTT | 1,000 | 1,000 | 15.0 | 9.4 |
| DiDeMo | 1,004 | 1,004 | 53.9 | 29.1 |
| CaReBench-General (CRB-G) | 1,000 | 1,000 | 14.4 | 232.2 |
| **Textual Video Retrieval (Fine-grained)** | | | | |
| *(a) Spatial* | | | | |
| CaReBench-Spatial (CRB-S) | 1,000 | 1,000 | 14.4 | 115.0 |
| VDC-Object (VDC-O) | 1,027 | 1,027 | 30.1 | 91.4 |
| *(b) Temporal* | | | | |
| CaReBench-Temporal (CRB-T) | 1,000 | 1,000 | 14.4 | 103.2 |
| CameraBench (CMRB) | 728 | 1,071 | 5.7 | 24.8 |
| *(c) Partially Relevant* | | | | |
| DREAM-1K-Event (DREAM-E) | 6,251 | 1,000 | 8.8 | 6.5 |
| LoVR-Theme2Clip (LoVR-TH) | 8,854 | 8,854 | 16.9 | 48.1 |
| PE-Video-Keyword (PEV-K) | 14,427 | 15,000 | 16.9 | 45.5 |
| **Textual Video Retrieval (Long-context)** | | | | |
| LoVR-Text2Video (LoVR-V) | 100 | 467 | 1,560.3 | 17,364.5 |
| VDC-Detail (VDC-D) | 1,000 | 1,027 | 30.1 | 508.0 |
| **Composed Video Retrieval** | | | | |
| MomentSeeker-Text-Image (MS-TI) | 400 | 10 | 13.5 | 68.5 |
| MomentSeeker-Text-Video (MS-TV) | 400 | 10 | 13.5 | 68.5 |
| **Visual Video Retrieval** | | | | |
| MSRVTT-ImageVideo (MSRVTT-I2V) | 1,000 | 1,000 | 15.0 | – |
| LoVR-Clip-to-Video (LoVR-C2V) | 467 | 467 | 1,560.3 | – |
> ✅ All datasets use **8 uniformly sampled frames**
> ✅ No audio, speech, or metadata — pure vision-language evaluation
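
The card states that every video is represented by 8 uniformly sampled frames but does not spell out the exact sampling rule. As a minimal sketch, assuming the common "midpoint of equal segments" convention, the frame indices could be computed as follows. The function name `uniform_frame_indices` is hypothetical, and the benchmark's own preprocessing may differ in rounding or endpoint handling.

```python
def uniform_frame_indices(num_frames_total, num_samples=8):
    """Return `num_samples` frame indices spread evenly across a video.

    Assumption: the video is split into `num_samples` equal-length segments
    and the middle frame of each segment is taken.
    """
    if num_frames_total <= 0:
        raise ValueError("video has no frames")
    segment = num_frames_total / num_samples
    return [
        # Clamp to the last valid index to stay in range for short videos.
        min(num_frames_total - 1, int(segment * (i + 0.5)))
        for i in range(num_samples)
    ]
```

For videos with fewer than 8 frames this sketch repeats indices rather than failing, so the output always has exactly `num_samples` entries.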
---
## 🛠️ How to Use
Each dataset folder contains two or three subfolders:
- **jsonl**: the original dataset files in JSONL format
  - `corpus.jsonl`: the corpus items
  - `queries.jsonl`: the query items
  - `instances.jsonl`: the matching relationships between queries and corpus items
- **videos**: the video files of the corpus candidates (plus the query clips for LoVR-C2V)
- **images** (only for text-image-to-video and image-to-video tasks): the image files of the query items
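
The folder layout above can be read with a few lines of standard-library Python. This is a sketch rather than an official loader: the helper names `read_jsonl` and `load_dataset_folder` are hypothetical, and because the card does not document the fields inside each JSONL record, the code makes no assumption about them and simply returns the parsed records.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_dataset_folder(dataset_dir):
    """Load corpus, queries, and instances from `<dataset_dir>/jsonl`."""
    jsonl_dir = Path(dataset_dir) / "jsonl"
    return {
        name: read_jsonl(jsonl_dir / f"{name}.jsonl")
        for name in ("corpus", "queries", "instances")
    }
```

After loading, inspecting the keys of the first record in each list (for example `data["queries"][0].keys()`) is a reasonable way to discover the actual field names before wiring the data into an evaluation script.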
---
## 📚 Citation
```bibtex
@misc{guo2025gve,
title={Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum},
author={Zhuoning Guo and Mingxin Li and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Xiaowen Chu},
year={2025},
eprint={2510.27571},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.27571},
}
```
Provider: maas
Created: 2025-11-01



