CaReBench
Source: ModelScope community (魔搭社区) · Updated 2025-12-05 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/MCG-NJU/CaReBench
Resource overview:
<div align="center">
<h1 style="margin: 0">
<img src="assets/logo.png" style="width:1.5em; vertical-align: middle; display: inline-block; margin: 0" alt="Logo">
<span style="vertical-align: middle; display: inline-block; margin: 0"><b>CaReBench: A Fine-grained Benchmark for Video Captioning and Retrieval</b></span>
</h1>
<p style="margin: 0">
Yifan Xu, <a href="https://scholar.google.com/citations?user=evR3uR0AAAAJ">Xinhao Li</a>, Yichun Yang, Desen Meng, Rui Huang, <a href="https://scholar.google.com/citations?user=HEuN8PcAAAAJ">Limin Wang</a>
</p>
<p align="center">
🤗 <a href="https://huggingface.co/MCG-NJU/CaRe-7B">Model</a>    |    🤗 <a href="https://huggingface.co/datasets/MCG-NJU/CaReBench">Data</a>   |    📑 <a href="https://arxiv.org/pdf/2501.00513">Paper</a>   
</p>
</div>

## 📝 Introduction
**🌟 CaReBench** is a fine-grained benchmark comprising **1,000 high-quality videos** with detailed human-annotated captions, including **manually separated spatial and temporal descriptions** for independent spatiotemporal bias evaluation.

**📊 ReBias and CapST Metrics** are designed specifically for retrieval and captioning tasks, providing a comprehensive evaluation framework for spatiotemporal understanding in video-language models.

**⚡ CaRe: A Unified Baseline** for fine-grained video retrieval and captioning, achieving competitive performance through **two-stage Supervised Fine-Tuning (SFT)**. CaRe excels in both generating detailed video descriptions and extracting robust video features.

**🚀 State-of-the-art performance** on both detailed video captioning and fine-grained video retrieval. CaRe outperforms CLIP-based models on retrieval and popular multimodal large language models (MLLMs) on captioning.
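
For readers new to retrieval evaluation, the sketch below shows how text-to-video retrieval is commonly scored with Recall@K over paired embeddings. It is a generic illustration, not the ReBias or CapST metric and not CaRe's evaluation code. The function names (`cosine`, `recall_at_k`) and the toy embeddings are made up for this example, and the pairing convention (caption `i` belongs to video `i`) is an assumption.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(
    text_embs: List[Sequence[float]],
    video_embs: List[Sequence[float]],
    k: int,
) -> float:
    """Fraction of captions whose paired video (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(text_embs):
        scores = [cosine(query, v) for v in video_embs]
        # Rank video indices from most to least similar to this caption.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy 2-D embeddings, invented for illustration only.
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_embs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]

print(recall_at_k(text_embs, video_embs, k=1))
print(recall_at_k(text_embs, video_embs, k=2))
```

In this toy setup the second caption's paired video is outranked by another video at K=1 but recovered at K=2, which is the kind of near-miss that fine-grained captions are meant to expose.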


Provider:
maas
Created:
2025-12-04



