Download link:

https://modelscope.cn/datasets/DAMO-NLP-SG/VideoRefer-Bench


Resource overview:

<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/64a3fe3dde901eb01df12398/ZrZPYT0Q3wgza7Vc5BmyD.png" width="100%" style="margin-bottom: 0.2;"/>
</p>

<h3 align="center"><a href="https://arxiv.org/abs/2406.07476" style="color:#4D2B24">VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM</a></h3>

<h5 align="center">If you like our project, please give us a star ⭐ on <a href="https://github.com/DAMO-NLP-SG/VideoRefer">GitHub</a> for the latest updates.</h5>

`VideoRefer-Bench` is a comprehensive benchmark for evaluating a model's object-level video understanding capabilities. It consists of two sub-benchmarks: `VideoRefer-Bench-D` and `VideoRefer-Bench-Q`.

<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/64a3fe3dde901eb01df12398/u3rzhi3u5ST1Me3mZy5Pp.png" width="100%" style="margin-bottom: 0.2;"/>
</p>

## VideoRefer-Bench-D

The benchmark is designed to evaluate the description generation performance of video-based referring models. It comprises 400 curated data entries. We built the test set from Panda-70M using an automatic pipeline, followed by a meticulous human check.

This benchmark covers four key aspects:

1. **Subject Correspondence (SC)**: This dimension evaluates whether the subject of the generated description accurately corresponds to the one specified in the ground truth.
2. **Appearance Description (AD)**: This criterion assesses the accuracy of appearance-related details, including color, shape, texture, and other relevant visual attributes.
3. **Temporal Description (TD)**: This aspect analyzes whether the described motion of the object is consistent with its actual movements.
4. **Hallucination Detection (HD)**: This facet identifies discrepancies by determining whether the generated description includes any facts, actions, or elements absent from reality, such as imaginative interpretations or incorrect inferences.

| Type                   | GPT-4o        | InternVL2-26B | Qwen2-VL-7B | Elysium    | Artemis | VideoRefer        |
| ---------------------- | ------------- | ------------- | ----------- | ---------- | ------- | ----------------- |
| Subject Correspondence | 3.34/4.15     | 3.55/4.08     | 2.97/3.30   | 2.35/-     | -/3.42  | **4.41**/**4.44** |
| Appearance Description | 2.96/**3.31** | 2.99/3.35     | 2.24/2.54   | 0.30/-     | -/1.34  | **3.27**/3.27     |
| Temporal Description   | 3.01/**3.11** | 2.57/3.08     | 2.03/2.22   | 0.02/-     | -/1.39  | **3.03**/3.10     |
| Hallucination Detection | 2.50/2.43    | 2.25/2.28     | 2.31/2.12   | **3.59**/- | -/2.90  | 2.97/**3.04**     |
| Average                | 2.95/3.25     | 2.84/3.20     | 2.39/2.55   | 1.57/-     | -/2.26  | **3.42**/**3.46** |

### Data Format

For each object, we uniformly sampled 32 frames to generate the corresponding mask.

The benchmark JSON file is organized as follows:

```json
[
    {
        "id": 0,
        "video": "rLlzmcp3J6s_0:01:09.633_0:01:14.333.mp4",
        "caption": "The cub is a smaller, light colored lion. It is lying down and resting its head against the other lion. The cub looks calm and relaxed. It is the lion on the far left side of the frame.",
        "frame_idx": "36",
        "annotation":[
            {
                "2":{
                    "segmentation": {
                    }
                },
                "6":{
                    "segmentation": {
                    }
                },
                ...
            }
        ]
    }
]
```

- `frame_idx`: In single-frame mask mode, only the mask for the frame given by `frame_idx` is used.
- All segmentations are in `RLE` (run-length encoding) format.

## VideoRefer-Bench-Q

The benchmark is designed to evaluate the proficiency of MLLMs in interpreting video objects. It comprises 1,000 high-quality multiple-choice questions.

The benchmark covers five types of questions:

1. Basic Questions
2. Sequential Questions
3. Relationship Questions
4. Reasoning Questions
5. Future Predictions

| Type                   | GPT-4o   | GPT-4o-mini | InternVL2-26B | Qwen2-VL-7B | VideoRefer |
| ---------------------- | -------- | ----------- | ------------- | ----------- | ---------- |
| Basic Questions        | 62.3     | 57.6        | 58.5          | 62.0        | **75.4**   |
| Sequential Questions   | **74.5** | 67.1        | 63.5          | 69.6        | 68.6       |
| Relationship Questions | **66.0** | 56.5        | 53.4          | 54.9        | 59.3       |
| Reasoning Questions    | 88.0     | 85.9        | 88.0          | 87.3        | **89.4**   |
| Future Predictions     | 73.7     | 75.4        | **78.9**      | 74.6        | 78.1       |
| Average                | 71.3     | 65.8        | 65.0          | 66.0        | **71.9**   |

### Data Format

For each object, we uniformly sampled 32 frames to generate the corresponding mask.

The benchmark JSON file is organized as follows:

```json
[
    {
        "id": 0,
        "video": "DAVIS/JPEGImages/480p/aerobatics",
        "Question": "What is <object3><region> not wearing?",
        "type": "Basic Questions",
        "options": [
            "(A) A helmet",
            "(B) A hat",
            "(C) Sunglasses",
            "(D) A watch"
        ],
        "Answer": "(A) A helmet",
        "frame_idx": "57",
        "annotation":[
            {
                "0":{
                    "segmentation": {
                    }
                },
                "3":{
                    "segmentation": {
                    }
                },
                ...
            }
        ]
    }
]
```

- `frame_idx`: In single-frame mask mode, only the mask for the frame given by `frame_idx` is used.
- All segmentations are in `RLE` (run-length encoding) format.
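To make the `VideoRefer-Bench-Q` entry format more concrete, here is a minimal Python sketch of how such a file might be consumed. It is not the official evaluation code. It assumes that each element of `annotation` corresponds to one object and maps frame-index strings to a `segmentation` record, and that answers always start with a parenthesized letter such as `(A)`. The helper names (`load_entries`, `single_frame_masks`, `option_letter`, `accuracy_by_type`) and the `predictions` dictionary are hypothetical, and the sample entry uses empty placeholder segmentations rather than real RLE data.

```python
import json
import re
from collections import defaultdict


def load_entries(path):
    """Read a benchmark JSON file, which is a list of entry dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def single_frame_masks(entry):
    """Return, per object, the annotation stored under `frame_idx`.

    Assumption: each element of `annotation` maps frame-index strings to
    {"segmentation": ...}. Objects lacking that frame yield None.
    """
    key = str(entry["frame_idx"])
    return [obj.get(key) for obj in entry["annotation"]]


def option_letter(text):
    """Extract the letter from an option such as "(A) A helmet"."""
    match = re.match(r"\s*\(([A-Z])\)", text)
    return match.group(1) if match else None


def accuracy_by_type(entries, predictions):
    """Percentage of correct answers for each question type.

    `predictions` is a hypothetical dict: entry id -> predicted letter.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for entry in entries:
        total[entry["type"]] += 1
        if predictions.get(entry["id"]) == option_letter(entry["Answer"]):
            correct[entry["type"]] += 1
    return {t: 100.0 * correct[t] / total[t] for t in total}


# Tiny in-memory sample shaped like the entry shown above.
# The "57" key and the empty segmentation dicts are placeholders.
sample = [{
    "id": 0,
    "video": "DAVIS/JPEGImages/480p/aerobatics",
    "Question": "What is <object3><region> not wearing?",
    "type": "Basic Questions",
    "options": ["(A) A helmet", "(B) A hat", "(C) Sunglasses", "(D) A watch"],
    "Answer": "(A) A helmet",
    "frame_idx": "57",
    "annotation": [{"0": {"segmentation": {}}, "57": {"segmentation": {}}}],
}]

print(single_frame_masks(sample[0]))
print(accuracy_by_type(sample, {0: "A"}))
```

With a real file, `load_entries` would replace the in-memory `sample`, and the records returned by `single_frame_masks` would still need to be decoded from RLE by a suitable library before use.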


Use cases: