PLM-VideoBench

Name: PLM-VideoBench
Creator: maas
Published: 2025-12-05 16:35:13
License: No description available

ModelScope community: updated 2025-12-05, indexed 2025-05-24

Download link:

https://modelscope.cn/datasets/facebook/PLM-VideoBench

Resource description:

### Dataset Summary

PLM-VideoBench is a collection of human-annotated resources for evaluating vision-language models, focused on detailed video understanding.

[\[📃 Tech Report\]](https://arxiv.org/abs/2504.13180)
[\[📂 GitHub\]](https://github.com/facebookresearch/perception_models/)

<img src="https://huggingface.co/datasets/facebook/PLM-VideoBench/resolve/main/assets/plm_videobench.png" style="width: 100%; margin: 0 auto; display: block;" />

### Supported Tasks

PLM-VideoBench includes evaluation data for the following tasks:

#### FGQA

In this task, a model must answer a multiple-choice question (MCQ) that probes fine-grained activity understanding. Given a question and multiple options that differ in a fine-grained detail (e.g., painting vertically vs. horizontally), the model must select the correct answer.

To reduce bias, we follow prior work and report multi-binary accuracy (MBAcc). Specifically, each question is split into multiple binary-choice questions, where the correct answer is compared with one distractor at a time; a prediction is considered correct only when the correct answer is consistently selected across all binary comparisons.

Data fields are:
- `uid`: a `string` feature, unique identifier for the binary question.
- `qa_uid`: a `string` feature, unique identifier for the Q&A sample.
- `video`: a `string` feature, unique identifier for the video segment.
- `question`: a `string` feature, the question about the video segment.
- `answer`: a `string` feature, the ground truth answer to the question.
- `options`: a `struct` feature representing the two potential answers to the binary question.
- `answer_index`: an `int32` feature, the index of the correct answer within the options.
- `metadata`: a `dict` of features, representing metadata about the video segment and Q&A pair:
  - `source_dataset`: a `string` feature, name of the source dataset.
  - `source_video_id`: a `string` feature, video id of the untrimmed source video.
  - `source_start_time`: a `float` feature, denoting the start time (seconds) of the video segment in the source video.
  - `source_end_time`: a `float` feature, denoting the end time (seconds) of the video segment in the source video.
  - `question_type`: a `string` feature, denoting the question type.
  - `source_domain`: a `string` feature, denoting the video domain.

An example from FGQA looks as follows:
```
{
  "uid": "ced44497-11d4-4fb9-bcf3-0fa5924c1401",
  "qa_uid": "7fcbd367-fdcf-4de5-97de-42496d1f0520",
  "video": "segment_b33e3b27-0127-492f-a9f3-f04e7ac6006e.mp4",
  "question": "What is the state and location of the butter at the beginning of the step?",
  "answer": "The butter is partly melted inside a saucepan on the stove's bottom left burner.",
  "options": {
    "option_0": "The butter is partly melted inside a saucepan on the stove's bottom left burner.",
    "option_1": "The butter is completely melted inside a saucepan on the stove's bottom left burner."
  },
  "answer_index": 0,
  "metadata": {
    "source_dataset": "ht100m",
    "source_video_id": "1gkuLOJxaa8",
    "source_start_time": 30.74,
    "source_end_time": 426.61,
    "question_type": "Object State",
    "source_domain": "Cooking and Recipes"
  }
}
```

The `source_video_id`, `source_start_time` and `source_end_time` fields of each sample can be used to obtain the segments from each source dataset (specified in `source_dataset`).

Note: For EgoExo4d segments, information for the view (camera name) corresponding to each sample can be found in `metadata/fgqa_test_egoexo4d_segment2cam.csv`.

Our annotations contain ground-truth segments from COIN, Ego4d, EgoExo4d, CrossTask and YouCook2, as well as auto-generated segments from HT100M.

#### SGQA

In this task, a model must answer open-ended questions about activities and objects visible in an egocentric video stream recorded by a smart-glasses device.
The questions are designed to simulate real-world scenarios where a user would ask for assistance from their smart glasses, such as "which of these two jackets would look better with this pair of shoes?" or "does this pasta look strained enough to you?". The source videos used to construct this benchmark component were independently collected and are not based on existing publicly available data. To evaluate performance we use LLM-judge accuracy.

Data fields are:
- `uid`: a `string` feature, unique identifier for the question.
- `video`: a `string` feature, unique identifier for the video segment.
- `question`: a `string` feature, the question about the video segment.
- `answer`: a `string` feature, the ground truth answer to the question.
- `domain`: a `string` feature, video domain.

An example from SGQA looks as follows:
```
{
  "uid": 0,
  "video": "dee38522f7ad7a55_481_509.mp4",
  "question": "Am I focusing my gaze in the right place for this movement?",
  "answer": "You are focusing on your right side, which improves balance and stability. "
}
```

#### RCap

In this task, the model must generate a detailed description of an event involving a subject of interest in the video. Given a region mask and a specified time interval, the model is required to output a caption that accurately describes the event occurring within that interval. The test set contains 10060 instances. We report LLM-judge accuracy to assess the quality of the generated captions.

Data fields are:
- `uid`: an `int32` feature, unique identifier for the sample.
- `video`: a `string` feature, the video name.
- `masklet_id`: an `int32` feature, unique identifier for the input masklet within the video.
- `total_frames`: an `int32` feature, number of video frames.
- `caption`: a `string` feature, the caption describing the actions of the subject/object highlighted in the masklet within the temporal segment.
- `start_frame`: an `int32` feature, start frame of the temporal segment.
- `end_frame`: an `int32` feature, end frame of the temporal segment.

An example from RCap looks as follows:
```
{
  "uid": 0,
  "video": "01f131a1-a172-47ec-a6b9-251a1290cb7c.mp4",
  "masklet_id": 0,
  "total_frames": 76,
  "caption": "A white goat is grazing the grass with other goats in a rural area.",
  "start_frame": 0,
  "end_frame": 20
}
```

#### RTLoc

In this task, the model must identify the precise time interval within the video when the specified event takes place for the given subject. Given a video, a region masklet and a textual description of the event, the model is required to output the start and end timestamps that correspond to the occurrence of the event. Notably, this task is the inverse of RCap: instead of generating the caption, the model receives it as input and generates the corresponding time interval.

Data fields are:
- `uid`: an `int32` feature, unique identifier for the sample.
- `video`: a `string` feature, the video name.
- `masklet_id`: an `int32` feature, unique identifier for the input masklet within the video.
- `total_frames`: an `int32` feature, number of video frames.
- `caption`: a `string` feature, the caption describing the actions of the subject/object highlighted in the masklet within the temporal segment.
- `start_frame`: an `int32` feature, start frame of the video segment.
- `end_frame`: an `int32` feature, end frame of the video segment.

An example from RTLoc looks as follows:
```
{
  "uid": 0,
  "video": "01f131a1-a172-47ec-a6b9-251a1290cb7c.mp4",
  "masklet_id": 0,
  "total_frames": 76,
  "caption": "A white goat is grazing the grass with other goats in a rural area.",
  "start_frame": 0,
  "end_frame": 20
}
```

#### RDCap

In this task, a model must generate a detailed description of all events involving a specific subject of interest (e.g., a person, animal, or object) in a video.
Given a video and a region masklet, the model must produce a sequence of (start, end, caption) tuples that cover the entire duration of the video, including periods when the subject is not visible. We report SODA score, which leverages an LLM judge to assess the quality of the generated captions.

Data fields are:
- `uid`: an `int32` feature, unique identifier for the sample.
- `video`: a `string` feature, the video name.
- `masklet_id`: an `int32` feature, unique identifier for the input masklet within the video.
- `total_frames`: an `int32` feature, number of video frames.
- `dense_captions`: a `list` of `dict` features, each containing information per event in the video, made up of:
  - `start_frame`: an `int32` feature, start frame of the video segment corresponding to the event.
  - `end_frame`: an `int32` feature, end frame of the video segment corresponding to the event.
  - `caption`: a `string` feature, the caption describing the actions of the subject/object highlighted in the masklet within the temporal segment.

An example from RDCap looks as follows:
```
{
  "uid": 0,
  "video": "0158cd03-2bff-428e-8787-6393f0edf2a4.mp4",
  "masklet_id": 2,
  "total_frames": 73,
  "dense_captions": [
    {"start_frame": 0, "end_frame": 29, "caption": "Out of frame."},
    {"start_frame": 30, "end_frame": 72, "caption": "A boy enters the frame from the right, he wears glasses and turn back and exit from the right side of the frame."}
  ]
}
```

### Evaluation

**Standalone evaluation scripts:** We provide standalone evaluation scripts as reference in [scripts/evaluate_plm.py](scripts/evaluate_plm.py). These require predictions in a specific format per task, provided in each method header. Please install [vllm](https://github.com/vllm-project/vllm) for LLM-judge evaluations. We use Llama-3.3-70B-Instruct as the LLM-judge.
Example usage:
```
python evaluate_plm.py \
    --gt_file {task}/plm_{task}_test.jsonl \
    --pred_file test_predictions.jsonl \
    --task {task} \
    --out_file metrics.json
```
`gt_file` is the path to the task jsonl in the current repo. Results will be saved in `out_file`.

**lmms-eval integration:** Apart from the standalone scripts, we integrate our tasks, models and evaluation code into [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/lmms_eval/tasks/plm_videobench) for easy evaluation.

### Licensing Information

PLM-VideoBench data is released under CC BY 4.0, except the FGQA split, which is an output from Llama 3.2 and is subject to the Llama 3.2 license (https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE). Use of the data to train, fine tune, or otherwise improve an AI model, which is distributed or made available, shall also include "Llama" at the beginning of any such AI model name.

### Citation Information

Cite as:
```
@article{cho2025PerceptionLM,
  title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
  author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}
```
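As an illustration of the multi-binary accuracy (MBAcc) described under FGQA, here is a minimal sketch, not the official implementation in `scripts/evaluate_plm.py`. It groups binary questions by `qa_uid` (as in the FGQA example) and counts an original question as correct only if every one of its binary comparisons is answered correctly. The `pred_index` key and the function name are assumptions made for this sketch; the prediction format expected by the official script may differ.

```python
from collections import defaultdict


def multi_binary_accuracy(rows):
    """Compute MBAcc from per-binary-question rows.

    Each row is a dict with:
      - "qa_uid": id of the original multiple-choice question
      - "answer_index": ground-truth option index (0 or 1)
      - "pred_index": model's chosen option index (hypothetical key)
    """
    # A question starts as correct and is flipped to False by any miss.
    all_correct = defaultdict(lambda: True)
    for row in rows:
        hit = row["pred_index"] == row["answer_index"]
        all_correct[row["qa_uid"]] = all_correct[row["qa_uid"]] and hit
    if not all_correct:
        return 0.0
    return sum(all_correct.values()) / len(all_correct)


rows = [
    {"qa_uid": "q1", "answer_index": 0, "pred_index": 0},
    {"qa_uid": "q1", "answer_index": 1, "pred_index": 1},
    {"qa_uid": "q2", "answer_index": 0, "pred_index": 0},
    {"qa_uid": "q2", "answer_index": 1, "pred_index": 0},  # one miss fails q2
]
print(multi_binary_accuracy(rows))  # 0.5
```

Plain per-row accuracy on the same rows would be 0.75; the stricter grouping is what penalizes a model that picks the right answer against some distractors but not others.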
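The RDCap description states that the (start, end, caption) tuples cover the entire duration of the video. A small sanity check for that property could look like the sketch below. It assumes inclusive frame indices and contiguous, non-overlapping segments, which is what the RDCap example suggests (frames 0 to 29 and 30 to 72 for `total_frames` of 73) but is not stated explicitly in the card. The function name is hypothetical and the second caption is abbreviated.

```python
def covers_whole_video(sample):
    """Check that dense captions tile frames 0..total_frames-1 with no gaps or overlaps.

    Assumes inclusive start/end frame indices (an assumption inferred
    from the RDCap example, not a documented guarantee).
    """
    segments = sorted(sample["dense_captions"], key=lambda s: s["start_frame"])
    next_frame = 0
    for seg in segments:
        # Each segment must begin exactly where the previous one ended.
        if seg["start_frame"] != next_frame or seg["end_frame"] < seg["start_frame"]:
            return False
        next_frame = seg["end_frame"] + 1
    return next_frame == sample["total_frames"]


sample = {
    "total_frames": 73,
    "dense_captions": [
        {"start_frame": 0, "end_frame": 29, "caption": "Out of frame."},
        {"start_frame": 30, "end_frame": 72, "caption": "A boy enters the frame from the right."},
    ],
}
print(covers_whole_video(sample))  # True
```

The same check can be applied to model predictions before scoring, to catch outputs that leave part of the video undescribed.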


Provider: maas

Created: 2025-05-20