PLM-Video-Human
ModelScope community · updated 2026-01-06 · indexed 2025-05-24
Download link:
https://modelscope.cn/datasets/facebook/PLM-Video-Human
Resource description:
# Dataset Card for PLM-Video Human
PLM-Video-Human is a collection of human-annotated resources for training Vision Language Models,
focused on detailed video understanding. Training tasks include Fine-Grained Open-Ended Question Answering (FGQA), Region-based Video Captioning (RCap),
Region-based Dense Video Captioning (RDCap), and Region-based Temporal Localization (RTLoc).
[\[📃 Tech Report\]](https://arxiv.org/abs/2504.13180)
[\[📂 Github\]](https://github.com/facebookresearch/perception_models/)
<img src="https://huggingface.co/datasets/facebook/PLM-Video-Human/resolve/main/assets/plm_video_human.png" style="width: 100%; margin: 0 auto; display: block;" />
## Dataset Structure
### Fine-Grained Question Answering (FGQA)
A video question answering dataset for fine-grained activity understanding. Contains human-annotated/verified answers to model-generated
questions about video clips from open-access video datasets. The questions focus on "what" activities
humans perform and "how" they perform these activities.
Data fields are:
- `qa_id`: a `string` feature, unique identifier for the Q&A sample.
- `segment_id`: a `string` feature, unique identifier for the video segment.
- `question`: a `string` feature, a model-generated question about the video segment.
- `answer`: a `string` feature, human-annotated or human-verified answer to the question.
- `metadata`: a `dict` of features, representing metadata about the video segment and Q&A pair:
  - `source_video_id`: a `string` feature, video id of the untrimmed source video.
  - `source_dataset`: a `string` feature, name of the source dataset.
  - `source_start_time`: a `float` feature, denoting the start time (seconds) of the video segment in the source video.
  - `source_end_time`: a `float` feature, denoting the end time (seconds) of the video segment in the source video.
  - `what_description`: a `string` feature, potential activity name shown in the video (not verified).
  - `q_type`: a `string` feature, question type.
  - `q_subtype`: a `string` feature, question subtype.
  - `domain`: a `string` feature, video domain.
  - `is_audited`: a `bool` feature, whether the sample has passed a quality audit.
A question-answer sample from FGQA looks as follows:
```
{
"qa_id":"130ae268-0ac5-4b41-8f65-137119065d81",
"segment_id":"01651739-6e54-4126-b1b5-fc87f59bda1e",
"question":"What is the initial state of the cabbage before you begin chopping it?",
"answer":"cabbage is half cut already and kept on cutting board before the person begin chopping it",
"metadata":{"source_video_id":"-eyDS81FADw",
"source_dataset":"youcook2",
"source_start_time":62.0,
"source_end_time":77.0,
"what_description":"chop garlic ginger cabbage carrot and scallions",
"q_type":"Object State",
"q_subtype":"initial_end_state",
"domain":"Cooking and Recipes",
"is_audited":0}
}
```
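The field list declares `is_audited` as a `bool`, while the sample above stores it as `0`. A minimal sketch of a loader that flattens one record and normalises that field is shown here; the `FGQASample` class and `parse_fgqa` helper are hypothetical names introduced for illustration and are not part of the dataset tooling.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FGQASample:
    """Hypothetical flattened view of one FGQA record."""
    qa_id: str
    segment_id: str
    question: str
    answer: str
    source_video_id: str
    source_dataset: str
    source_start_time: float
    source_end_time: float
    is_audited: bool


def parse_fgqa(record: dict) -> FGQASample:
    """Flatten a record and coerce loosely typed metadata fields."""
    meta = record["metadata"]
    return FGQASample(
        qa_id=record["qa_id"],
        segment_id=record["segment_id"],
        question=record["question"],
        answer=record["answer"],
        source_video_id=meta["source_video_id"],
        source_dataset=meta["source_dataset"],
        source_start_time=float(meta["source_start_time"]),
        source_end_time=float(meta["source_end_time"]),
        # The sample stores 0/1, so convert to a real boolean.
        is_audited=bool(meta["is_audited"]),
    )


# Abridged copy of the sample shown in the card.
raw = """
{
  "qa_id": "130ae268-0ac5-4b41-8f65-137119065d81",
  "segment_id": "01651739-6e54-4126-b1b5-fc87f59bda1e",
  "question": "What is the initial state of the cabbage before you begin chopping it?",
  "answer": "cabbage is half cut already and kept on cutting board before the person begin chopping it",
  "metadata": {
    "source_video_id": "-eyDS81FADw",
    "source_dataset": "youcook2",
    "source_start_time": 62.0,
    "source_end_time": 77.0,
    "is_audited": 0
  }
}
"""
sample = parse_fgqa(json.loads(raw))
print(sample.source_dataset, sample.is_audited)
```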
The `source_video_id`, `source_start_time` and `source_end_time` fields per sample can be used to obtain the training segments from each source dataset (specified in `source_dataset`).
Our training annotations contain ground-truth segments and activity names from COIN, Ego4d, EgoExo4d, CrossTask and YouCook2, as well as auto-generated segments and verified auto-generated activity names from HT100M.
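As a rough illustration of how those three fields might be used, the sketch below builds an `ffmpeg` argument list that trims one segment out of its source video. The directory layout (`<video_root>/<source_dataset>/<source_video_id>.mp4`), the `.mp4` extension, and the `build_trim_command` helper are all assumptions made for this example; each source dataset distributes its videos in its own way, so paths will need adapting.

```python
from pathlib import Path


def build_trim_command(sample: dict, video_root: str, out_root: str) -> list[str]:
    """Return an ffmpeg argument list that cuts the annotated segment.

    Assumes (hypothetically) that source videos are stored as
    <video_root>/<source_dataset>/<source_video_id>.mp4.
    """
    meta = sample["metadata"]
    start = float(meta["source_start_time"])
    end = float(meta["source_end_time"])
    if end <= start:
        raise ValueError(f"empty segment for qa_id={sample['qa_id']}")

    src = Path(video_root) / meta["source_dataset"] / f"{meta['source_video_id']}.mp4"
    dst = Path(out_root) / f"{sample['segment_id']}.mp4"
    return [
        "ffmpeg",
        "-ss", f"{start:.3f}",        # seek to the segment start (seconds)
        "-i", src.as_posix(),
        "-t", f"{end - start:.3f}",   # keep only the segment duration
        dst.as_posix(),
    ]


# Fields taken from the FGQA sample in the card.
example = {
    "qa_id": "130ae268-0ac5-4b41-8f65-137119065d81",
    "segment_id": "01651739-6e54-4126-b1b5-fc87f59bda1e",
    "metadata": {
        "source_video_id": "-eyDS81FADw",
        "source_dataset": "youcook2",
        "source_start_time": 62.0,
        "source_end_time": 77.0,
    },
}
cmd = build_trim_command(example, "videos", "segments")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the source videos are available locally. Several samples can share one `segment_id`, so trimming once per segment rather than once per question avoids redundant work.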
### Region Video Captioning (RCap)
Each training sample is a detailed description of an event involving a subject of interest in the video. Given a region mask and a specified video segment (time interval), the target is a caption that accurately describes the event occurring within that interval.
Data fields are:
- `uid`: an `int32` feature, unique identifier for the sample.
- `video`: a `string` feature, the video name.
- `masklet_id`: an `int32` feature, unique identifier for the input masklet within the video.
- `total_frames`: an `int32` feature, number of video frames.
- `caption`: a `string` feature, the caption describing the actions of the subject/object highlighted in the masklet within the temporal segment.
- `start_frame`: an `int32` feature, start frame of the temporal segment.
- `end_frame`: an `int32` feature, end frame of the temporal segment.
A sample from the RCap training data looks as follows:
```
{
"uid": 0,
"video": "sav_017599.mp4",
"masklet_id": 2,
"total_frames": 73,
"caption": "A boy enters the frame from the right, he wears glasses and turn back and exit from the right side of the frame.",
"start_frame": 30,
"end_frame": 72
}
```
Our training annotations cover videos from the SA-V (SAM-2) dataset, which can be downloaded from the official website: [`segment-anything-videos-download`](https://ai.meta.com/datasets/segment-anything-video-downloads).
### Region Temporal Localization (RTLoc)
Each training sample is a precise time interval within the video corresponding to a detailed description of an event involving a subject of interest in the video.
Given a video, a region masklet and a textual description of the event, the targets are the start and end timestamps that correspond to the occurrence of the event.
Notably, this task is the inverse of RCap: instead of generating the caption, the model receives it as input and generates the corresponding time interval.
Data fields are:
- `uid`: an `int32` feature, unique identifier for the sample.
- `video`: a `string` feature, the video name.
- `masklet_id`: an `int32` feature, unique identifier for the input masklet within the video.
- `total_frames`: an `int32` feature, number of video frames.
- `caption`: a `string` feature, the caption describing the actions of the subject/object highlighted in the masklet within the temporal segment.
- `start_frame`: an `int32` feature, start frame of the video segment.
- `end_frame`: an `int32` feature, end frame of the video segment.
A sample from RTLoc training data looks as follows:
```
{
"uid": 0,
"video": "sav_017599.mp4",
"masklet_id": 2,
"total_frames": 73,
"caption": "A boy enters the frame from the right, he wears glasses and turn back and exit from the right side of the frame.",
"start_frame": 30,
"end_frame": 72
}
```
Note that the start/end frames are used as output targets for RTLoc, while the caption is the output target for RCap.
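Because RCap and RTLoc records share the same fields and differ only in which ones are targets, a single record can be turned into a training pair for either task. The sketch below illustrates that split; the `to_rcap` and `to_rtloc` helpers and the `(inputs, target)` tuple layout are hypothetical choices for this example, not the format used by the authors' training code.

```python
def to_rcap(record: dict) -> tuple[dict, str]:
    """RCap: the masklet and frame interval are inputs, the caption is the target."""
    inputs = {
        "video": record["video"],
        "masklet_id": record["masklet_id"],
        "start_frame": record["start_frame"],
        "end_frame": record["end_frame"],
    }
    return inputs, record["caption"]


def to_rtloc(record: dict) -> tuple[dict, tuple[int, int]]:
    """RTLoc: the masklet and caption are inputs, the frame interval is the target."""
    inputs = {
        "video": record["video"],
        "masklet_id": record["masklet_id"],
        "caption": record["caption"],
    }
    return inputs, (record["start_frame"], record["end_frame"])


# The sample record shown in the card.
record = {
    "uid": 0,
    "video": "sav_017599.mp4",
    "masklet_id": 2,
    "total_frames": 73,
    "caption": "A boy enters the frame from the right, he wears glasses and turn back and exit from the right side of the frame.",
    "start_frame": 30,
    "end_frame": 72,
}

rcap_inputs, rcap_target = to_rcap(record)
rtloc_inputs, rtloc_target = to_rtloc(record)
print(rtloc_target)
```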
### Region Dense Temporal Captioning (RDCap)
Each training sample is a detailed description of all events involving a specific subject of interest (e.g., a person, animal, or object) in a video.
Given a video and a region masklet, the target is a sequence of (start, end, caption) triplets that cover the entire duration of the video, including periods when the subject is not visible.
Data fields are:
- `uid`: an `int32` feature, unique identifier for the sample.
- `video`: a `string` feature, the video name.
- `masklet_id`: an `int32` feature, unique identifier for the input masklet within the video.
- `total_frames`: an `int32` feature, number of video frames.
- `dense_captions`: a `list` of `dict` features, each containing information per event in the video, made up of:
  - `start_frame`: an `int32` feature, start frame of the video segment corresponding to the event.
  - `end_frame`: an `int32` feature, end frame of the video segment corresponding to the event.
  - `caption`: a `string` feature, the caption describing the actions of the subject/object highlighted in the masklet within the temporal segment.
A sample from RDCap training data looks as follows:
```
{
"uid": 0,
"video": "sav_017599.mp4",
"masklet_id": 2,
"total_frames": 73,
"dense_captions": [
{"start_frame": 0, "end_frame": 29, "caption": "Out of frame."},
{"start_frame": 30, "end_frame": 72, "caption": "A boy enters the frame from the right, he wears glasses and turn back and exit from the right side of the frame."}
]
}
```
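Since the RDCap target is meant to cover the entire video, a loader may want to sanity-check each sample before training. The sketch below does this under assumptions inferred only from the single sample above: frames are zero-indexed, `end_frame` is inclusive, and consecutive events are contiguous without overlap. The `covers_whole_video` helper is a hypothetical name, and real data may not follow these conventions in every sample.

```python
def covers_whole_video(sample: dict) -> bool:
    """Check that dense captions tile frames 0 .. total_frames - 1.

    Assumes zero-based frame indices and inclusive end frames,
    as suggested by the sample in the card.
    """
    events = sorted(sample["dense_captions"], key=lambda e: e["start_frame"])
    expected_start = 0
    for event in events:
        if event["start_frame"] != expected_start:
            return False  # gap or overlap with the previous event
        if event["end_frame"] < event["start_frame"]:
            return False  # malformed interval
        expected_start = event["end_frame"] + 1
    return expected_start == sample["total_frames"]


# The RDCap sample shown in the card (caption shortened).
rdcap_sample = {
    "uid": 0,
    "video": "sav_017599.mp4",
    "masklet_id": 2,
    "total_frames": 73,
    "dense_captions": [
        {"start_frame": 0, "end_frame": 29, "caption": "Out of frame."},
        {"start_frame": 30, "end_frame": 72, "caption": "A boy enters the frame from the right."},
    ],
}
print(covers_whole_video(rdcap_sample))
```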
## Data Stats
The training data sizes per task are:
| | Train | Task Output |
| ----------- | ----------- | ----------- |
| FGQA | 2321035 | Answer |
| RCap | 179447 | Caption |
| RTLoc | 179447 | Temporal Segment |
| RDCap | 117248 | Dense Captions and Temporal Segments |
### Licensing Information
PLM-Video-Human data is released under CC BY 4.0, except for the FGQA split, which is an output of Llama 3.2 and is subject to the Llama 3.2 license (https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE).
If the data is used to train, fine-tune, or otherwise improve an AI model that is distributed or made available, the name of any such AI model shall begin with "Llama".
### Citation Information
Cite as:
```
@article{cho2025PerceptionLM,
title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
journal={arXiv},
year={2025}
}
```
Provider:
maas
Created:
2025-05-20



