PE-Video
Source: ModelScope community. Updated 2025-12-05. Indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/facebook/PE-Video
# PE Video Dataset (PVD)
[\[📃 Tech Report\]](https://arxiv.org/abs/2504.13181)
[\[📂 Github\]](https://github.com/facebookresearch/perception_models/)
The PE Video Dataset (PVD) is a large-scale collection of about 1 million diverse videos, of which roughly 120,000 clips are expertly annotated. The dataset was introduced in our paper "Perception Encoder".
## Overview
PE Video Dataset (PVD) comprises 1M high-quality and diverse videos. Among them, 120K videos are accompanied by automated and human-verified annotations, and all videos are accompanied by a video description and keywords. The videos are motion-centered, covering both first-person and third-person views with a wide coverage of scenes.
## PVD
### Key Application
Computer Vision, Video Understanding
### Intended Use Cases
- Train and evaluate video retrieval models
- Train and evaluate video captioning models
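For the retrieval use case, one common way to score a model is text-to-video Recall@K: the fraction of captions whose paired video appears among the K most similar videos. The dataset card does not prescribe a metric, so the following is only a minimal sketch under that assumption. The function names and the toy embeddings are hypothetical and are not part of PVD or its tooling.

```python
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_at_k(
    text_embeddings: Sequence[Sequence[float]],
    video_embeddings: Sequence[Sequence[float]],
    k: int,
) -> float:
    """Fraction of captions whose paired video (same index) ranks in the top k."""
    hits = 0
    for i, text in enumerate(text_embeddings):
        scores = [cosine_similarity(text, video) for video in video_embeddings]
        # Video indices sorted from most to least similar to this caption.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embeddings)


# Toy embeddings, made up for illustration: caption i is paired with video i.
toy_text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_video = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]

print("Recall@1:", recall_at_k(toy_text, toy_video, k=1))
print("Recall@2:", recall_at_k(toy_text, toy_video, k=2))
```

In practice the embeddings would come from the text and video encoders being evaluated, and the similarity matrix would be computed in batches rather than with Python loops.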
### Primary Data Type
- Videos
- Video caption (Human annotated / Model generated)
### Data Function
Training, Testing
### Dataset Characteristics
- Total number of videos: 998,862
- Total number of human annotated captions: 118,862
- Average FPS: 29.8
- Average Video Length: 16.7s
- Average video height: 346 px
- Average video width: 604 px
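
The figures above allow a rough estimate of the dataset's overall scale. The sketch below simply multiplies the listed counts and averages. The resulting totals are estimates derived from averages, not numbers reported by the dataset authors, and the variable names are illustrative.

```python
# Back-of-the-envelope scale estimates from the characteristics listed above.
# Derived from averages, so treat the results as rough estimates only.
NUM_VIDEOS = 998_862
NUM_HUMAN_CAPTIONS = 118_862
AVG_FPS = 29.8
AVG_LENGTH_S = 16.7

total_seconds = NUM_VIDEOS * AVG_LENGTH_S
total_hours = total_seconds / 3600
# Approximation: the product of two averages is not the exact frame count.
approx_total_frames = total_seconds * AVG_FPS
# Ratio of human-annotated captions to videos.
human_caption_ratio = NUM_HUMAN_CAPTIONS / NUM_VIDEOS

print(f"Estimated total duration: {total_hours:,.0f} hours")
print(f"Estimated total frames: {approx_total_frames:,.0f}")
print(f"Human-annotated captions per video: {human_caption_ratio:.1%}")
```

Under these assumptions the collection amounts to roughly 4,600 hours of video, with human-annotated captions available for about 12% of the videos.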
### Labels
A text description that summarizes the content of a video, describing what is happening in it, such as the actions, events, or objects shown.
### Nature Of Content
We selected videos from 10 different categories, including hand actions, object interactions, food preparation, work activities, outdoor scenes, animals, water scenes, object handling, close-up shots, and nature scenes.
### License
CC BY-NC 4.0
### Access Cost
Open access
### Labeling Methods
The video captions are refined based on the following criteria. The annotators should remove any hallucinations found in the model-generated caption, correct words that describe the video inaccurately, and eliminate repeating or redundant words to make the caption concise and accurate. Additionally, if major actions are missing from the caption, annotators should add them in a concise and natural way.
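One of the criteria above, eliminating repeated words, lends itself to a simple automated pre-check. The following is a hypothetical helper, not part of the PVD annotation pipeline: it only flags words that occur twice in a row so that a human annotator can review them, and it does not address hallucinations, inaccurate wording, or missing actions.

```python
import re
from typing import List


def find_repeated_words(caption: str) -> List[str]:
    """Return words that appear twice in a row (case-insensitive)."""
    # Keep alphabetic tokens and apostrophes; punctuation is ignored.
    words = re.findall(r"[A-Za-z']+", caption.lower())
    return [cur for prev, cur in zip(words, words[1:]) if prev == cur]


# Made-up captions for illustration only.
noisy_caption = "A man is is cutting the the onion."
clean_caption = "A dog runs on the beach."

print(find_repeated_words(noisy_caption))
print(find_repeated_words(clean_caption))
```

A check like this can only surface candidates. Some immediate repetitions are legitimate ("had had"), so the final decision stays with the annotator.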
### Validation Methods
All of the 118,862 human captions were reviewed by human annotators.
### Citation
If you find this dataset useful, please cite our papers:
```
@article{bolya2025perception-encoder,
title={Perception Encoder: The best visual embeddings are not at the output of the network},
author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
journal={arXiv},
year={2025}
}
@article{cho2025perceptionlm,
title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
journal={arXiv},
year={2025}
}
```
Provider: maas
Created: 2025-05-20