Mutonix/Vript
收藏Hugging Face2024-06-11 更新2024-04-19 收录
下载链接:
https://hf-mirror.com/datasets/Mutonix/Vript
下载链接
链接失效反馈官方服务:
资源简介:
---
task_categories:
- video-classification
- visual-question-answering
- text-to-video
- text-to-image
- image-to-video
language:
- en
size_categories:
- 100K<n<1M
configs:
- config_name: vript-long
data_files:
- split: train
path: vript_captions/vript_long_videos_captions.jsonl
- config_name: vript-short
data_files:
- split: train
path: vript_captions/vript_short_videos_captions.jsonl
---
# 🎬 Vript: Refine Video Captioning into Video Scripting [[Github Repo](https://github.com/mutonix/Vript)]
---
We construct a **fine-grained** video-text dataset with 12K annotated high-resolution videos **(~400k clips)**. The annotation of this dataset is inspired by the video script. If we want to make a video, we have to first write a script to organize how to shoot the scenes in the videos. To shoot a scene, we need to decide the content, shot type (medium shot, close-up, etc), and how the camera moves (panning, tilting, etc). Therefore, we extend video captioning to video scripting by annotating the videos in the format of video scripts. Different from the previous video-text datasets, we densely annotate the entire videos without discarding any scenes and each scene has a caption with **~145** words. Besides the vision modality, we transcribe the voice-over into text and put it along with the video title to give more background information for annotating the videos.
**_<font color=red>Warning: Some zip files may contain empty folders. You can ignore them as these folders have no video clips and no annotation files.</font>_**
<p align="center">
<img src="assets/Vript-overview_00.png" width="800">
</p>
## Getting Started
**By downloading these datasets, you agree to the terms of the [License](#License).**
The captions of the videos in the Vript dataset are structured as follows:
```
{
"meta": {
"video_id": "339dXVNQXac",
"video_title": "...",
"num_clips": ...,
"integrity": true,
},
"data": {
"339dXVNQXac-Scene-001": {
"video_id": "339dXVNQXac",
"clip_id": "339dXVNQXac-Scene-001",
"video_title": "...",
"caption":{
"shot_type": "...",
"camera_movement": "...",
"content": "...",
"scene_title": "...",
},
"voiceover": ["..."],
},
"339dXVNQXac-Scene-002": {
...
}
}
}
```
- `video_id`: The ID of the video from YouTube.
- `video_title`: The title of the video.
- `num_clips`: The number of clips in the video. If the `integrity` is `false`, some clips may not be captioned.
- `integrity`: Whether all clips are captioned.
- `clip_id`: The ID of the clip in the video, which is the concatenation of the `video_id` and the scene number.
- `caption`: The caption of the scene, including the shot type, camera movement, content, and scene title.
- `voiceover`: The transcription of the voice-over in the scene.
The data is organized as follows:
```
Vript/
|
├── vript_meta/
│ ├── vript_long_videos_meta.json
│ └── vript_short_videos_meta.json
│
├── vript_captions/
│ ├── vript_long_videos_captions.zip
│ │ ├── 007EvOaWFOA_caption.json
│ │ └── ...
│ └── vript_short_videos_captions.zip
│ └── ...
│
├── vript_long_videos/
│ ├── video_1_of_1095.zip
│ │ ├── 007EvOaWFOA.mp4
│ │ └── ...
│ ├── video_2_of_1095.zip
│ └── ...
│
├── vript_short_videos/
│ ├── short_video_1_of_42.zip
│ │ ├── 02toZL7p4_0.mp4
│ │ └── ...
│ ├── short_video_2_of_42.zip
│ └── ...
│
├── vript_long_videos_clips/
│ ├── clips_1_of_1095.zip
│ │ ├── 007EvOaWFOA/
│ │ │ ├── 007EvOaWFOA_cut_meta.json
│ │ │ ├── 007EvOaWFOA_asr.jsonl
│ │ │ ├── 007EvOaWFOA-Scene-001.mp4
│ │ │ └── ...
│ │ └── ...
│ ├── clips_2_of_1095.zip
│ └── ...
│
└── vript_short_videos_clips/
├── shorts_clips_1_of_42.zip
│ ├── 02toZL7p4_0/
│ │ ├── 02toZL7p4_0_cut_meta.json
│ │ ├── 02toZL7p4_0_asr.jsonl
│ │ ├── 02toZL7p4_0-Scene-001.mp4
│ │ └── ...
│ └── ...
├── shorts_clips_2_of_42.zip
└── ...
```
- `vript_meta/`: The meta information of the videos in the Vript dataset, including the video id, title, url, description, category, etc.
- `vript_captions/`: The video captions of the videos in the Vript dataset, which are structured as described above.
- `vript_long_videos/` (667 GB) and `vript_short_videos/` (8.8 GB): The untrimmed videos in the Vript dataset. Long videos are from YouTube, and short videos are from YouTube Shorts and TikTok. We divide the whole data into multiple zip files, each containing 10 long videos / 50 short videos.
All the videos are in **720p** resolution, and _we will provide the videos in the highest quality (up to 2K) available later_ (or you can download them from YouTube directly).
- `vript_long_videos_clips/` (822 GB) and `vript_short_videos_clips/` (12 GB): The trimmed video clips in the Vript dataset, which correspond to scenes in the `video_captions`.
- `xxx_cut_meta.json`: The meta information about how the video is trimmed, including the start time, end time, and the duration of the scene.
- `xxx_asr.jsonl`: The transcription of the voice-over in the scene.
## License
By downloading or using the data or model, you understand, acknowledge, and agree to all the terms in the following agreement.
- ACADEMIC USE ONLY
Any content from Vript/Vript-Bench dataset and Vriptor model is available for academic research purposes only. You agree not to reproduce, duplicate, copy, trade, or exploit for any commercial purposes
- NO DISTRIBUTION
Respect the privacy of personal information of the original source. Without the permission of the copyright owner, you are not allowed to perform any form of broadcasting, modification or any other similar behavior to the data set content.
- RESTRICTION AND LIMITATION OF LIABILITY
In no event shall we be liable for any other damages whatsoever arising out of the use of, or inability to use this dataset and its associated software, even if we have been advised of the possibility of such damages.
- DISCLAIMER
You are solely responsible for legal liability arising from your improper use of the dataset content. We reserve the right to terminate your access to the dataset at any time. You should delete the Vript/Vript-Bench dataset or Vriptor model if required.
This license is modified from the [HD-VG-100M](https://github.com/daooshee/HD-VG-130M) license.
<!-- ## Citation
```
``` -->
## Contact
**Dongjie Yang**: [djyang.tony@sjtu.edu.cn](djyang.tony@sjtu.edu.cn)
Paper: arxiv.org/abs/2406.06040
提供机构:
Mutonix
原始信息汇总
数据集概述
数据集内容
- 包含12,000个高分辨率视频,约400,000个视频片段。
- 视频内容通过视频脚本格式进行标注,每个场景包含详细的描述,包括内容、镜头类型(如中景、特写等)和摄像机移动方式(如平移、倾斜等)。
标注特点
- 采用视频脚本格式进行密集标注,不遗漏任何视频场景。
- 每个场景配有约145字的描述文本。
附加信息
- 视频的旁白被转录为文本,与视频标题一同提供,以增强视频背景信息的丰富性。



