tencent/Penguin-Recap-V

Source: Hugging Face. Updated 2026-03-30; indexed 2026-04-12.

Download link: https://hf-mirror.com/datasets/tencent/Penguin-Recap-V
Dataset description:
---
configs:
- config_name: sharegpt4video
  default: true
  data_files:
  - split: train
    path: data/sharegpt4video/*_relative.jsonl
- config_name: shortvideo
  data_files:
  - split: train
    path: data/shortvideo/*_relative.jsonl
- config_name: vidal_10m
  data_files:
  - split: train
    path: data/vidal_10m/*_relative.jsonl
- config_name: finevideo
  data_files:
  - split: train
    path: data/finevideo/*_relative.jsonl
- config_name: multi_moments_in_time
  data_files:
  - split: train
    path: data/multi_moments_in_time/*_relative.jsonl
tags:
- multimodal
- video-text
- captioning
- metadata-only
size_categories:
- 1M<n<10M
---

# Penguin-Recap-V

Penguin-Recap-V provides multi-granularity video annotations that align visual content with textual descriptions across **three temporal scales: dense time-level, paragraph-level, and video-level**.

## Included subsets

| subset | source collection | videos / clips | expected rows | source jsonl |
| --- | --- | ---: | ---: | --- |
| `sharegpt4video` | ShareGPT4Video | 40,145 | 120,435 | `sharegpt4video/predictions_process_relative.jsonl` |
| `shortvideo` | ShortVideo | 147,326 | 441,978 | `shortvideodataset/predictions_process_relative.jsonl` |
| `vidal_10m` | VIDAL-10M | 1,393,902 | 4,181,706 | `vidal_10m/predictions_process_relative.jsonl` |
| `finevideo` | FineVideo | 35,780 | 107,340 | `finevideo/predictions_process_fixed_relative.jsonl` |
| `multi_moments_in_time` | Multi-Moments in Time | 1,003,391 | 1,003,391 | `multi_moments_in_time/predictions_process_relative.jsonl` |

Expected total rows: **5,854,850**

## Annotation structure

Each line is a standalone JSON object:

```json
{"video": ["./ShareGPT4Video/zip_folder/panda/panda_videos_16/j5JqNXjGufw.mp4"], "conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
```

The annotation layout follows the processing notes used to prepare the dataset:

- Each `.jsonl` file contains one JSON object per line.
- For the same video, rows are consecutive in the file for all subsets except `multi_moments_in_time`.
- **The standard three-row order is:**<br>
  **1. Dense time-level caption**<br>
  **2. Paragraph-level caption**<br>
  **3. Video-level caption / summary**
- `multi_moments_in_time` is the special case: clips are shorter than 5 seconds and usually contain a single action, so only summary-style annotations were kept. The final training setup used QA data rather than caption supervision for this subset.

## Per-subset layout

| subset | row layout |
| --- | --- |
| `sharegpt4video` | 40,145 videos x 3 rows: dense caption, paragraph caption, summary |
| `shortvideo` | 147,326 videos x 3 rows: dense caption, paragraph caption, summary |
| `vidal_10m` | 1,393,902 videos x 3 rows: dense caption, paragraph caption, summary |
| `finevideo` | 35,780 videos x 3 rows: dense caption, paragraph caption, summary |
| `multi_moments_in_time` | 1,003,391 clips, summary only |

## Relative path convention

The uploaded files already use relative paths in the `video` field. Example:

```text
./ShareGPT4Video/zip_folder/panda/panda_videos_16/j5JqNXjGufw.mp4
```

Every source file and exported shard uses the `_relative.jsonl` suffix to indicate that machine-local path prefixes have already been removed.

## Video access policy

- This dataset repo contains JSONL only.
- To use the exported `video` paths directly, download the original videos from the official source datasets below and place them under your own local root.
- If a row contains `./ShareGPT4Video/...`, resolve it relative to your own storage root instead of expecting the repo to host the binary.

Example local resolution:

```python
import os

from datasets import load_dataset

dataset = load_dataset("tencent/Penguin-Recap-V", "sharegpt4video", split="train", streaming=True)
sample = next(iter(dataset))

local_root = "/path/to/your/storage/root"
video_path = os.path.join(local_root, sample["video"][0][2:])
print(video_path)
```

## Source video download guidance

- `ShareGPT4Video`: [project](https://sharegpt4video.github.io/), [data](https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video). Use the official project page or the Hugging Face release as the entry point for obtaining the source videos and metadata.
- `ShortVideo`: [project](https://github.com/tsinghua-fib-lab/ShortVideo_dataset), [data](https://github.com/tsinghua-fib-lab/ShortVideo_dataset). The official README links a sampled release and a tiny Dropbox version. Follow the README for the current access details.
- `VIDAL-10M`: [project](https://github.com/pku-yuangroup/languagebind), [data](https://github.com/PKU-YuanGroup/LanguageBind/blob/main/DATASETS.md). The official LanguageBind release documents VIDAL-10M via YouTube IDs and related metadata rather than redistributing the videos directly.
- `FineVideo`: [project](https://github.com/huggingface/fineVideo), [data](https://huggingface.co/datasets/HuggingFaceFV/finevideo). The official dataset page is hosted on the Hugging Face Hub. Access may require accepting the dataset terms on the repo page.
- `Multi-Moments in Time`: [project](http://moments.csail.mit.edu/), [data](http://moments.csail.mit.edu/). Request the dataset from the official Moments in Time site and use the accompanying papers for the original task definition.

Additional references:

- Multi-Moments in Time paper: https://arxiv.org/abs/1801.03150
- Multi-Moments in Time follow-up paper: https://arxiv.org/abs/1911.00232

## Repository layout

- `data/<subset>/*_relative.jsonl`: exported JSONL shards for each source
- `manifest/files.jsonl`: shard-level example counts and byte estimates
- `manifest/build_stats.json`: end-of-run summary

## Loading

```python
from datasets import load_dataset

dataset = load_dataset(
    "tencent/Penguin-Recap-V",
    "sharegpt4video",
    split="train",
    streaming=True,
)

sample = next(iter(dataset))
print(sample["video"][0])
print(sample["conversations"][0]["value"])
```

## Citation

```bibtex
@article{Penguin-VL,
  title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
  author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
  journal={arXiv preprint arXiv:2603.06569},
  year={2026}
}
```
Provider: tencent