MCG-NJU/LongVPO-Training-Data

Name: MCG-NJU/LongVPO-Training-Data
Creator: MCG-NJU
Published: 2026-03-07 13:03:09
License: MIT

Source: Hugging Face. Updated 2026-03-07; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/MCG-NJU/LongVPO-Training-Data


Description:

---
license: mit
language:
- en
base_model:
- OpenGVLab/InternVL3-8B
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- multimodal
task_categories:
- video-text-to-text
- visual-question-answering
size_categories:
- 10K<n<100K
configs:
- config_name: stage1
  data_files:
  - split: train
    path: InternVL3_stage1_short2long_training.jsonl
  default: true
- config_name: stage2
  data_files:
  - split: train
    path: InternVL3_stage2_long_training.jsonl
---

# LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

[\[📂 GitHub\]](https://github.com/MCG-NJU/LongVPO) [\[📜 Paper\]](https://arxiv.org/abs/2602.02341) [\[🤗 Model\]](https://huggingface.co/MCG-NJU/LongVPO-Stage2-InternVL3-8B)

## ⚙️ Training Methodology & Data

The training process of LongVPO is divided into two progressive stages, utilizing curated datasets to enhance both grounded understanding and complex reasoning:

* **Stage 1: Anchored Cues Optimization**
  * **Objective:** To anchor the model's attention to critical temporal events and prevent attention drift over long contexts.
  * **Data & Method:** Utilizes short-to-long video alignment data sourced from [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). The preference optimization leverages anchored temporal cues (e.g., specific timestamps or keyframes) to teach the model how to locate and extract relevant information accurately before generating an answer.
* **Stage 2: Self-Reasoning Optimization**
  * **Objective:** To internalize the reasoning process, allowing the model to autonomously connect multiple events across the video without relying on explicit external cues.
  * **Data & Method:** Focuses purely on long-form video datasets, utilizing [Vript](https://huggingface.co/datasets/Mutonix/Vript). The model is trained to generate its own reasoning chains (self-reasoning) to deduce the correct answers, aligning its outputs with human preference for logical and comprehensive long-video comprehension.

## 📜 Citation

If you find this work helpful, please consider citing our paper:

```bibtex
@inproceedings{huang2025longvpo,
  title={Long{VPO}: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization},
  author={Zhenpeng Huang and Jiaqi Li and Zihan Jia and Xinhao Li and Desen Meng and Lingxue Song and Xi Chen and Liang Li and Limin Wang},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=LKAp7Dknxf}
}
```
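The card's `configs` block maps two config names to one JSONL file each, but it does not document the fields inside each record. The sketch below assumes the two files have already been downloaded into a local directory. It resolves a config name to its file and counts which top-level keys appear, so the schema can be inspected before writing a training loader. The helper names (`resolve_config`, `iter_jsonl`, `key_frequencies`) are hypothetical and not part of the LongVPO code base; only the two file names come from the card.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator

# Config name -> JSONL file name, copied from the `configs` block of the card.
CONFIG_FILES = {
    "stage1": "InternVL3_stage1_short2long_training.jsonl",
    "stage2": "InternVL3_stage2_long_training.jsonl",
}


def resolve_config(data_dir, config: str = "stage1") -> Path:
    """Return the local path of the JSONL file for a config (hypothetical helper)."""
    return Path(data_dir) / CONFIG_FILES[config]


def iter_jsonl(path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc


def key_frequencies(path, limit: int = 1000) -> Counter:
    """Count top-level keys over the first `limit` records.

    No field names are assumed here: the card does not list them,
    so this only reports whatever keys the records actually contain.
    """
    counts: Counter = Counter()
    for index, record in enumerate(iter_jsonl(path)):
        if index >= limit:
            break
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts
```

With the Hugging Face `datasets` library installed, passing the config name to `load_dataset("MCG-NJU/LongVPO-Training-Data", "stage1", split="train")` should select the same file, since `stage1` is marked as the default config and each config declares a single `train` split.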

Provider:

MCG-NJU
