MCG-NJU/LongVPO-Training-Data
Source: Hugging Face · Updated 2026-03-07 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/MCG-NJU/LongVPO-Training-Data
Resource description:
---
license: mit
language:
- en
base_model:
- OpenGVLab/InternVL3-8B
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- multimodal
task_categories:
- video-text-to-text
- visual-question-answering
size_categories:
- 10K<n<100K
configs:
- config_name: stage1
  data_files:
  - split: train
    path: InternVL3_stage1_short2long_training.jsonl
  default: true
- config_name: stage2
  data_files:
  - split: train
    path: InternVL3_stage2_long_training.jsonl
---
# LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
[\[📂 GitHub\]](https://github.com/MCG-NJU/LongVPO) [\[📜 Paper\]](https://arxiv.org/abs/2602.02341) [\[🤗 Model\]](https://huggingface.co/MCG-NJU/LongVPO-Stage2-InternVL3-8B)
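The `configs` metadata above maps two configuration names, `stage1` (the default) and `stage2`, to one JSONL file each. The following is a minimal sketch for reading either file from a local copy of the repository. The helper name `read_stage` and the `data_dir` argument are illustrative assumptions, and no record schema is assumed because the card does not describe one.

```python
import json
from pathlib import Path

# File names copied from the `configs` metadata of this card.
CONFIG_FILES = {
    "stage1": "InternVL3_stage1_short2long_training.jsonl",
    "stage2": "InternVL3_stage2_long_training.jsonl",
}


def read_stage(data_dir, config_name="stage1"):
    """Yield one dict per non-empty JSONL line of the chosen config.

    `read_stage` is a hypothetical helper, not part of the LongVPO code base.
    `data_dir` is assumed to be a local directory holding the downloaded files.
    """
    path = Path(data_dir) / CONFIG_FILES[config_name]
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `records = list(read_stage("path/to/download", "stage2"))` collects the stage 2 records. With the `datasets` library, `load_dataset("MCG-NJU/LongVPO-Training-Data", "stage1", split="train")` should select the same stage 1 file through the config name, assuming the records parse into a consistent schema.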
## ⚙️ Training Methodology & Data
The training process of LongVPO is divided into two progressive stages, utilizing curated datasets to enhance both grounded understanding and complex reasoning:
* **Stage 1: Anchored Cues Optimization**
    * **Objective:** To anchor the model's attention to critical temporal events and prevent attention drift over long contexts.
    * **Data & Method:** Utilizes short-to-long video alignment data sourced from [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). The preference optimization leverages anchored temporal cues (e.g., specific timestamps or keyframes) to teach the model how to locate and extract relevant information accurately before generating an answer.
* **Stage 2: Self-Reasoning Optimization**
    * **Objective:** To internalize the reasoning process, allowing the model to autonomously connect multiple events across the video without relying on explicit external cues.
    * **Data & Method:** Focuses purely on long-form video datasets, utilizing [Vript](https://huggingface.co/datasets/Mutonix/Vript). The model is trained to generate its own reasoning chains (self-reasoning) to deduce the correct answers, aligning its outputs with human preference for logical and comprehensive long-video comprehension.
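Both stages are described as preference optimization, but this card does not state the exact objective. As an illustration only, the sketch below assumes a DPO-style (Direct Preference Optimization) loss over one chosen/rejected response pair. The function name, its arguments, and the default `beta` are assumptions made for this example, not details taken from the LongVPO paper or code.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style loss for a single preference pair (illustrative assumption).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference agree on both responses, the margin is zero and the loss equals `log(2)`. The loss falls below that value as the policy raises the chosen response relative to the rejected one.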
## 📜 Citation
If you find this work helpful, please consider citing our paper:
```bibtex
@inproceedings{huang2025longvpo,
title={Long{VPO}: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization},
author={Zhenpeng Huang and Jiaqi Li and Zihan Jia and Xinhao Li and Desen Meng and Lingxue Song and Xi Chen and Liang Li and Limin Wang},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=LKAp7Dknxf}
}
```
Provided by:
MCG-NJU



