
MCG-NJU/LongVPO-Training-Data

Source: Hugging Face. Updated 2026-03-07. Indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/MCG-NJU/LongVPO-Training-Data
Resource description:
---
license: mit
language:
- en
base_model:
- OpenGVLab/InternVL3-8B
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- multimodal
task_categories:
- video-text-to-text
- visual-question-answering
size_categories:
- 10K<n<100K
configs:
- config_name: stage1
  data_files:
  - split: train
    path: InternVL3_stage1_short2long_training.jsonl
  default: true
- config_name: stage2
  data_files:
  - split: train
    path: InternVL3_stage2_long_training.jsonl
---

# LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

[\[📂 GitHub\]](https://github.com/MCG-NJU/LongVPO) [\[📜 Paper\]](https://arxiv.org/abs/2602.02341) [\[🤗 Model\]](https://huggingface.co/MCG-NJU/LongVPO-Stage2-InternVL3-8B)

## ⚙️ Training Methodology & Data

The training process of LongVPO is divided into two progressive stages, utilizing curated datasets to enhance both grounded understanding and complex reasoning:

* **Stage 1: Anchored Cues Optimization**
    * **Objective:** To anchor the model's attention to critical temporal events and prevent attention drift over long contexts.
    * **Data & Method:** Utilizes short-to-long video alignment data sourced from [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). The preference optimization leverages anchored temporal cues (e.g., specific timestamps or keyframes) to teach the model how to locate and extract relevant information accurately before generating an answer.
* **Stage 2: Self-Reasoning Optimization**
    * **Objective:** To internalize the reasoning process, allowing the model to autonomously connect multiple events across the video without relying on explicit external cues.
    * **Data & Method:** Focuses purely on long-form video datasets, utilizing [Vript](https://huggingface.co/datasets/Mutonix/Vript). The model is trained to generate its own reasoning chains (self-reasoning) to deduce the correct answers, aligning its outputs with human preference for logical and comprehensive long-video comprehension.

## 📜 Citation

If you find this work helpful, please consider citing our paper:

```bibtex
@inproceedings{huang2025longvpo,
  title={Long{VPO}: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization},
  author={Zhenpeng Huang and Jiaqi Li and Zihan Jia and Xinhao Li and Desen Meng and Lingxue Song and Xi Chen and Liang Li and Limin Wang},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=LKAp7Dknxf}
}
```
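
## Reading the training files (sketch)

The card's front matter declares two configs, `stage1` and `stage2`, each pointing at a single JSONL file for the `train` split. The following is a minimal sketch of reading one of those files once it has been downloaded locally. The names `CONFIG_FILES`, `iter_records`, and `data_dir` are hypothetical helpers introduced here for illustration, and because the card does not describe the record schema, the sketch makes no assumption about field names.

```python
import json
from pathlib import Path
from typing import Iterator

# Config name -> JSONL file name, copied from the `configs` block of the card.
CONFIG_FILES = {
    "stage1": "InternVL3_stage1_short2long_training.jsonl",
    "stage2": "InternVL3_stage2_long_training.jsonl",
}


def iter_records(data_dir: str, config: str = "stage1") -> Iterator[dict]:
    """Yield one dict per non-empty line of the JSONL file for `config`.

    `data_dir` is assumed to be a local folder that already contains the
    downloaded JSONL files; nothing is fetched from the network here.
    """
    if config not in CONFIG_FILES:
        raise ValueError(
            f"unknown config {config!r}; expected one of {sorted(CONFIG_FILES)}"
        )
    path = Path(data_dir) / CONFIG_FILES[config]
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

With the Hugging Face `datasets` library installed, `load_dataset("MCG-NJU/LongVPO-Training-Data", "stage1", split="train")` should select the same file through the declared config, with `stage1` also being the default when no config name is passed.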
Provider:
MCG-NJU