longvideo-reason

Name: longvideo-reason
Creator: maas
Published: 2026-01-07 15:13:24
License: No description provided

ModelScope community. Updated 2026-01-07; indexed 2026-01-10.

Download link:

https://modelscope.cn/datasets/LongVideo-Reason/longvideo-reason

Description:

<p align="center" width="100%">
<img src="https://raw.githubusercontent.com/NVlabs/Long-RL/main/assets/long-rl-logo.png" alt="Stanford-Alpaca" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

# Long-RL: Scaling RL to Long Sequences (Training, Validation and Test Dataset - for research only)

[![Paper](https://img.shields.io/badge/Paper-Arvix%20Link-green)](https://arxiv.org/abs/2507.07966)
[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-yellow.svg)](https://github.com/NVlabs/Long-RL/blob/main/LICENSE)

<div align="center">

[![Watch the video](https://raw.githubusercontent.com/NVlabs/Long-RL/main/assets/demo_video_first_frame.png)](https://www.youtube.com/watch?v=ykbblK2jiEg)

</div>

## Data Distribution

<p align="center" width="100%">
<img src="https://raw.githubusercontent.com/NVlabs/Long-RL/main/assets/data_distribution.png" alt="Stanford-Alpaca" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

We construct a high-quality dataset with CoT annotations for long video reasoning, named LongVideo-Reason. Leveraging a powerful VLM (NVILA-8B) and a leading open-source reasoning LLM, we build a dataset of 52K high-quality Question-Reasoning-Answer pairs for long videos. We use 18K high-quality samples for Long-CoT-SFT to initialize the model's reasoning and instruction-following abilities, and 33K samples plus an additional 110K video samples for reinforcement learning. This two-stage training combines high-quality reasoning annotations with reinforcement learning, enabling LongVILA-R1 to achieve superior and generalized video reasoning. We also manually curate a balanced set of 1K long-video samples to build a new benchmark, LongVideo-Reason-eval, which evaluates performance from four perspectives: Temporal, Goal and Purpose, Spatial, and Plot and Narrative.

**LongVideo-Reason (Train, 52k) [[Data Link](https://huggingface.co/datasets/LongVideo-Reason/longvideo-reason)]**

**LongVideo-Reason-eval (Test, 1k) [[Data Link](https://huggingface.co/datasets/LongVideo-Reason/longvideo_eval_videos)]**

## Installation

```bash
git clone https://github.com/NVlabs/Long-RL.git
cd Long-RL
pip install -e .
```

To train Qwen-Omni models, also run:

```bash
bash vllm_replace.sh
```

## Training

### Single node

For single-node training (up to 8 GPUs), see the training scripts in the `examples` directory. For example:

```bash
bash examples/new_supports/qwen2_5_vl_3b_video_grpo.sh $VIDEO_PATH
```

### Multi-node

For jobs that require multiple nodes, see the approach described in the EasyR1 repo, [here](https://github.com/hiyouga/EasyR1/tree/main?tab=readme-ov-file#how-to-run-70b-model-in-multi-node-environment).

We also provide an example for `sbatch` scripts, where `TRAIN_SCRIPT` is the single-node training script and `NNODES` is the number of nodes required:

```bash
bash scripts/srun_multi_nodes.sh $TRAIN_SCRIPT $NNODES
```

For example:

```bash
bash scripts/srun_multi_nodes.sh examples/new_supports/qwen2_5_vl_3b_video_grpo.sh 2
```

### Merge Checkpoint in Hugging Face Format

This follows the approach used in the EasyR1 repo.

```bash
python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
```

## Evaluation

We provide instructions for evaluating models on our `LongVideo-Reason` benchmark in the `eval` [directory](https://github.com/NVlabs/Long-RL/tree/main/eval).

## How to contribute

- Make sure you have git installed.
- Create your own [fork](https://github.com/NVlabs/Long-RL/fork) of the project.
- Clone the repository to your local machine using `git clone` and the URL of the project.
- Read the `Installation` section above.
- Commit and push your changes.
- Open a pull request when you have finished modifying the project.
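The contribution steps above can be sketched as two small shell functions. This is only an illustration, not a workflow prescribed by the project: the GitHub username, branch name, and commit message are hypothetical placeholders, and working on a separate branch is an assumption. Setting `DRY_RUN=1` prints the commands instead of running them.

```bash
#!/usr/bin/env bash
# Sketch of the fork-and-pull-request steps listed above.
# The username, branch, and commit message arguments are hypothetical
# placeholders, not names taken from the project.
set -euo pipefail

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Clone your fork, install the package, and start a working branch.
# Note: outside dry-run mode this changes the caller's working directory.
setup_fork() {
  local user="$1" branch="$2"
  run git clone "https://github.com/${user}/Long-RL.git"
  run cd Long-RL
  run pip install -e .
  run git checkout -b "${branch}"
}

# Stage, commit, and push your edits; then open the pull request on GitHub.
publish_changes() {
  local branch="$1" message="$2"
  run git add -A
  run git commit -m "${message}"
  run git push origin "${branch}"
}
```

For instance, `DRY_RUN=1 setup_fork your-username my-change` lists the four setup commands without touching the network, which is a cheap way to check the sequence before running it for real.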
## Core Contributors

[Yukang Chen](https://yukangchen.com/), [Wei Huang](https://aaron-weihuang.com/), [Shuai Yang](https://andysonys.github.io), [Qinghao Hu](https://tonyhao.xyz/), [Baifeng Shi](https://bfshi.github.io/), [Hanrong Ye](https://sites.google.com/site/yhrspace/home), [Ligeng Zhu](https://lzhu.me/).

We welcome all contributions and will acknowledge all contributors clearly.

## Citation

Please consider citing our paper and this framework if they are helpful in your research.

```bibtex
@misc{long-rl,
  title = {Long-RL: Scaling RL to Long Sequences},
  author = {Yukang Chen and Wei Huang and Shuai Yang and Qinghao Hu and Baifeng Shi and Hanrong Ye and Ligeng Zhu and Zhijian Liu and Pavlo Molchanov and Jan Kautz and Xiaojuan Qi and Sifei Liu and Hongxu Yin and Yao Lu and Song Han},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/NVlabs/Long-RL}},
}
```

```bibtex
@article{chen2025longvila-r1,
  title={Scaling RL to Long Videos},
  author={Yukang Chen and Wei Huang and Baifeng Shi and Qinghao Hu and Hanrong Ye and Ligeng Zhu and Zhijian Liu and Pavlo Molchanov and Jan Kautz and Xiaojuan Qi and Sifei Liu and Hongxu Yin and Yao Lu and Song Han},
  year={2025},
  eprint={2507.07966},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

```bibtex
@inproceedings{chen2024longvila,
  title={LongVILA: Scaling Long-Context Visual Language Models for Long Videos},
  author={Yukang Chen and Fuzhao Xue and Dacheng Li and Qinghao Hu and Ligeng Zhu and Xiuyu Li and Yunhao Fang and Haotian Tang and Shang Yang and Zhijian Liu and Ethan He and Hongxu Yin and Pavlo Molchanov and Jan Kautz and Linxi Fan and Yuke Zhu and Yao Lu and Song Han},
  booktitle={The International Conference on Learning Representations (ICLR)},
  year={2025},
}
```

## Acknowledgement

- [EasyR1](https://github.com/hiyouga/EasyR1): the codebase we built upon. Thanks for their wonderful work.
- [verl](https://github.com/volcengine/verl): the RL training framework we built upon.
- [vllm](https://github.com/vllm-project/vllm): we built upon vllm for the rollout engine.
- [Flow-GRPO](https://github.com/yifan123/flow_grpo): we refer to Flow-GRPO for the image/video generation RL part.
- [Shot2story](https://arxiv.org/abs/2312.10300): we curate 18K long videos from Shot2Story.
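The two Hugging Face dataset repos linked under Data Distribution can be fetched from the command line. The sketch below is an assumption, not the project's documented procedure: it relies on the `huggingface-cli download` command from the `huggingface_hub` package (recent releases also offer `hf download`), and `DATA_ROOT` is a hypothetical local directory. Setting `DRY_RUN=1` prints the commands instead of downloading.

```bash
#!/usr/bin/env bash
# Sketch: download the training and evaluation dataset repos.
# Assumes the huggingface_hub CLI is installed:
#   pip install -U "huggingface_hub[cli]"
# DATA_ROOT is a hypothetical target directory, not a path the project prescribes.
set -euo pipefail

DATA_ROOT="${DATA_ROOT:-data}"

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Download one dataset repo into DATA_ROOT/<repo name>.
download_dataset() {
  local repo_id="$1"
  # ${repo_id##*/} keeps only the part after the last slash.
  run huggingface-cli download "${repo_id}" \
    --repo-type dataset \
    --local-dir "${DATA_ROOT}/${repo_id##*/}"
}

# Fetch both repos named in the Data Links above.
download_all() {
  download_dataset "LongVideo-Reason/longvideo-reason"
  download_dataset "LongVideo-Reason/longvideo_eval_videos"
}
```

Running `DRY_RUN=1 download_all` first shows where each repo would land; dropping `DRY_RUN` performs the actual download, which may be large given the video content.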


Provider: maas

Created: 2025-09-04
