longvideo_eval_videos
Source: ModelScope. Updated 2025-12-05; indexed 2025-08-23.
Download link:
https://modelscope.cn/datasets/LongVideo-Reason/longvideo_eval_videos
Description:
<p align="center" width="100%">
<img src="https://raw.githubusercontent.com/NVlabs/Long-RL/main/assets/long-rl-logo.png" alt="Long-RL logo" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>
# Long-RL: Scaling RL to Long Sequences (Evaluation Dataset - for research only)
[Paper (arXiv:2507.07966)](https://arxiv.org/abs/2507.07966)
[License](https://github.com/NVlabs/Long-RL/blob/main/LICENSE)
<div align="center">
[Video (YouTube)](https://www.youtube.com/watch?v=ykbblK2jiEg)
</div>
## Data Distribution
<p align="center" width="100%">
<img src="https://raw.githubusercontent.com/NVlabs/Long-RL/main/assets/data_distribution.png" alt="Data distribution" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>
We construct LongVideo-Reason, a high-quality dataset with chain-of-thought (CoT) annotations for long-video reasoning. Using a powerful VLM (NVILA-8B) and a leading open-source reasoning LLM, we generate 52K high-quality Question-Reasoning-Answer pairs for long videos. We use 18K of these samples for Long-CoT-SFT, which initializes the model's reasoning and instruction-following abilities, and 33K samples plus an additional 110K videos for reinforcement learning. This two-stage training combines high-quality reasoning annotations with reinforcement learning, enabling LongVILA-R1 to achieve superior and generalizable video reasoning. We also manually curate a balanced set of 1K long-video samples as a new benchmark, LongVideo-Reason-eval, which evaluates performance from four perspectives: Temporal, Goal and Purpose, Spatial, and Plot and Narrative.
**LongVideo-Reason (Train, 52k) [[Data Link](https://huggingface.co/datasets/LongVideo-Reason/longvideo-reason)]**
**LongVideo-Reason-eval (Test, 1k) [[Data Link](https://huggingface.co/datasets/LongVideo-Reason/longvideo_eval_videos)]**
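Because the benchmark reports results along four perspectives, a per-category breakdown is a natural way to summarize a run. The following is a minimal sketch, not the repository's scoring code: the record keys `category`, `prediction`, and `answer` are hypothetical, and exact string matching is an assumption about how answers are compared.

```python
from collections import defaultdict

# The four perspectives named in the text above.
CATEGORIES = ("Temporal", "Goal and Purpose", "Spatial", "Plot and Narrative")


def accuracy_by_category(records):
    """Return {category: accuracy} plus an "Overall" entry.

    Each record is a dict with the hypothetical keys "category",
    "prediction", and "answer"; the real benchmark files may use
    different field names.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        cat = rec["category"]
        total[cat] += 1
        # Compare case-insensitively after trimming whitespace.
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[cat] += 1
    # Only categories that actually occur in the records are reported.
    report = {cat: correct[cat] / total[cat] for cat in CATEGORIES if total[cat]}
    n = sum(total.values())
    if n:
        # "Overall" is computed over every record, whatever its category.
        report["Overall"] = sum(correct.values()) / n
    return report
```

Reporting each perspective separately keeps a weak category from being hidden by a strong overall average.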
## Installation
```bash
git clone https://github.com/NVlabs/Long-RL.git
cd Long-RL
pip install -e .
```
To train Qwen-Omni models, also run:
```bash
bash vllm_replace.sh
```
## Training
### Single node
For single-node training (up to 8 GPUs), refer to the training scripts in the `examples` directory. For example:
```bash
bash examples/new_supports/qwen2_5_vl_3b_video_grpo.sh $VIDEO_PATH
```
### Multiple nodes
For jobs that require multiple nodes, refer to the approach described in the EasyR1 repo, [here](https://github.com/hiyouga/EasyR1/tree/main?tab=readme-ov-file#how-to-run-70b-model-in-multi-node-environment).
We also provide an example `sbatch` launcher, where `TRAIN_SCRIPT` is the single-node training script and `NNODES` is the number of nodes required:
```bash
bash scripts/srun_multi_nodes.sh $TRAIN_SCRIPT $NNODES
```
For example,
```bash
bash scripts/srun_multi_nodes.sh examples/new_supports/qwen2_5_vl_3b_video_grpo.sh 2
```
### Merge Checkpoint in Hugging Face Format
This follows the procedure used in the EasyR1 repo.
```bash
python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
```
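When a run has saved several checkpoints, it can help to locate the most recent one before merging. The sketch below assumes the `global_step_N/actor` layout implied by the example path above; the helper names `latest_actor_dir` and `merge_command` are hypothetical and not part of the repository.

```python
import re
from pathlib import Path


def latest_actor_dir(exp_dir):
    """Return the `actor` directory of the highest-numbered global_step_N.

    Returns None when exp_dir is missing or holds no such checkpoint.
    """
    exp_path = Path(exp_dir)
    if not exp_path.is_dir():
        return None
    pattern = re.compile(r"global_step_(\d+)$")
    best_step, best_path = -1, None
    for child in exp_path.iterdir():
        match = pattern.match(child.name)
        actor = child / "actor"
        # Skip step directories that have no saved actor weights.
        if match and actor.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), actor
    return best_path


def merge_command(actor_dir):
    """Build the merge command shown above as an argument list."""
    return ["python3", "scripts/model_merger.py", "--local_dir", str(actor_dir)]
```

The returned list can be passed to `subprocess.run` from the repository root. Steps are compared numerically, so `global_step_10` ranks above `global_step_2`.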
## Evaluation
We provide instructions for evaluating models on our `LongVideo-Reason` benchmark in the `eval` [directory](https://github.com/NVlabs/Long-RL/tree/main/eval).
## Testing on LongVideo-Reason-eval
This section provides the scripts for testing on our LongVideo-Reason-eval set. More details about the training set can be found [here](https://github.com/NVlabs/Long-RL/issues/1).
The test videos are available [here](https://huggingface.co/datasets/LongVideo-Reason/longvideo_eval_videos/tree/main). Download the archives and extract them with `tar -zxvf` into a directory named `longvila_videos`:
```
$VIDEO_DIR
└── longvila_videos
    └── mp4/webm/mkv videos
```
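After extraction, the layout above can be sanity-checked before a long evaluation run. This is a minimal sketch under stated assumptions: `count_videos` is a hypothetical helper, and it searches recursively because the text does not say whether the archives unpack into subdirectories.

```python
from pathlib import Path

# Extensions listed in the directory layout above.
VIDEO_SUFFIXES = {".mp4", ".webm", ".mkv"}


def count_videos(video_dir):
    """Count video files under <video_dir>/longvila_videos by extension."""
    root = Path(video_dir) / "longvila_videos"
    if not root.is_dir():
        raise FileNotFoundError(f"expected directory not found: {root}")
    counts = {}
    # rglob covers both a flat layout and nested subdirectories.
    for path in root.rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in VIDEO_SUFFIXES:
            counts[suffix] = counts.get(suffix, 0) + 1
    return counts
```

An empty result or a `FileNotFoundError` usually means the archives were extracted one level too high or too low relative to `$VIDEO_DIR`.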
`$VIDEO_DIR` is the parent directory of `longvila_videos`. For each model, customize the `model_generate` function accordingly. The model generations and output metrics are saved in `runs_${MODEL_PATH}`.
```bash
python eval.py \
--model-path $MODEL_PATH \
--data-path LongVideo-Reason/longvideo-reason@test \
--video-dir $VIDEO_DIR \
--output-dir runs_${MODEL_PATH}
```
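The `--data-path` value above appears to follow a `name@split` convention, with `test` selecting the evaluation split. The sketch below shows one way such a string could be separated; how `eval.py` actually parses it is not stated here, so the helper `split_data_path` and its default split are assumptions.

```python
def split_data_path(data_path, default_split="train"):
    """Split "name@split" into (name, split).

    Falls back to default_split when no "@" is present.
    """
    # rpartition splits on the last "@", so the dataset name stays intact.
    name, sep, split = data_path.rpartition("@")
    if not sep:
        return data_path, default_split
    return name, split
```

For data hosted on the Hugging Face Hub, the resulting pair could then be passed to `datasets.load_dataset(name, split=split)`.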
## Core Contributors
[Yukang Chen](https://yukangchen.com/), [Wei Huang](https://aaron-weihuang.com/), [Shuai Yang](https://andysonys.github.io), [Qinghao Hu](https://tonyhao.xyz/), [Baifeng Shi](https://bfshi.github.io/), [Hanrong Ye](https://sites.google.com/site/yhrspace/home), [Ligeng Zhu](https://lzhu.me/).
We welcome all possible contributions and will acknowledge all contributors clearly.
## Citation
Please consider citing our paper and this framework if they are helpful in your research.
```bibtex
@misc{long-rl,
title = {Long-RL: Scaling RL to Long Sequences},
author = {Yukang Chen and Wei Huang and Shuai Yang and Qinghao Hu and Baifeng Shi and Hanrong Ye and Ligeng Zhu and Zhijian Liu and Pavlo Molchanov and Jan Kautz and Xiaojuan Qi and Sifei Liu and Hongxu Yin and Yao Lu and Song Han},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/NVlabs/Long-RL}},
}
```
```bibtex
@article{chen2025longvila-r1,
title={Scaling RL to Long Videos},
author={Yukang Chen and Wei Huang and Baifeng Shi and Qinghao Hu and Hanrong Ye and Ligeng Zhu and Zhijian Liu and Pavlo Molchanov and Jan Kautz and Xiaojuan Qi and Sifei Liu and Hongxu Yin and Yao Lu and Song Han},
year={2025},
eprint={2507.07966},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
```bibtex
@inproceedings{chen2024longvila,
title={LongVILA: Scaling Long-Context Visual Language Models for Long Videos},
author={Yukang Chen and Fuzhao Xue and Dacheng Li and Qinghao Hu and Ligeng Zhu and Xiuyu Li and Yunhao Fang and Haotian Tang and Shang Yang and Zhijian Liu and Ethan He and Hongxu Yin and Pavlo Molchanov and Jan Kautz and Linxi Fan and Yuke Zhu and Yao Lu and Song Han},
booktitle={The International Conference on Learning Representations (ICLR)},
year={2025},
}
```
## Acknowledgement
- [EasyR1](https://github.com/hiyouga/EasyR1): the codebase we built upon. Thanks for their wonderful work.
- [verl](https://github.com/volcengine/verl): the RL training framework we built upon.
- [vLLM](https://github.com/vllm-project/vllm): we built the rollout engine on vLLM.
- [Flow-GRPO](https://github.com/yifan123/flow_grpo): we refer to Flow-GRPO for the image/video generation RL part.
- [Shot2Story](https://arxiv.org/abs/2312.10300): we curate 18K long videos from Shot2Story.
Provider:
maas
Created:
2025-09-04



