VideoEval-Pro
ModelScope community. Updated 2026-01-06; indexed 2025-05-17.
Download link:
https://modelscope.cn/datasets/TIGER-Lab/VideoEval-Pro
Description:
# VideoEval-Pro
VideoEval-Pro is a robust and realistic long video understanding benchmark made up of open-ended, short-answer QA problems. The dataset is built by rewriting multiple-choice questions (MCQs) from four existing long video understanding benchmarks (Video-MME, MLVU, LVBench, and LongVideoBench) as free-form questions. The paper can be found [here](https://huggingface.co/papers/2505.14640).
The evaluation code and scripts are available at: [TIGER-AI-Lab/VideoEval-Pro](https://github.com/TIGER-AI-Lab/VideoEval-Pro)
## Dataset Structure
Each example in the dataset contains:
- `video`: Name (path) of the video file
- `question`: The question about the video content
- `options`: Original options from the source benchmark
- `answer`: The correct MCQ answer
- `answer_text`: The correct free-form answer
- `meta`: Additional metadata from the source benchmark
- `source`: Source benchmark
- `qa_subtype`: Question task subtype
- `qa_type`: Question task type
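As a minimal sketch of how one record could be checked against the fields listed above, the snippet below reports which documented keys are absent from a dict. The `missing_fields` helper and every value in `sample` are hypothetical placeholders; the actual value types (for example, whether `options` is a list or `meta` is a dict) are not specified here and are assumed for illustration.

```python
# Field names come from the list above; everything else is illustrative.
EXPECTED_FIELDS = (
    "video", "question", "options", "answer", "answer_text",
    "meta", "source", "qa_subtype", "qa_type",
)


def missing_fields(example: dict) -> list[str]:
    """Return the documented field names that are absent from one example."""
    return [name for name in EXPECTED_FIELDS if name not in example]


# Hypothetical record: placeholder values, not real dataset content.
sample = {
    "video": "example_video.mp4",
    "question": "What does the person pick up first?",
    "options": ["A. A cup", "B. A book"],
    "answer": "A",
    "answer_text": "A cup",
    "meta": {},
    "source": "Video-MME",
    "qa_subtype": "placeholder_subtype",
    "qa_type": "placeholder_type",
}

print(missing_fields(sample))  # prints []
```

A check like this is cheap to run over the whole dataset before starting a long evaluation.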
## Evaluation Steps
1. **Download and Prepare Videos**
```bash
# Navigate to videos directory
cd videos
# Merge all split tar.gz files into a single archive
cat videos_part_*.tar.gz > videos_merged.tar.gz
# Extract the merged archive
tar -xzf videos_merged.tar.gz
# [Optional] Clean up the split files and merged archive
rm videos_part_*.tar.gz videos_merged.tar.gz
# After extraction, you will get a directory containing all videos
# The path to this directory will be used as --video_root in evaluation
# For example: 'VideoEval-Pro/videos'
```
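Where `cat` and `tar` are not available, the same merge-and-extract step can be sketched in Python with the standard library. The function names `merge_parts` and `extract_archive` are hypothetical; the file pattern `videos_part_*.tar.gz` is taken from the shell commands above, and parts are assumed to concatenate correctly in sorted name order, as the shell glob implies.

```python
import shutil
import tarfile
from pathlib import Path


def merge_parts(parts_dir: Path, merged: Path,
                pattern: str = "videos_part_*.tar.gz") -> list[Path]:
    """Concatenate split archive parts in sorted name order, like `cat`."""
    parts = sorted(parts_dir.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files matching {pattern} in {parts_dir}")
    with merged.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks
    return parts


def extract_archive(merged: Path, dest: Path) -> None:
    """Extract a gzip-compressed tar archive into dest."""
    with tarfile.open(merged, "r:gz") as tar:
        tar.extractall(dest)


# Example usage (paths are assumptions):
# videos = Path("VideoEval-Pro/videos")
# merge_parts(videos, videos / "videos_merged.tar.gz")
# extract_archive(videos / "videos_merged.tar.gz", videos)
```

Copying with `shutil.copyfileobj` avoids loading a whole part into memory, which matters for multi-gigabyte video archives.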
2. **[Optional] Pre-extract Frames**
To improve efficiency, you can pre-extract frames from videos. The extracted frames should be organized as follows:
```
frames_root/
├── video_name_1/ # Directory name is the video name
│ ├── 000001.jpg # Frame images
│ ├── 000002.jpg
│ └── ...
├── video_name_2/
│ ├── 000001.jpg
│ ├── 000002.jpg
│ └── ...
└── ...
```
After frame extraction, the path to the frames will be used as `--frames_root`. Set `--using_frames True` when running the evaluation script.
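To illustrate how the layout above could be consumed, here is a sketch that picks evenly spaced frame files for one video. The helper `sample_frame_paths` is hypothetical, and uniform sampling is an assumption made for illustration; the evaluation scripts may select frames differently. Whether the directory name includes the video file extension is also not stated here.

```python
from pathlib import Path


def sample_frame_paths(frames_root: Path, video_name: str,
                       num_frames: int) -> list[Path]:
    """Pick up to num_frames evenly spaced frame files for one video."""
    video_dir = frames_root / video_name
    if num_frames <= 0 or not video_dir.is_dir():
        return []
    # Zero-padded names (000001.jpg, ...) sort in temporal order.
    frames = sorted(video_dir.glob("*.jpg"))
    if len(frames) <= num_frames:
        return frames
    if num_frames == 1:
        return [frames[0]]
    # Evenly spaced indices from the first frame to the last, inclusive.
    step = (len(frames) - 1) / (num_frames - 1)
    return [frames[round(i * step)] for i in range(num_frames)]


# Example usage (paths are assumptions):
# picked = sample_frame_paths(Path("./frames"), "video_name_1", 32)
```

Relying on zero-padded filenames keeps the sort order temporal without parsing frame numbers.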
3. **Setup Evaluation Environment**
```bash
# Clone the repository from GitHub
git clone https://github.com/TIGER-AI-Lab/VideoEval-Pro
cd VideoEval-Pro
# Create conda environment from requirements.txt (there are different requirements files for different models)
conda create -n videoevalpro --file requirements.txt
conda activate videoevalpro
```
4. **Run Evaluation**
```bash
cd VideoEval-Pro
# Set PYTHONPATH
export PYTHONPATH=.
# Run evaluation script with the following parameters:
# --video_root: Path to video files folder
# --frames_root: Path to video frames folder [only used when --using_frames is True]
# --output_path: Path to save output results
# --using_frames: Whether to use pre-extracted frames
# --model_path: Path to model
# --device: Device to run inference on
# --num_frames: Number of frames to sample from video
# --max_retries: Maximum number of retries for failed inference
# --num_threads: Number of threads for parallel processing
python tools/*_chat.py \
--video_root <path_to_videos> \
--frames_root <path_to_frames> \
--output_path <path_to_save_results> \
--using_frames <True/False> \
--model_path <model_name_or_path> \
--device <device> \
--num_frames <number_of_frames> \
--max_retries <max_retries> \
--num_threads <num_threads>
# E.g.:
python tools/qwen_chat.py \
--video_root ./videos \
--frames_root ./frames \
--output_path ./results/qwen_results.jsonl \
--using_frames False \
--model_path Qwen/Qwen2-VL-7B-Instruct \
--device cuda \
--num_frames 32 \
--max_retries 10 \
--num_threads 1
```
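When running several configurations (for example, different `--num_frames` values), it can help to assemble the command programmatically. The sketch below builds an argument list from the flags documented above; the helper `build_eval_command` is hypothetical and is not part of the repository.

```python
import shlex


def build_eval_command(script: str, **options: object) -> list[str]:
    """Turn keyword options into a `python <script> --flag value ...` argv list."""
    argv = ["python", script]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv


# Values mirror the qwen_chat.py example above.
cmd = build_eval_command(
    "tools/qwen_chat.py",
    video_root="./videos",
    frames_root="./frames",
    output_path="./results/qwen_results.jsonl",
    using_frames=False,
    model_path="Qwen/Qwen2-VL-7B-Instruct",
    device="cuda",
    num_frames=32,
    max_retries=10,
    num_threads=1,
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# To launch it from the repository root with PYTHONPATH set:
# import os, subprocess
# subprocess.run(cmd, check=True, env={**os.environ, "PYTHONPATH": "."})
```

Passing an argument list to `subprocess.run` rather than a single shell string avoids quoting problems with paths that contain spaces.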
5. **Judge the results**
```bash
cd VideoEval-Pro
# Set PYTHONPATH
export PYTHONPATH=.
# Run judge script *gpt4o_judge.py* with the following parameters:
# --input_path: Path to the saved output results
# --output_path: Path to judged results
# --model_name: Version of the judge model
# --num_threads: Number of threads for parallel processing
python tools/gpt4o_judge.py \
--input_path <path_to_saved_results> \
--output_path <path_to_judged_results> \
--model_name <model_version> \
--num_threads <num_threads>
# E.g.:
python tools/gpt4o_judge.py \
--input_path ./results/qwen_results.jsonl \
--output_path ./results/qwen_results_judged.jsonl \
--model_name gpt-4o-2024-08-06 \
--num_threads 1
```
**Note: the released results are judged by *gpt-4o-2024-08-06***
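Once a judged `.jsonl` file exists, per-category accuracy can be aggregated with a few lines of Python. This is only a sketch: the schema of the judged file is not documented here, so the caller supplies an `is_correct` function, and the `judge_result` key in the usage comment is a hypothetical name. Grouping by `qa_type` assumes that field is carried over from the dataset into the results.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Callable, Iterable


def load_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def accuracy_by_group(records: Iterable[dict],
                      is_correct: Callable[[dict], bool],
                      group_key: str = "qa_type") -> dict[str, float]:
    """Fraction of records judged correct, per value of group_key."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for rec in records:
        group = rec.get(group_key, "unknown")
        totals[group] += 1
        hits[group] += bool(is_correct(rec))
    return {group: hits[group] / totals[group] for group in totals}


# Example usage; "judge_result" is a HYPOTHETICAL key, so inspect the real file first:
# records = load_jsonl(Path("./results/qwen_results_judged.jsonl"))
# print(accuracy_by_group(records, lambda r: r.get("judge_result") is True))
```

Keeping the correctness test as a caller-supplied function means the aggregation does not need to change if the judged output uses a different key or value format.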
Provider: maas
Created: 2025-05-16



