VideoEval-Pro

ModelScope community · Updated 2026-01-06 · Indexed 2025-05-17
Download link:
https://modelscope.cn/datasets/TIGER-Lab/VideoEval-Pro
Resource description:
# VideoEval-Pro

VideoEval-Pro is a robust and realistic long video understanding benchmark containing open-ended, short-answer QA problems. The dataset is constructed by reformatting questions from four existing long video understanding MCQ benchmarks (Video-MME, MLVU, LVBench, and LongVideoBench) into free-form questions.

The paper can be found [here](https://huggingface.co/papers/2505.14640).

The evaluation code and scripts are available at: [TIGER-AI-Lab/VideoEval-Pro](https://github.com/TIGER-AI-Lab/VideoEval-Pro)

## Dataset Structure

Each example in the dataset contains:

- `video`: Name (path) of the video file
- `question`: The question about the video content
- `options`: Original options from the source benchmark
- `answer`: The correct MCQ answer
- `answer_text`: The correct free-form answer
- `meta`: Additional metadata from the source benchmark
- `source`: Source benchmark
- `qa_subtype`: Question task subtype
- `qa_type`: Question task type

## Evaluation Steps

1. **Download and Prepare Videos**

   ```bash
   # Navigate to videos directory
   cd videos

   # Merge all split tar.gz files into a single archive
   cat videos_part_*.tar.gz > videos_merged.tar.gz

   # Extract the merged archive
   tar -xzf videos_merged.tar.gz

   # [Optional] Clean up the split files and merged archive
   rm videos_part_*.tar.gz videos_merged.tar.gz

   # After extraction, you will get a directory containing all videos
   # The path to this directory will be used as --video_root in evaluation
   # For example: 'VideoEval-Pro/videos'
   ```

2. **[Optional] Pre-extract Frames**

   To improve efficiency, you can pre-extract frames from videos. The extracted frames should be organized as follows:

   ```
   frames_root/
   ├── video_name_1/     # Directory name is the video name
   │   ├── 000001.jpg    # Frame images
   │   ├── 000002.jpg
   │   └── ...
   ├── video_name_2/
   │   ├── 000001.jpg
   │   ├── 000002.jpg
   │   └── ...
   └── ...
   ```

   After frame extraction, the path to the frames will be used as `--frames_root`. Set `--using_frames True` when running the evaluation script.

3. **Setup Evaluation Environment**

   ```bash
   # Clone the repository from GitHub
   git clone https://github.com/TIGER-AI-Lab/VideoEval-Pro
   cd VideoEval-Pro

   # Create conda environment from requirements.txt
   # (there are different requirements files for different models)
   conda create -n videoevalpro --file requirements.txt
   conda activate videoevalpro
   ```

4. **Run Evaluation**

   ```bash
   cd VideoEval-Pro

   # Set PYTHONPATH
   export PYTHONPATH=.

   # Run evaluation script with the following parameters:
   # --video_root: Path to video files folder
   # --frames_root: Path to video frames folder [For using_frames]
   # --output_path: Path to save output results
   # --using_frames: Whether to use pre-extracted frames
   # --model_path: Path to model
   # --device: Device to run inference on
   # --num_frames: Number of frames to sample from video
   # --max_retries: Maximum number of retries for failed inference
   # --num_threads: Number of threads for parallel processing
   # (replace * with the model-specific script name, e.g. qwen)
   python tools/*_chat.py \
       --video_root <path_to_videos> \
       --frames_root <path_to_frames> \
       --output_path <path_to_save_results> \
       --using_frames <True/False> \
       --model_path <model_name_or_path> \
       --device <device> \
       --num_frames <number_of_frames> \
       --max_retries <max_retries> \
       --num_threads <num_threads>

   # Example:
   python tools/qwen_chat.py \
       --video_root ./videos \
       --frames_root ./frames \
       --output_path ./results/qwen_results.jsonl \
       --using_frames False \
       --model_path Qwen/Qwen2-VL-7B-Instruct \
       --device cuda \
       --num_frames 32 \
       --max_retries 10 \
       --num_threads 1
   ```

5. **Judge the results**

   ```bash
   cd VideoEval-Pro

   # Set PYTHONPATH
   export PYTHONPATH=.

   # Run judge script gpt4o_judge.py with the following parameters:
   # --input_path: Path to the saved inference results
   # --output_path: Path to judged results
   # --model_name: Version of the judge model
   # --num_threads: Number of threads for parallel processing
   python tools/gpt4o_judge.py \
       --input_path <path_to_saved_results> \
       --output_path <path_to_judged_results> \
       --model_name <model_version> \
       --num_threads <num_threads>

   # Example:
   python tools/gpt4o_judge.py \
       --input_path ./results/qwen_results.jsonl \
       --output_path ./results/qwen_results_judged.jsonl \
       --model_name gpt-4o-2024-08-06 \
       --num_threads 1
   ```

   **Note: the released results are judged by *gpt-4o-2024-08-06***
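
For the optional frame pre-extraction step, the directory layout described above (one folder per video, frames named with six zero-padded digits) can be sketched in Python as follows. This is a minimal illustration, not code from the repository: `frame_path` and `uniform_indices` are hypothetical helper names, and the uniform sampling strategy is an assumption about how `--num_frames` frames might be chosen. The README also does not say whether the directory name includes the video file extension, so the helper takes the name as given.

```python
from pathlib import Path
from typing import List


def frame_path(frames_root: str, video_name: str, index: int) -> Path:
    """Return the path of the index-th frame (1-based) for one video.

    Mirrors the README layout: frames_root/<video_name>/000001.jpg, ...
    """
    if index < 1:
        raise ValueError("frame indices are 1-based")
    return Path(frames_root) / video_name / f"{index:06d}.jpg"


def uniform_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick up to num_frames 1-based indices spread evenly over a video.

    Assumption for illustration only: the evaluation scripts may sample
    frames differently.
    """
    if total_frames < 1 or num_frames < 1:
        raise ValueError("total_frames and num_frames must be positive")
    n = min(num_frames, total_frames)
    step = total_frames / n
    # Take the frame in the middle of each of the n equal-length segments.
    return [int(step * i + step / 2) + 1 for i in range(n)]


if __name__ == "__main__":
    # List the frame files that would be read for a 100-frame video.
    for i in uniform_indices(total_frames=100, num_frames=4):
        print(frame_path("./frames", "video_name_1", i))
```

Keeping the path construction in one helper makes it easy to verify that extracted frames match the layout expected by `--frames_root` before launching a long evaluation run.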

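The fields listed under Dataset Structure can be sanity-checked with a small helper before running evaluation. The sketch below assumes each example is available as a plain Python dictionary. How the records are loaded, and the exact value types and `source` strings, are not specified in the README, so the names `missing_fields` and `count_by_source` and any sample values are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List

# Field names as listed in the README's "Dataset Structure" section.
EXPECTED_FIELDS = (
    "video",
    "question",
    "options",
    "answer",
    "answer_text",
    "meta",
    "source",
    "qa_subtype",
    "qa_type",
)


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def count_by_source(records: Iterable[Dict[str, Any]]) -> Counter:
    """Count examples per source benchmark using the `source` field."""
    return Counter(record["source"] for record in records)


if __name__ == "__main__":
    # Placeholder record: the values are invented purely for illustration.
    example = {name: "" for name in EXPECTED_FIELDS}
    example["source"] = "placeholder-benchmark"
    print(missing_fields(example))
    print(count_by_source([example]))
```

A per-source count is a quick way to confirm that examples from all four source benchmarks are present in a local copy of the data.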
Provider:
maas
Created:
2025-05-16