MotionBench
Source: ModelScope community. Updated 2026-05-02, indexed 2025-08-02.
Download link:
https://modelscope.cn/datasets/ZhipuAI/MotionBench
Resource description:
## 🔥 News
* **`2025.01.06`** 🌟 We released MotionBench, a new benchmark for fine-grained motion comprehension!
## Introduction
In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability — fine-grained motion comprehension — remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models.
### Features
1. **Core Capabilities**: Six core capabilities for fine-grained motion understanding, enabling the evaluation of motion-level perception.
2. **Diverse Data**: MotionBench collects diverse videos from the web and public datasets, along with self-synthetic videos generated via Unity3, capturing a broad distribution of real-world applications.
3. **High-Quality Annotations**: Reliable benchmark with meticulous human annotation and multi-stage quality control processes.
<p align="center">
<img src="./docs/image2.png" width="50%" height="20%">
</p>
## Dataset
### License
Our dataset is under the CC-BY-NC-SA-4.0 license.
MotionBench is intended for academic research only. Commercial use in any form is prohibited. We do not own the copyright of any raw video files.
If there is any infringement in MotionBench, please contact shiyu.huang@aminer.cn or directly raise an issue, and we will remove it immediately.
### Download
Install video2dataset first:
```shell
pip install video2dataset
pip uninstall transformer-engine
```
Then download `video_info.meta.jsonl` from [Hugging Face](https://huggingface.co/datasets/THUDM/MotionBench) and put it in the `data` directory.
Each line of `video_info.meta.jsonl` describes one video sample. Samples in the DEV set include ground-truth answers; samples in the TEST set do not. You can use the DEV set to tune your model, then upload an answer file to our [leaderboard](https://huggingface.co/spaces/THUDM/MotionBench) to see your model's performance.
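As a rough illustration of the DEV/TEST distinction, the sketch below reads a JSONL file and separates records that carry a ground-truth answer from those that do not. The field name `answer` and the helper names are assumptions made for this example; check the actual keys in `video_info.meta.jsonl` before relying on them.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Iterable, Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def has_answer(record: dict, key: str = "answer") -> bool:
    """Assumption: a DEV record stores its ground truth under `key`.

    The real schema may differ (for example, answers nested per question).
    """
    return record.get(key) not in (None, "")


def split_dev_test(records: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Partition records into (dev, test) by presence of a ground-truth answer."""
    dev: list[dict] = []
    test: list[dict] = []
    for record in records:
        (dev if has_answer(record) else test).append(record)
    return dev, test


if __name__ == "__main__":
    meta_path = Path("data/video_info.meta.jsonl")
    if meta_path.exists():
        dev, test = split_dev_test(read_jsonl(meta_path))
        print(f"DEV samples: {len(dev)}, TEST samples: {len(test)}")
```

If the counts look wrong (for instance, every record lands in TEST), the assumed key is probably not the one the file uses, and `has_answer` should be adapted.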
#### Caption dataset
Part of our dataset is derived from our self-annotated detailed video captions. We additionally release a dataset of 5,000 videos with manually annotated fine-grained motion descriptions, which were annotated and double-checked as part of the benchmark annotation process. Each video includes descriptions of its dynamic information, with an annotation density of 12.63 words per second, giving researchers a resource for further development and training to improve video models’ motion-level comprehension.
#### Self-collected dataset
We provide the download link for all self-collected data.
#### Publicly available dataset
For publicly available data, we do not provide the original video files. You can download them from the original repositories:
```
1. MedVid: https://github.com/deepaknlp/MedVidQACL
2. SportsSloMo: https://cvlab.cse.msu.edu/project-svw.html
3. HA-ViD: https://iai-hrc.github.io/ha-vid
```
After downloading the datasets above, use the mapping file to find how the downloaded file names correspond to the file names in our benchmark:
```
data/mapping.json
```
Then cut each video into clips. The last two `_`-separated integers in a test-case name are the clip's start and end times in seconds.
For example, the video file `S10A13I22S1.mp4` is mapped to `ef476626-3499-40c2-bbd6-5004223d1ada` according to the mapping file. To obtain the test case `ef476626-3499-40c2-bbd6-5004223d1ada_58_59` listed in `video_info.meta.jsonl`, cut the clip from second `58` to second `59`; the result is the final video sample used for benchmarking.
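The naming rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the README does not name a cutting tool, so `ffmpeg` is our choice here; the flat `{downloaded name: benchmark id}` layout of `data/mapping.json`, the `.mp4` output name, and all helper names are hypothetical. The command is built but not executed.

```python
from __future__ import annotations


def parse_clip_name(clip_name: str) -> tuple[str, int, int]:
    """Split '<video id>_<start>_<end>' into the id and integer second offsets."""
    video_id, start, end = clip_name.rsplit("_", 2)
    start_s, end_s = int(start), int(end)
    if end_s <= start_s:
        raise ValueError(f"end must be after start in {clip_name!r}")
    return video_id, start_s, end_s


def source_for(video_id: str, mapping: dict[str, str]) -> str:
    """Find the downloaded file name for a benchmark id.

    Assumption: `mapping` is a flat {downloaded name: benchmark id} dict.
    """
    for downloaded_name, benchmark_id in mapping.items():
        if benchmark_id == video_id:
            return downloaded_name
    raise KeyError(video_id)


def ffmpeg_cut_command(src: str, dst: str, start_s: int, end_s: int) -> list[str]:
    """Build (but do not run) an ffmpeg command that re-encodes [start_s, end_s)."""
    duration = end_s - start_s
    return ["ffmpeg", "-y", "-i", src, "-ss", str(start_s), "-t", str(duration), dst]


# Assumed mapping layout, using the example from the text above.
mapping = {"S10A13I22S1.mp4": "ef476626-3499-40c2-bbd6-5004223d1ada"}
video_id, start_s, end_s = parse_clip_name("ef476626-3499-40c2-bbd6-5004223d1ada_58_59")
command = ffmpeg_cut_command(
    source_for(video_id, mapping),
    f"{video_id}_{start_s}_{end_s}.mp4",
    start_s,
    end_s,
)
print(" ".join(command))
```

`rsplit("_", 2)` splits from the right, so only the last two underscores are treated as separators and any underscores inside the video id itself are preserved. Passing the resulting list to `subprocess.run` would perform the cut if `ffmpeg` is installed.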
## Install MotionBench
```shell
pip install -e .
```
## Get Evaluation Results and Submit to Leaderboard
(Note: to try the evaluation quickly, you can use `scripts/construct_random_answers.py` to prepare a random answer file.)
```shell
cd scripts
python test_acc.py
```
After the execution, you will get an evaluation results file `random_answers.json` in the `scripts` directory. You can submit the
results to the [leaderboard](https://huggingface.co/spaces/THUDM/MotionBench).
## 📈 Results
- **Model Comparison:**
<p align="center">
<img src="./docs/tab3.png" width="96%" height="50%">
</p>
- **Benchmark Comparison:**
<p align="center">
<img src="./docs/image3.png" width="96%" height="50%">
</p>
- **Answer Distribution:**
<p align="center">
<img src="./docs/image5.png" width="96%" height="50%">
</p>
## Citation
If you find our work helpful for your research, please consider citing our work.
```bibtex
@misc{hong2024motionbench,
title={MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models},
author={Wenyi Hong and Yean Cheng and Zhuoyi Yang and Weihan Wang and Lefan Wang and Xiaotao Gu and Shiyu Huang and Yuxiao Dong and Jie Tang},
year={2024},
eprint={2501.02955},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
Provider: maas
Created: 2025-07-30



