FudanCVL/MeViSv2
Hugging Face · Updated 2025-11-27 · Indexed 2026-01-03
Download link: https://hf-mirror.com/datasets/FudanCVL/MeViSv2
# MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
**[🏠[Project page]](https://henghuiding.github.io/MeViS/)**  **[📄[arXiv]](https://arxiv.org/abs/2308.08544)**   **[💾[Evaluation Server v1 (legacy)]](https://codalab.lisn.upsaclay.fr/competitions/15094)**  **[🔥[Evaluation Server v2]](https://www.codabench.org/competitions/11420/)**
This repository contains the code for the **ICCV 2023** and **TPAMI 2025** papers:
> [MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation](https://ieeexplore.ieee.org/abstract/document/11130435)
> Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang
> TPAMI 2025
> [MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions](https://arxiv.org/abs/2308.08544)
> Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Chen Change Loy
> ICCV 2023
<table border=1 frame=void>
<tr>
<td><img src="images/bird.gif" width="245"></td>
<td><img src="images/Cat.gif" width="245"></td>
<td><img src="images/coin.gif" width="245"></td>
</tr>
</table>
### Abstract
This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects’ motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes.

<p style="text-align:justify; text-justify:inter-ideograph;width:100%">Figure 1. Examples from <b>M</b>otion <b>e</b>xpressions <b>Vi</b>deo <b>S</b>egmentation (<b>MeViS</b>) showing the dataset’s nature and complexity. The selected target objects are masked in <font color="#FF6403">orange ▇</font>. The expressions in MeViS primarily focus on motion attributes, making it impossible to identify the target object from a single frame. For example, the first example has three parrots with similar appearances, and the target object is identified as “<i>The bird flying away</i>”. This object can only be recognized by capturing its motion throughout the video. The updated MeViS 2024 further provides motion-reasoning and no-target expressions, adds audio expressions alongside text, and provides mask and bounding box trajectory annotations.</p>
<table border="0.6">
<div align="center">
<caption><b>TABLE 1. Scale comparison between MeViS and existing language-guided video segmentation datasets.</b></caption>
</div>
<tbody>
<tr>
<th align="right" bgcolor="BBBBBB">Dataset</th>
<th align="center" bgcolor="BBBBBB">Pub.&Year</th>
<th align="center" bgcolor="BBBBBB">Videos</th>
<th align="center" bgcolor="BBBBBB">Object</th>
<th align="center" bgcolor="BBBBBB">Expression</th>
<th align="center" bgcolor="BBBBBB">Mask</th>
<th align="center" bgcolor="BBBBBB">Obj/Video</th>
<th align="center" bgcolor="BBBBBB">Obj/Expn</th>
<th align="center" bgcolor="BBBBBB">Target</th>
<th align="center" bgcolor="BBBBBB">Multi-target</th>
<th align="center" bgcolor="BBBBBB">No-target</th>
<th align="center" bgcolor="BBBBBB">Audio</th>
</tr>
<tr>
<td align="right"><a href="https://kgavrilyuk.github.io/publication/actor_action/" target="_blank">A2D Sentence</a></td>
<td align="center">CVPR 2018</td>
<td align="center">3,782</td>
<td align="center">4,825</td>
<td align="center">6,656</td>
<td align="center">58k</td>
<td align="center">1.28</td>
<td align="center">1</td>
<td align="center">Actor</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="right" bgcolor="ECECEC"><a href="https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/video-segmentation/video-object-segmentation-with-language-referring-expressions" target="_blank">DAVIS17-RVOS</a></td>
<td align="center" bgcolor="ECECEC">ACCV 2018</td>
<td align="center" bgcolor="ECECEC">90</td>
<td align="center" bgcolor="ECECEC">205</td>
<td align="center" bgcolor="ECECEC">205</td>
<td align="center" bgcolor="ECECEC">13.5k</td>
<td align="center" bgcolor="ECECEC">2.27</td>
<td align="center" bgcolor="ECECEC">1</td>
<td align="center" bgcolor="ECECEC">Object</td>
<td align="center" bgcolor="ECECEC">-</td>
<td align="center" bgcolor="ECECEC">-</td>
<td align="center" bgcolor="ECECEC">-</td>
</tr>
<tr>
<td align="right"><a href="https://youtube-vos.org/dataset/rvos/" target="_blank">ReferYoutubeVOS</a></td>
<td align="center">ECCV 2020</td>
<td align="center">3,978</td>
<td align="center">7,451</td>
<td align="center">15,009</td>
<td align="center">131k</td>
<td align="center">1.86</td>
<td align="center">1</td>
<td align="center">Object</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="right" bgcolor="E5E5E5"><b>MeViS 2023</b></td>
<td align="center" bgcolor="E5E5E5"><b>ICCV 2023</b></td>
<td align="center" bgcolor="E5E5E5"><b>2,006</b></td>
<td align="center" bgcolor="E5E5E5"><b>8,171</b></td>
<td align="center" bgcolor="E5E5E5"><b>28,570</b></td>
<td align="center" bgcolor="E5E5E5"><b>443k</b></td>
<td align="center" bgcolor="E5E5E5"><b>4.28</b></td>
<td align="center" bgcolor="E5E5E5"><b>1.59</b></td>
<td align="center" bgcolor="E5E5E5"><b>Object(s)</b></td>
<td align="center" bgcolor="E5E5E5">7,539</td>
<td align="center" bgcolor="E5E5E5">-</td>
<td align="center" bgcolor="E5E5E5">-</td>
</tr>
<tr>
<td align="right"><b>MeViS 2024</b></td>
<td align="center"><b>TPAMI 2025</b></td>
<td align="center"><b>2,006</b></td>
<td align="center"><b>8,171</b></td>
<td align="center"><b>33,072</b></td>
<td align="center"><b>443k</b></td>
<td align="center"><b>4.28</b></td>
<td align="center"><b>1.58</b></td>
<td align="center"><b>Object(s)</b></td>
<td align="center">8,028</td>
<td align="center">3,503</td>
<td align="center">33,072</td>
</tr>
</tbody>
<colgroup>
<col>
<col>
<col>
<col>
<col>
<col>
<col>
<col>
<col>
</colgroup>
</table>
## MeViS v2 Dataset
**Dataset Split**
- 2,006 videos & 33,458 sentences in total;
- **Train set:** 1,662 videos & 27,502 sentences, used for training;
- **Val<sup>u</sup> set:** 50 videos & 907 sentences, ground-truth provided, used for offline self-evaluation (e.g., ablation study) during training;
- **Val set:** 140 videos & 2,523 sentences, ground-truth **not** provided, used for [**CodaBench online evaluation**](https://www.codabench.org/competitions/11420/);
- **Test set:** will be progressively and selectively released and used for evaluation during the competition periods ([PVUW](https://pvuw.github.io/), [LSVOS](https://lsvos.github.io/)).

We suggest reporting results on both the **Val<sup>u</sup> set** and the **Val set**.
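
As a quick sanity check on the split sizes listed above, the sketch below sums the three released splits and subtracts them from the stated totals. Treating the remainder as the size of the unreleased Test set is an assumption; this page does not state the Test set size.

```python
# Split sizes as listed above: (videos, sentences).
splits = {
    "train": (1662, 27502),
    "valid_u": (50, 907),
    "valid": (140, 2523),
}
total_videos, total_sentences = 2006, 33458

released_videos = sum(videos for videos, _ in splits.values())
released_sentences = sum(sentences for _, sentences in splits.values())

# Assumption: everything not in the released splits belongs to the Test set.
remaining_videos = total_videos - released_videos
remaining_sentences = total_sentences - released_sentences

print("released:", released_videos, "videos,", released_sentences, "sentences")
print("remaining:", remaining_videos, "videos,", remaining_sentences, "sentences")
```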
## Online Evaluation
Please submit your **Val set** results to:
- 💯 v1 server (closing soon): [**CodaLab**](https://codalab.lisn.upsaclay.fr/competitions/15094)
- 💯 v2 server: [**CodaBench**](https://www.codabench.org/competitions/11420/)

We strongly suggest first evaluating your model locally on the **Val<sup>u</sup>** set before submitting your **Val** set results to the online evaluation system.
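
For a local check on the **Val<sup>u</sup>** set, a minimal sketch of one mask-overlap measure is shown below. It assumes region similarity (the Jaccard index, or IoU) between a predicted and a ground-truth binary mask; the official protocol is defined by the evaluation server, and the function name `jaccard` as well as the handling of two empty masks are illustrative assumptions.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # rows of 0/1 values, same shape for both masks


def jaccard(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of identical shape."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: if both masks are empty, count the frame as a perfect match.
    return 1.0 if union == 0 else intersection / union


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
print(jaccard(pred, gt))  # 1 overlapping pixel out of 2 in the union
```

A per-video score could then be obtained by averaging this value over frames, though the exact aggregation used by the server is not specified here.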
## File Structure
The dataset follows a structure similar to [Refer-YouTube-VOS](https://youtube-vos.org/dataset/rvos/). Each split consists of up to three parts: `JPEGImages`, which holds the frame images; `meta_expressions.json`, which provides the referring expressions and video metadata; and `mask_dict.json`, which contains the ground-truth object masks. Ground-truth segmentation masks are stored in COCO RLE format, and expressions are organized similarly to Refer-YouTube-VOS.
Please note that annotations for all frames are provided in the **Train** and **Val<sup>u</sup>** sets, whereas the **Val** set provides only frame images and referring expressions for inference.
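
To illustrate the COCO RLE format mentioned above, here is a pure-Python sketch of a decoder. In practice `pycocotools.mask.decode` is the usual tool; this version only shows how the encoding works. It assumes each mask entry has the standard COCO shape `{"size": [height, width], "counts": ...}`; how entries are keyed inside `mask_dict.json` is not specified on this page, so the helper names and the sample input are hypothetical.

```python
from typing import Dict, List


def decode_counts(encoded: str) -> List[int]:
    """Decode COCO's compressed `counts` string into a list of run lengths."""
    counts: List[int] = []
    pos = 0
    while pos < len(encoded):
        value, shift, more = 0, 0, True
        while more:
            chunk = ord(encoded[pos]) - 48
            value |= (chunk & 0x1F) << (5 * shift)  # 5 payload bits per character
            more = bool(chunk & 0x20)               # continuation flag
            pos += 1
            shift += 1
            if not more and (chunk & 0x10):
                value |= -1 << (5 * shift)          # sign-extend negative deltas
        if len(counts) > 2:
            value += counts[-2]                     # later runs are stored as deltas
        counts.append(value)
    return counts


def decode_rle(rle: Dict) -> List[List[int]]:
    """Turn one COCO RLE entry into a height x width grid of 0/1 values."""
    height, width = rle["size"]
    counts = rle["counts"]
    if isinstance(counts, str):
        counts = decode_counts(counts)
    flat: List[int] = []
    value = 0  # runs alternate, starting with background (0)
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value
    # COCO stores pixels in column-major order.
    return [[flat[col * height + row] for col in range(width)] for row in range(height)]


# Hypothetical 3x3 example: 3 background pixels, 2 foreground, 4 background.
print(decode_rle({"size": [3, 3], "counts": "324"}))
```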
```
mevis
├── train                       // Split Train
│   ├── JPEGImages
│   │   ├── <video #1  >
│   │   ├── <video #2  >
│   │   └── <video #...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
├── valid_u                     // Split Val^u
│   ├── JPEGImages
│   │   └── <video ...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
└── valid                       // Split Val
    ├── JPEGImages
    │   └── <video ...>
    │
    └── meta_expressions.json
```
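
A minimal sketch of walking through `meta_expressions.json` follows. The key names (`videos`, `expressions`, `exp`, `frames`) are assumptions modeled on the Refer-YouTube-VOS layout that this dataset says it resembles; check them against the actual file before relying on them. The sample content is a hypothetical stand-in, not real data.

```python
import json
from typing import Dict, Iterator, Tuple


def iter_expressions(meta: Dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (video_id, expression_id, expression_text) for every expression."""
    for video_id, video in meta["videos"].items():
        for exp_id, exp in video["expressions"].items():
            yield video_id, exp_id, exp["exp"]


# Tiny in-memory stand-in for meta_expressions.json (hypothetical content).
sample = json.loads(
    '{"videos": {"video_a": {"frames": ["00000", "00001"],'
    ' "expressions": {"0": {"exp": "The bird flying away"}}}}}'
)

rows = list(iter_expressions(sample))
for video_id, exp_id, text in rows:
    print(video_id, exp_id, text)
```

With the real data, `sample` would instead be loaded from a path such as `mevis/valid_u/meta_expressions.json` using `json.load`.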
## BibTeX
Please consider citing MeViS if it helps your research.
```latex
@inproceedings{MeViS,
title={{MeViS}: A Large-scale Benchmark for Video Segmentation with Motion Expressions},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Loy, Chen Change},
booktitle={ICCV},
year={2023}
}
```
```latex
@inproceedings{GRES,
title={{GRES}: Generalized Referring Expression Segmentation},
author={Liu, Chang and Ding, Henghui and Jiang, Xudong},
booktitle={CVPR},
year={2023}
}
```
```latex
@article{VLT,
title={{VLT}: Vision-language transformer and query generation for referring segmentation},
author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2023},
publisher={IEEE}
}
```
A majority of videos in MeViS are from [MOSE: Complex Video Object Segmentation Dataset](https://henghuiding.github.io/MOSE/).
```latex
@inproceedings{MOSE,
title={{MOSE}: A New Dataset for Video Object Segmentation in Complex Scenes},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Torr, Philip HS and Bai, Song},
booktitle={ICCV},
year={2023}
}
```
MeViS is licensed under a CC BY-NC-SA 4.0 License. The MeViS data is released for non-commercial research purposes only.
Provider: FudanCVL



