WorldSense
Source: ModelScope community. Updated 2025-12-19; indexed 2025-02-22.
Download link:
https://modelscope.cn/datasets/honglyhly/WorldSense
Description:
## 🔥 News
* **`2025.02.07`** 🌟 We release WorldSense, the first benchmark for real-world omnimodal understanding of MLLMs.
## 👀 WorldSense Overview
We introduce **WorldSense**, the **first** benchmark for assessing multimodal video understanding that simultaneously encompasses _visual, audio, and text_ inputs. In contrast to existing benchmarks, **WorldSense** has the following features:
* **Collaboration of omni-modality**. We design the evaluation tasks to feature a strong coupling of audio and video, requiring models to effectively utilize the **synergistic perception of omni-modality**;
* **Diversity of videos and tasks**. WorldSense comprises **1,662** audio-visual synchronized videos, systematically categorized into **8** primary domains and **67** fine-grained subcategories to cover a broad range of scenarios, together with **3,172** multiple-choice QA pairs across **26** distinct tasks to enable comprehensive evaluation;
* **High-quality annotations**. All the QA pairs are manually labeled by 80 expert annotators with multiple rounds of correction to ensure quality.
Using **WorldSense**, we extensively evaluate a range of state-of-the-art models. The results indicate that existing models face significant challenges in understanding real-world scenarios: the best model reaches only 48% accuracy. We hope **WorldSense** can serve as a platform for evaluating the ability to construct and understand coherent contexts from omni-modal inputs.
<p align="center">
<img src="./asset/distribution.png" width="100%" height="100%">
</p>
## 📐 Dataset Examples
<p align="center">
<img src="./asset/sample.png" width="100%" height="100%">
</p>
## 🔍 Dataset
WorldSense can be downloaded from [Hugging Face](https://huggingface.co/datasets/honglyhly/WorldSense).
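The download can also be scripted. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the repository ID is taken from the link above, while the function name `download_worldsense` and the target directory `./WorldSense` are arbitrary choices for illustration, not part of the official instructions.

```python
# Sketch: fetch the WorldSense dataset repository from the Hugging Face Hub.
# Assumption: `pip install huggingface_hub` has been run beforehand.

REPO_ID = "honglyhly/WorldSense"  # from the dataset URL above


def download_worldsense(local_dir="./WorldSense"):
    """Download all files of the dataset repo and return the local path.

    `local_dir` is a hypothetical destination; choose any writable folder.
    """
    # Imported lazily so this module can be loaded without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # required: this is a dataset repo, not a model
        local_dir=local_dir,
    )
```

Calling `download_worldsense()` triggers the actual transfer; the videos make the repository large, so make sure enough disk space is available first.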
## 🔮 Evaluation Pipeline
📍 **Evaluation**:
Thanks to [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) for reproducing our evaluation. Please refer to the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) repository for details.
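For readers who want a quick sanity check outside VLMEvalKit, multiple-choice accuracy can be sketched as below. This is not the official scoring code: the record keys `task`, `answer`, and `prediction`, the task names `task_a` and `task_b`, and the toy records are all hypothetical, and the real annotation schema may differ.

```python
from collections import defaultdict


def score(records):
    """Return (overall_accuracy, per_task_accuracy) for multiple-choice QA.

    Each record is a dict with hypothetical keys:
      'task'       - task name of the question
      'answer'     - ground-truth option letter, e.g. "A"
      'prediction' - option letter produced by the model
    """
    per_task = defaultdict(lambda: [0, 0])  # task -> [correct, total]
    for r in records:
        # Compare option letters, ignoring case and surrounding whitespace.
        hit = r["prediction"].strip().upper() == r["answer"].strip().upper()
        per_task[r["task"]][0] += int(hit)
        per_task[r["task"]][1] += 1

    correct = sum(c for c, _ in per_task.values())
    total = sum(t for _, t in per_task.values())
    overall = correct / total if total else 0.0
    by_task = {task: c / t for task, (c, t) in per_task.items()}
    return overall, by_task


# Toy data for illustration only; not taken from the benchmark.
toy_records = [
    {"task": "task_a", "answer": "A", "prediction": "A"},
    {"task": "task_a", "answer": "C", "prediction": "B"},
    {"task": "task_b", "answer": "D", "prediction": " d"},
]
overall, by_task = score(toy_records)
print(f"overall: {overall:.3f}", by_task)
```

Grouping by task in the same pass yields the per-task breakdown alongside the overall score, which mirrors how the results are reported per task category.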
📍 **Leaderboard**:
If you want to add your model to our [leaderboard](https://jaaackhongggg.github.io/WorldSense/#leaderboard), please contact **jaaackhong@gmail.com**.
## 📈 Experimental Results
- **Evaluation results of state-of-the-art MLLMs.**
<p align="center">
<img src="./asset/overall_performance.png" width="96%" height="50%">
</p>
- **Fine-grained results by task category.**
<p align="center">
<img src="./asset/fine_task.png" width="96%" height="50%">
</p>
- **Fine-grained results by audio type.**
<p align="center">
<img src="./asset/fine_audio.png" width="96%" height="50%">
</p>
- **In-depth analysis for real-world omnimodal understanding.**
<center>Impact of vision information.</center>
<p align="center">
<img src="./asset/ablation_vision.png" width="96%" height="96%">
</p>
<center>Impact of audio information.</center>
<p align="center">
<img src="./asset/ablation_audio.png" width="96%" height="96%">
</p>
<center>Impact of audio information for Video MLLMs.</center>
<p align="center">
<img src="./asset/ablation_audio_v.png" width="96%" height="96%">
</p>
<center>Impact of video frames.</center>
<p align="center">
<img src="./asset/video_frame_curve.png" width="96%" height="96%">
</p>
## 📖 Citation
If you find WorldSense helpful for your research, please consider citing our work. Thanks!
```bibtex
@article{hong2025worldsenseevaluatingrealworldomnimodal,
title={WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs},
author={Jack Hong and Shilin Yan and Jiayin Cai and Xiaolong Jiang and Yao Hu and Weidi Xie},
year={2025},
eprint={2502.04326},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.04326},
}
```
Provider:
maas
Created:
2025-02-21



