WM-ABench

ModelScope community. Updated 2025-11-27. Indexed 2025-07-05.
Download link:
https://modelscope.cn/datasets/maitrix-org/WM-ABench
Description:
# WM-ABench: An Atomic Evaluation Benchmark of World Modeling abilities of Vision-Language Models

**Paper:** *[Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation](https://arxiv.org/abs/2506.21876)*

WM-ABench is a comprehensive benchmark that evaluates whether Vision-Language Models (VLMs) can truly understand and simulate physical world dynamics, or if they rely on shortcuts and pattern-matching. The benchmark covers 23 dimensions of world modeling across 6 physics simulators with over 100,000 evaluation instances.

**🔗 [Visit the Official Website](https://wm-abench.maitrix.org/)** for leaderboard, paper, and walkthrough video.

![WM-ABench Overview](overview.png)

## What Does WM-ABench Test?

The benchmark evaluates VLMs on two main capabilities:

**🔍 Perception Tasks** - Can the model understand what's happening?
- Visual attributes (color, shape, material)
- Spatial relationships and 3D positioning
- Object counting and quantities
- Temporal relationships
- Motion detection and trajectories

**🔮 Prediction Tasks** - Can the model predict what happens next?
- Physics simulation (dropping, sliding, collisions)
- Mechanistic agent actions (lifting, pushing objects)
- Transitive reasoning (sequential actions)
- Compositional actions (multiple things happening at once)

## Quick Start

```python
# !pip install --upgrade --force-reinstall datasets fsspec huggingface_hub
from datasets import load_dataset

# Load a 100-instance subset for quick evaluation
ds = load_dataset("maitrix-org/WM-ABench", "Compositionality_maniskill_lift_subset")

# Or load the full dataset
# ds = load_dataset("maitrix-org/WM-ABench", "Compositionality_maniskill_lift")

print(ds['test'][0]['prompt'])
```

## Dataset Structure

Each task provides:
- **Video frames** showing the initial scene/entire sequence of motion
- **Multiple choice questions** with 4 possible outcomes
- **Hard negative options** that look similar but are physically incorrect
- **Rich metadata** about object properties and scene setup

The benchmark uses data from 6 simulation frameworks: ThreeDWorld, ManiSkill 2 & 3, Habitat Lab 2.0, Physion, and Carla.

## Key Features

**Avoiding Shortcuts**: Unlike other benchmarks, WM-ABench includes carefully designed "hard negative" answer choices that prevent models from succeeding through visual similarity alone. Models must truly understand physics to get the right answer.

**Atomic Evaluation**: WM-ABench provides fine-grained analysis of specific world modeling capabilities.

**Scale**: Over 100,000 test instances across 23 different dimensions of world modeling.
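Building on the Quick Start, the following is a minimal sketch of a helper that switches between a full configuration and its 100-instance `_subset` counterpart. The helper names `config_name` and `load_task` are hypothetical and not part of the dataset or of the `datasets` library; the only configuration name assumed to exist is the one shown in the Quick Start.

```python
REPO_ID = "maitrix-org/WM-ABench"
SUBSET_SUFFIX = "_subset"


def config_name(task: str, subset: bool = True) -> str:
    """Return the configuration name for a task (hypothetical helper).

    Accepts either the full name or the `_subset` name and normalizes it.
    """
    base = task[: -len(SUBSET_SUFFIX)] if task.endswith(SUBSET_SUFFIX) else task
    return base + SUBSET_SUFFIX if subset else base


def load_task(task: str, subset: bool = True):
    """Load the `test` split of one configuration (requires network access)."""
    # Imported lazily so that config_name() can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(task, subset), split="test")


# Name resolution only; nothing is downloaded here.
print(config_name("Compositionality_maniskill_lift"))
print(config_name("Compositionality_maniskill_lift_subset", subset=False))
```

Calling `load_task("Compositionality_maniskill_lift")` would then fetch the small subset by default, which keeps a first evaluation run short before moving to the full configuration.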
## Data Fields

Fields vary by task, but typically include:
- `source`: Video frames of the initial state
- `prompt`: The question asked to the model
- `image_choices` or `choices`: Possible answer options
- `answer`: Index or value of the correct answer
- `params`: Scene setup details

## Usage Notes

- All configurations have only a `test` split
- Each full configuration has a `_subset` version with 100 instances for faster evaluation
- Ground truth labels are programmatically generated from simulator states
- Human evaluation baselines show moderate to high agreement (Fleiss' κ > 0.4)

## Citation

```bibtex
@misc{gao2025visionlanguagemodelsinternalworld,
  title={Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation},
  author={Qiyue Gao and Xinyu Pi and Kevin Liu and Junrong Chen and Ruolan Yang and Xinqi Huang and Xinyu Fang and Lu Sun and Gautham Kishore and Bo Ai and Stone Tao and Mengyang Liu and Jiaxi Yang and Chao-Jung Lai and Chuanyang Jin and Jiannan Xiang and Benhao Huang and Zeming Chen and David Danks and Hao Su and Tianmin Shu and Ziqiao Ma and Lianhui Qin and Zhiting Hu},
  year={2025},
  eprint={2506.21876},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

**License:** Apache 2.0
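To illustrate how the fields listed under Data Fields might be used for scoring, here is a minimal sketch of a multiple-choice accuracy function. It is not the authors' evaluation code. Because `answer` may be an index or a value depending on the task, the caller states which one applies through the hypothetical `answer_is_index` flag; the toy records below are invented for illustration and only mimic the `choices` and `answer` fields.

```python
from typing import Any, Mapping, Sequence


def accuracy(
    predicted_indices: Sequence[int],
    examples: Sequence[Mapping[str, Any]],
    answer_is_index: bool = True,
) -> float:
    """Fraction of examples where the predicted option matches `answer`.

    predicted_indices[i] is the position of the option the model picked
    for examples[i]. Options are read from `choices`, falling back to
    `image_choices` when `choices` is absent.
    """
    if len(predicted_indices) != len(examples):
        raise ValueError("predictions and examples must have the same length")
    if not examples:
        return 0.0

    correct = 0
    for pred, example in zip(predicted_indices, examples):
        if answer_is_index:
            # `answer` holds the index of the correct option.
            correct += int(pred == example["answer"])
        else:
            # `answer` holds the correct option's value.
            options = example.get("choices") or example.get("image_choices")
            correct += int(options[pred] == example["answer"])
    return correct / len(examples)


# Toy records (not real benchmark data) with 4 options each.
toy = [
    {"choices": ["A", "B", "C", "D"], "answer": 2},
    {"choices": ["A", "B", "C", "D"], "answer": 0},
]
print(accuracy([2, 1], toy))
```

With 4 options per question, uniform random guessing would be expected to score about 0.25, which is a useful floor to compare against when reading results.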

Provider:
maas
Created:
2025-07-01