HiggsBoson/vizdoom-llm-inverse-dynamics-benchmark
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/HiggsBoson/vizdoom-llm-inverse-dynamics-benchmark
Resource description:
---
pretty_name: VizDoom LLM Inverse Dynamics Benchmark
tags:
- vision
- multimodal
- reinforcement-learning
- inverse-dynamics
- vizdoom
- benchmark
size_categories:
- n<1K
---
# VizDoom LLM Inverse Dynamics Benchmark
## Dataset Summary
This dataset is a small, manually inspectable benchmark for evaluating how well
large language models and vision-language models can act as **inverse dynamics
models** in a first-person game environment.
Each example contains a **9-frame temporal window** centered on a labeled
timestep `t`:
`[x_(t-4), x_(t-3), x_(t-2), x_(t-1), x_t, x_(t+1), x_(t+2), x_(t+3), x_(t+4)] -> a_t`
where `a_t` is the action taken at time `t`.
The benchmark is intended for **evaluation**, not large-scale training. It was
designed to support questions like:
- Can an LLM or VLM infer the player action from short-term visual dynamics?
- Which actions are easiest or hardest to distinguish?
- How does model performance vary with human-annotated difficulty?
The current public release contains:
- 1 split: `benchmark`
- 100 examples
- 9 image columns per example for the temporal window
- action labels from a small discrete navigation action space
- optional manual annotation fields for difficulty and difficulty rationale
## Task Framing
This benchmark evaluates **inverse dynamics prediction** from egocentric visual
observations in VizDoom. The goal is to predict the discrete action taken at the
center frame.
Supported action labels:
- `forward`
- `turn_left`
- `turn_right`
- `strafe_left`
- `strafe_right`
- `idle`
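
Because models often answer in free text rather than with an exact label, an evaluation harness usually needs to map the response back onto this label set. The following is a minimal sketch of one way to do that; `parse_action` is a hypothetical helper, not part of the dataset or its pipeline, and treating a response that names zero or several labels as unparseable is an assumption.

```python
import re
from typing import Optional

# The six action labels listed in the dataset card.
ACTIONS = ["forward", "turn_left", "turn_right", "strafe_left", "strafe_right", "idle"]


def parse_action(response: str) -> Optional[str]:
    """Return the single action label named in a model response, else None.

    Hypothetical helper: spaces and hyphens are treated like underscores, and
    a response naming zero or several labels is considered unparseable.
    """
    text = response.lower().replace(" ", "_").replace("-", "_")
    found = {
        action
        for action in ACTIONS
        # Letter boundaries keep "straightforward" from matching "forward".
        if re.search(rf"(?<![a-z]){action}(?![a-z])", text)
    }
    return found.pop() if len(found) == 1 else None
```

Counting unparseable responses as errors, rather than dropping them, keeps accuracy comparable across models.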
## Data Source and Construction
The benchmark was constructed from trajectories collected in VizDoom using a
navigation-oriented action space. Each benchmark row is derived from a labeled
decision timestep and exports:
- the center-frame action label
- 4 frames before the decision
- the frame at the decision timestep
- 4 frames after the decision
- a contact-sheet image for quick human inspection
The benchmark was intentionally stored in a format that is easy to:
- inspect manually
- annotate in a spreadsheet
- export into Hugging Face `Image` features
## Data Fields
Each row includes the following fields:
- `sample_id`: unique sample identifier
- `sequence_image`: a contact sheet showing all 9 frames side by side
- `image_before`: alias for `frame_t`
- `image_after`: alias for `frame_t_plus_1`
- `frame_t_minus_4`
- `frame_t_minus_3`
- `frame_t_minus_2`
- `frame_t_minus_1`
- `frame_t`
- `frame_t_plus_1`
- `frame_t_plus_2`
- `frame_t_plus_3`
- `frame_t_plus_4`
- `action`: ground-truth action name
- `ground_truth_action`: duplicate of `action` for compatibility with older evaluation flows
- `action_id`: integer action id
- `difficulty`: optional human annotation of perceived difficulty
- `difficulty_reason`: optional short human explanation for the difficulty label
- `episode_id`: source episode identifier
- `split`: split name
- `label_frame_index`: timestep index of the labeled action within the episode
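
The field list above can be turned into a quick sanity check before running an evaluation. This is a sketch that assumes each row is available as a plain `dict` keyed by the field names listed here; `check_row` is a hypothetical helper and only verifies what the card states (the columns exist, and `ground_truth_action` duplicates `action`).

```python
# The nine frame columns in temporal order, as named in the field list.
FRAME_COLUMNS = (
    [f"frame_t_minus_{k}" for k in (4, 3, 2, 1)]
    + ["frame_t"]
    + [f"frame_t_plus_{k}" for k in (1, 2, 3, 4)]
)

EXPECTED_COLUMNS = [
    "sample_id", "sequence_image", "image_before", "image_after",
    *FRAME_COLUMNS,
    "action", "ground_truth_action", "action_id",
    "difficulty", "difficulty_reason",
    "episode_id", "split", "label_frame_index",
]


def check_row(row: dict) -> list:
    """Return a list of human-readable problems found in one row (empty if none)."""
    problems = [f"missing column: {name}" for name in EXPECTED_COLUMNS if name not in row]
    if (
        "action" in row
        and "ground_truth_action" in row
        and row["action"] != row["ground_truth_action"]
    ):
        problems.append("action and ground_truth_action disagree")
    return problems
```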
## Intended Uses
This dataset is intended for:
- benchmarking LLMs and VLMs on inverse dynamics prediction
- prompting studies over short visual sequences
- error analysis across action types
- analysis conditioned on human difficulty labels
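
For error analysis across action types, one option is to tabulate per-action accuracy together with the most frequent wrong prediction. The sketch below assumes you already have parallel lists of ground-truth and predicted labels; `per_action_report` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def per_action_report(gold: list, predicted: list) -> dict:
    """Per-action accuracy and the most common wrong prediction for each action."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # confusion[true_label][predicted_label] -> count
    confusion = defaultdict(Counter)
    for true_label, pred_label in zip(gold, predicted):
        confusion[true_label][pred_label] += 1

    report = {}
    for action, counts in confusion.items():
        total = sum(counts.values())
        errors = Counter({p: n for p, n in counts.items() if p != action})
        report[action] = {
            "n": total,
            "accuracy": counts[action] / total,
            "top_confusion": errors.most_common(1)[0][0] if errors else None,
        }
    return report
```

With only 100 examples, per-action counts are small, so these numbers are better read as pointers for manual inspection than as stable estimates.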
Reasonable evaluation setups include:
- prompting with the full 9-frame context
- using only `image_before` and `image_after` as a two-frame baseline
- comparing temporal prompting versus single-frame prompting
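
These setups differ only in which image columns are passed to the model, so they can share one selection helper. The sketch below assumes rows are `dict`-like; the setup names (`full`, `pair`, `single`) and the choice of `frame_t` as the single-frame input are assumptions made for illustration, not something the card prescribes.

```python
FULL_WINDOW = [
    "frame_t_minus_4", "frame_t_minus_3", "frame_t_minus_2", "frame_t_minus_1",
    "frame_t",
    "frame_t_plus_1", "frame_t_plus_2", "frame_t_plus_3", "frame_t_plus_4",
]

# Hypothetical setup names mapped to the columns each setup feeds to the model.
SETUPS = {
    "full": FULL_WINDOW,                      # all 9 frames, in temporal order
    "pair": ["image_before", "image_after"],  # two-frame baseline
    "single": ["frame_t"],                    # assumption: the center frame only
}


def select_frames(row: dict, setup: str) -> list:
    """Return the images for one row under the named evaluation setup."""
    if setup not in SETUPS:
        raise ValueError(f"unknown setup {setup!r}; expected one of {sorted(SETUPS)}")
    return [row[column] for column in SETUPS[setup]]
```

Keeping the prompt text and the model fixed while varying only `setup` isolates the effect of temporal context.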
## Out-of-Scope Uses
This dataset is **not** intended for:
- training a robust general-purpose policy
- estimating real-world human behavior
- evaluating open-world navigation competence beyond this narrow benchmark setup
- drawing broad conclusions about embodied reasoning from only 100 examples
## Annotation Notes
The `difficulty` and `difficulty_reason` fields are manual annotation fields.
Depending on the current uploaded version, some or all of these fields may be
blank. Blank values should be interpreted as **not yet annotated**, not as an
explicit difficulty judgment.
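
In practice this means unannotated rows should be left out of any difficulty-conditioned breakdown rather than grouped into a level of their own. A minimal sketch under that reading follows; the helper names are hypothetical, and the card does not specify the difficulty vocabulary, so the code treats whatever non-blank strings appear as the levels.

```python
from collections import defaultdict


def normalize_difficulty(value):
    """Map blank or missing annotations to None, meaning 'not yet annotated'."""
    if value is None:
        return None
    text = str(value).strip()
    return text or None


def accuracy_by_difficulty(rows, predictions) -> dict:
    """Accuracy per annotated difficulty level.

    rows: iterable of dicts with 'action' and 'difficulty' keys.
    predictions: predicted action labels, parallel to rows.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row, predicted in zip(rows, predictions):
        level = normalize_difficulty(row.get("difficulty"))
        if level is None:
            continue  # unannotated rows are skipped, not scored as a level
        totals[level] += 1
        hits[level] += int(predicted == row["action"])
    return {level: hits[level] / totals[level] for level in totals}
```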
## Limitations
- The dataset is small and intended for evaluation rather than training.
- It comes from a single game domain and a narrow action space.
- Visual ambiguity can arise from motion blur, repeated textures, and weak
frame-to-frame changes.
- Human difficulty labels are subjective and may evolve across versions.
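
To make the small-sample limitation concrete, it can help to report an interval alongside any accuracy figure. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; it is offered as one reasonable option, not as a reporting method prescribed by the benchmark.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

For example, 70 correct out of 100 gives an interval of roughly 0.60 to 0.78, so two models whose accuracies differ by a few points on this benchmark are not clearly distinguishable.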
## Loading the Dataset
```python
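from datasets import load_dataset

# Load the only published split, `benchmark` (100 examples).
dataset = load_dataset(
    "HiggsBoson/vizdoom-llm-inverse-dynamics-benchmark",
    split="benchmark",
)

print(dataset)               # schema and row count
print(dataset.column_names)  # field names described under "Data Fields"
print(dataset[0]["action"])  # ground-truth action of the first example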
```
## Suggested Evaluation Prompt
One natural evaluation prompt is:
> You are given a short sequence of first-person game frames centered on a
> decision timestep. Predict the single action taken at the center frame from:
> `forward`, `turn_left`, `turn_right`, `strafe_left`, `strafe_right`, `idle`.
This dataset is also compatible with pairwise prompting using only
`image_before` and `image_after`.
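
Generating the prompt from the label list, rather than hard-coding it, keeps the options in the prompt and in the scorer from drifting apart. The sketch below reproduces the suggested wording; `build_prompt` is a hypothetical helper, and the closing "Answer with exactly one label." sentence is an added assumption intended to make responses easier to parse.

```python
ACTIONS = ["forward", "turn_left", "turn_right", "strafe_left", "strafe_right", "idle"]


def build_prompt(actions=ACTIONS) -> str:
    """Build the text part of the evaluation prompt from the action label list."""
    options = ", ".join(f"`{action}`" for action in actions)
    return (
        "You are given a short sequence of first-person game frames centered on a "
        "decision timestep. Predict the single action taken at the center frame "
        f"from: {options}. "
        "Answer with exactly one label."  # added instruction, not in the card
    )
```

For the pairwise setup, the phrase "centered on a decision timestep" would need rewording, since the two frames are the decision frame and the one after it.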
## Acknowledgements
This benchmark was built from a VizDoom inverse-dynamics data pipeline and is
intended to support research and course-project style evaluation of LLM/VLM
capabilities on action inference from short visual sequences.
Provider:
HiggsBoson



