wangyz1999/GameplayQA

Name: wangyz1999/GameplayQA
Creator: wangyz1999
Published: 2026-03-26 01:08:55
License: no description provided

Hugging Face · Updated 2026-03-26 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/wangyz1999/GameplayQA

Resource description:

---
license: mit
language:
- en
pretty_name: GameplayQA
tags:
- video-question-answering
- video-understanding
- gameplay-understanding
- multi-video
- benchmark
- ego-centric
- multi-agent
- esport
- game
- temporal-reasoning
- cross-video
task_categories:
- video-text-to-text
- image-text-to-text
- visual-question-answering
- question-answering
configs:
- config_name: default
  data_files:
  - split: train
    path: qa.csv
  features:
  - name: id
    dtype: string
  - name: instance_id
    dtype: string
  - name: game_name
    dtype: string
  - name: video_1_file_name
    dtype: string
  - name: video_2_file_name
    dtype: string
  - name: video_3_file_name
    dtype: string
  - name: video_4_file_name
    dtype: string
  - name: video_5_file_name
    dtype: string
  - name: video_indices
    dtype: string
  - name: video_start
    dtype: int32
  - name: video_end
    dtype: int32
  - name: question_code
    dtype: string
  - name: task_name
    dtype: string
  - name: question_level
    dtype: int32
  - name: question
    dtype: string
  - name: correct_option
    dtype: string
  - name: answer_start
    dtype: string
  - name: answer_end
    dtype: string
  - name: distractor_1
    dtype: string
  - name: distractor_1_type
    dtype: string
  - name: distractor_2
    dtype: string
  - name: distractor_2_type
    dtype: string
  - name: distractor_3
    dtype: string
  - name: distractor_3_type
    dtype: string
- config_name: generalization
  data_files:
  - split: train
    path: qa_generalization.csv
  features:
  - name: id
    dtype: string
  - name: instance_id
    dtype: string
  - name: game_name
    dtype: string
  - name: video_1_file_name
    dtype: string
  - name: video_2_file_name
    dtype: string
  - name: video_3_file_name
    dtype: string
  - name: video_4_file_name
    dtype: string
  - name: video_5_file_name
    dtype: string
  - name: video_indices
    dtype: string
  - name: video_start
    dtype: int32
  - name: video_end
    dtype: int32
  - name: question_code
    dtype: string
  - name: task_name
    dtype: string
  - name: question_level
    dtype: int32
  - name: question
    dtype: string
  - name: correct_option
    dtype: string
  - name: answer_start
    dtype: string
  - name: answer_end
    dtype: string
  - name: distractor_1
    dtype: string
  - name: distractor_1_type
    dtype: string
  - name: distractor_2
    dtype: string
  - name: distractor_2_type
    dtype: string
  - name: distractor_3
    dtype: string
  - name: distractor_3_type
    dtype: string
---

<h1 align="center">
GameplayQA: A Decision-Dense POV-Synced Multi-Video<br>
Understanding Benchmark of 3D Virtual Agents
</h1>

<div align="center">
<a href="https://www.linkedin.com/in/yunzhe-wang/">Yunzhe Wang</a> &emsp;
<a href="https://www.linkedin.com/in/runhui-xu-a9895a2b3/">Runhui Xu</a> &emsp;
<a href="https://www.linkedin.com/in/kexin-zheng-8b5a7a232/">Kexin Zheng</a> &emsp;
<a href="https://www.linkedin.com/in/tianyi-mavis-zhang-3b9514128/">Tianyi Zhang</a>
</div>
<div align="center">
<a href="https://www.linkedin.com/in/jayavibhav/">Jayavibhav N. Kogundi</a> &emsp;
<a href="https://www.linkedin.com/in/soham-hans-b6295215a/"> Soham Hans</a> &emsp;
<a href="https://www.linkedin.com/in/volkan-ustun-9883a74/">Volkan Ustun</a>
</div>

<div align="center">
University of Southern California
</div>
<div align="center">
<b>Corresponding Author</b>: yunzhewa@usc.edu
</div>

<br>
<div align="center">
<a href="https://hats-ict.github.io/gameplayqa/"><img src="https://img.shields.io/static/v1?label=GameplayQA%20Project%20Homepage&message=Website&color=9a33fc&logo=githubpages" style="height: 25px;"></a>
<a href="https://arxiv.org/abs/2603.24329"><img src="https://img.shields.io/static/v1?label=Paper&message=arXiv&color=FF0066&logo=arxiv" style="height: 25px;"></a>
<a href="https://huggingface.co/datasets/wangyz1999/GameplayQA"><img src="https://img.shields.io/static/v1?label=Dataset&message=HuggingFace&color=FF6600&logo=huggingface" style="height: 25px;"></a>
<br>
<a href="https://github.com/wangyz1999/sync-video-label"><img src="https://img.shields.io/static/v1?label=Annotation%20Tool&message=Github&color=6699FF&logo=github" style="height: 25px;"></a>
<a href="https://sync-video-label.vercel.app/"><img src="https://img.shields.io/static/v1?label=Annotation%20Tool&message=Live%20Demo&color=33CCCC&logo=vercel" style="height: 25px;"></a>
<a href="https://www.youtube.com/watch?v=PKedELJ4XT0"><img src="https://img.shields.io/static/v1?label=Annotation%20Tool%20Demo&message=YouTube&color=FF0000&logo=youtube" style="height: 25px;"></a>
<a href="https://huggingface.co/datasets/wangyz1999/X-EGO-CS"><img src="https://img.shields.io/static/v1?label=Related&message=X-EGO-CS&color=FFCC00&logo=huggingface" style="height: 25px;"></a>
</div>

![Framework](framework.jpg)

## Overview

**GameplayQA** is the first benchmark for **POV-Synced Multi-Video Understanding** and **Multi-Agent Video Understanding**, built from ego-centric gameplay footage across 9 commercial 3D games. It features 2.4K multiple-choice questions spanning three levels of reasoning complexity, from single-clip action recognition to synchronized cross-video understanding.

### Why synchronized multi-viewpoint reasoning matters

The ability to reason across multiple synchronized viewpoints is critical in many real-world domains: sports analytics leveraging multiple camera angles, autonomous driving requiring sensor fusion from surround cameras, law enforcement reviewing multiple dashcam feeds, and coordinated robot or drone fleets operating in shared environments. In esports and gaming, cross-POV synchronization and collective reasoning are fundamental to interpreting multi-agent collaboration, which makes gameplay an ideal controlled testbed for developing and evaluating these capabilities in video-language models.

<video controls>
<source src="https://hats-ict.github.io/projects/gameplayqa/static/videos/synced_pov.mp4" type="video/mp4">
</video>

<details>
<summary><b>Abstract</b></summary>
<br>
Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce <b>GameplayQA</b>, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of <b>Self</b>, <b>Other Agents</b>, and the <b>World</b>, a natural decomposition for multi-agent environments. From these annotations, we generate 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.
</details>

<div align="center">
<table>
<tr>
<td align="center"><b>9</b><br>Games</td>
<td align="center"><b>2.4 K</b><br>QA Pairs</td>
<td align="center"><b>2,709</b><br>Annotations</td>
</tr>
<tr>
<td align="center"><b>2,219 s</b><br>Annotated Footage</td>
<td align="center"><b>1.22 /s</b><br>Decision Density</td>
<td align="center"><b>15</b><br>Task Categories</td>
</tr>
</table>
</div>

## Dataset Files

| File | Description |
|------|-------------|
| `qa.csv` | Main benchmark: 2,365 questions across 9 games |
| `qa_generalization.csv` | Generalization benchmark: 213 questions on real-world non-game videos (load with the `generalization` config name) |

Raw annotation files and video clips are located in the `annotation/` folder, organized by project batch (e.g. `annotation/project-batch1/`, `annotation/project-multi-batch1/`). Processed benchmark files ready for evaluation are provided as CSV files at the root level.

### Downloading and Loading the Dataset

The QA metadata (CSV files) can be loaded directly via the HuggingFace `datasets` library:

```python
from datasets import load_dataset

# Main benchmark (default config)
ds = load_dataset("wangyz1999/GameplayQA")

# Generalization benchmark (second positional argument is the config name)
ds_gen = load_dataset("wangyz1999/GameplayQA", "generalization")
```

The raw video clips are stored under `annotation/` and tracked with Git LFS. To download them, clone the full repository:

```bash
git lfs install
git clone https://huggingface.co/datasets/wangyz1999/GameplayQA
```

Each QA row contains `video_start` / `video_end` timestamps in seconds. Evaluation typically requires cropping the referenced clip to the relevant segment before passing it to a model.

<details>
<summary><b>Column Reference</b></summary>

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique question ID (e.g. `q-1767607156491-x4371a218`) |
| `instance_id` | string | Video clip instance ID (e.g. `i-052`) |
| `game_name` | string | Game title. One of: `ApexLegends`, `ARCRaiders`, `Battlefield6`, `CounterStrike2`, `Cyberpunk2077`, `EldenRing`, `Minecraft`, `NoManSky`, `Valheim` |
| `video_1_file_name` … `video_5_file_name` | string \| null | Paths to the video clip(s) (under `annotation/`). Single-video questions populate only `video_1_file_name`; multi-video questions populate up to 5 columns. |
| `video_indices` | string | 0-based index/indices of the video(s) used, comma-separated (e.g. `0` or `0,1,2`) |
| `video_start` | int | Start time of the relevant video segment (seconds) |
| `video_end` | int | End time of the relevant video segment (seconds) |
| `question_code` | string | Fine-grained question type code (e.g. `OA-IDENT`, `V1-SA2V2-SA-IDENT`) |
| `task_name` | string | Human-readable task category derived from `question_code`. See [Task Categories](#task-categories) below. |
| `question_level` | int | Context scope level: `1` = Single Reference, `2` = Temporal, `3` = Cross-Video |
| `question` | string | The question text |
| `correct_option` | string | The correct answer text |
| `answer_start` | int \| null | Start time of the answer evidence segment (seconds). Null for existence/binary questions. |
| `answer_end` | int \| null | End time of the answer evidence segment (seconds). Null for existence/binary questions. |
| `distractor_1` … `distractor_3` | string \| null | Wrong answer options (up to 3) |
| `distractor_1_type` … `distractor_3_type` | string \| null | Distractor source. One of: `lexical`, `temporal`, `role`, `scene`, `binary`, `count`, `order`, `intent`, `cross-video` |

</details>

## Taxonomy

### Entity Types: Self, Other, World

GameplayQA organizes perception in interactive 3D environments around a **tripartite entity decomposition** that mirrors how agents must reason in multi-agent settings.

<div align="center">
<table>
<tr>
<td align="center"><img src="sow.jpg" alt="Self-Other-World framework" width="400"/></td>
<td align="center"><img src="taxonomy.jpg" alt="Question taxonomy" width="556"/></td>
</tr>
</table>
</div>

| Entity | Code | Description |
|--------|------|-------------|
| **Self** | SA / SS | The POV agent. **Self-Action (SA)** captures what the player *does* (shooting, reloading, jumping). **Self-State (SS)** captures the player's *condition* (health, ammo, equipped weapon). |
| **Other** | OA / OS | External agents: teammates, enemies, NPCs. **Other-Action (OA)** and **Other-State (OS)** mirror the Self primitives for other autonomous entities. |
| **World** | WO / WE | The shared environment. **World-Object (WO)** covers static or interactive elements (supply crates, vehicles, landmarks). **World-Event (WE)** covers dynamic occurrences (explosions, system notifications, environmental triggers). |

### Cognitive Levels

Questions are organized by the amount of temporal and cross-video context required:

| Level | Scope | What it tests |
|:---:|---|---|
| **L1** | Single Reference | Basic perception within a single video segment: recognizing what happened, what state something was in, or what object/event was present. |
| **L2** | Temporal | Reasoning that requires grounding across time: linking entities across different moments, localizing when something happened, identifying absences, counting occurrences, ordering events, or inferring intent. |
| **L3** | Cross-Video | Reasoning across multiple synchronized POV videos: matching events across perspectives, ordering events across videos, or identifying which POV performed a given action. |

### Task Categories

Questions are organized into 15 task categories across 3 context scope levels (2,365 total questions).

| Scope | Task | Description | Example Codes | #Q | Avg Dur. |
|---|---|---|---|---:|---:|
| **Single Reference** (L1, 469) | Action Recognition | Identify or verify existence of self & others' actions | `SA-IDENT`, `OA-EXIST`, ... | 162 | 10.0s |
| | State Recognition | Identify or verify existence of self & others' states | `SS-IDENT`, `OS-EXIST`, ... | 147 | 10.1s |
| | Object Recognition | Identify or verify existence of world objects in scene | `WO-IDENT`, `WO-EXIST` | 70 | 9.3s |
| | Event Recognition | Identify world events occurring in the environment | `WE-IDENT` | 61 | 8.4s |
| | Static Object Count | Count static objects present in the scene | `WO-COUNT` | 29 | 21.3s |
| **Temporal** (L2, 1383) | Cross-Entity Referring | Link one entity to another (X2Y reasoning) | `SA2SS-IDENT`, `WO2SS-EXIST`, ... | 423 | 23.0s |
| | Timestamp Referring | Given time range [t1–t2], identify what entity exists | `TR2SS-IDENT`, `TR2SA-IDENT`, ... | 81 | 24.3s |
| | Time Localization | Locate exact timestamp when an event occurred | `SA-TIME`, `WE-TIME`, ... | 281 | 28.4s |
| | Absence Recognition | Identify actions/states that did not occur over a timespan | `SA-ABSENT`, `SS-ABSENT`, ... | 195 | 38.9s |
| | Occurrence Count | Count how many times an action/event happened | `SA-COUNT`, `OA-COUNT`, `WE-COUNT` | 75 | 26.4s |
| | Ordering | Determine temporal order sequence of actions/events | `SA-ORDER`, `OA-ORDER`, `MIX-ORDER` | 180 | 32.6s |
| | Intent Identification | Identify underlying intent or goal behind actions | `SA-INTENT`, `OA-INTENT` | 148 | 23.0s |
| **Cross-Video** (L3, 513) | Sync-Referring | Link corresponding entities across synchronized videos | `V1-SA2-V2OA`, `V1-WO2-V2SS`, ... | 207 | 94.7s |
| | Cross-Video Ordering | Determine event order sequence across multiple videos | `SA-ORDER-MV`, `MIX-ORDER-MV`, ... | 117 | 110.0s |
| | POV Identification | Identify who performed what action in which video | `SA-POV-ID`, `OA-POV-ID`, ... | 189 | 91.6s |

## Source Data

GameplayQA is built from first-person gameplay footage sourced from 9 commercially released multiplayer games spanning diverse genres:

- **Single-POV games:** Minecraft, Apex Legends, No Man's Sky, Elden Ring, Cyberpunk 2077, Valheim
- **Multi-POV synchronized games:** Counter-Strike 2, Battlefield 6, ARC Raiders

Videos were sourced from YouTube, Twitch streams, and existing datasets (Counter-Strike 2 footage from [X-EGO-CS](https://huggingface.co/datasets/wangyz1999/X-EGO-CS)). For multi-POV games, groups of streamers who played together in the same match were identified and their individual recordings were manually time-aligned to construct temporally synchronized multi-video sets.

## Question Generation

Questions are generated through a combinatorial template-based algorithm that systematically combines verified annotation labels across five orthogonal dimensions: number of videos (single/multi), context target (summative/timestamp/entity/cross-video referring), entity type (SA/SS/OA/OS/WO/WE), distractor type, and question form. The algorithm initially produces ~400K candidate QA pairs; strategic downsampling to 4K enforces balanced category coverage before quality assurance yields the final 2,365 gold-standard pairs.
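
The combinatorial step and the balanced downsampling described above can be pictured with a small sketch. This is not the authors' generator: the question forms, the reduced distractor list, the validity rule, and the round-robin sampling strategy are all assumptions made for illustration, and the function names `enumerate_candidates` and `balanced_downsample` are hypothetical.

```python
import itertools
import random
from collections import defaultdict

# Illustrative values only. The video counts, context targets, and entity types
# are named in this README; the distractor subset and the question forms are
# hypothetical stand-ins for the real template dimensions.
NUM_VIDEOS = ["single", "multi"]
CONTEXT_TARGETS = ["summative", "timestamp", "entity", "cross-video"]
ENTITY_TYPES = ["SA", "SS", "OA", "OS", "WO", "WE"]
DISTRACTOR_TYPES = ["lexical", "temporal", "role", "scene"]
QUESTION_FORMS = ["IDENT", "EXIST"]


def enumerate_candidates():
    """Return one candidate spec per valid combination of the five dimensions."""
    specs = []
    for videos, target, entity, distractor, form in itertools.product(
        NUM_VIDEOS, CONTEXT_TARGETS, ENTITY_TYPES, DISTRACTOR_TYPES, QUESTION_FORMS
    ):
        # Assumed validity rule: cross-video referring needs more than one video.
        if target == "cross-video" and videos == "single":
            continue
        specs.append(
            {
                "videos": videos,
                "target": target,
                "entity": entity,
                "distractor": distractor,
                "form": form,
            }
        )
    return specs


def balanced_downsample(candidates, key, budget, seed=0):
    """Pick up to `budget` items, cycling over categories so none dominates."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for cand in candidates:
        buckets[key(cand)].append(cand)
    for items in buckets.values():
        rng.shuffle(items)

    picked = []
    # Round-robin over the categories until the budget is met or all are empty.
    while len(picked) < budget and any(buckets.values()):
        for name in sorted(buckets):
            if buckets[name] and len(picked) < budget:
                picked.append(buckets[name].pop())
    return picked


candidates = enumerate_candidates()
sample = balanced_downsample(candidates, key=lambda c: c["entity"], budget=60)
print(len(candidates), len(sample))
```

Round-robin sampling is only one way to enforce balanced coverage; stratified sampling with per-category quotas would serve the same purpose.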

<details>
<summary><b>Distractor Types</b></summary>

| Type | Description |
|------|-------------|
| `lexical` | Text-based variants of the correct answer (synonyms, antonyms, attribute changes) |
| `temporal` | Events that did occur, but outside the queried time window |
| `role` | Correct event but attributed to the wrong agent |
| `scene` | Plausible events that never occurred in the video |
| `cross-video` | Events from a different synchronized video perspective |
| `binary` | Negation of a binary (true/false) answer |
| `count` | Incorrect quantity for counting questions |
| `order` | Incorrect temporal ordering of events |
| `intent` | Alternative plausible motivations for an action |

</details>

## Quality Assurance

**Language prior filtering:** Each question is queried with text only (no video) using Gemini Flash with k=3 trials. Questions where the model consistently selects the correct answer without visual grounding are removed to prevent exploitation of statistical regularities in question phrasing.

**Human evaluation:** A stratified sample of 120 questions covering all question types was reviewed by annotators, who verified that (1) the video contains exactly one unambiguous correct answer, and (2) the question adheres to the semantics of its question code. Questions flagged as faulty (~8%) were corrected or removed.

## Annotations

### Annotation Process

Videos were annotated using **dense multi-track timeline captioning** via a custom-built annotation tool, **[sync-video-label](https://github.com/wangyz1999/sync-video-label)**. Each of the six entity types (SA, SS, OA, OS, WO, WE) is treated as an independent annotation track, and labels within and across tracks can overlap temporally to capture concurrent events.

The process follows a two-stage human-in-the-loop workflow:

1. **Stage 1: AI-assisted generation + human verification.** Gemini Pro generates candidate labels and distractors (3,632 predictions). Four graduate student annotators then verify and refine: 31.1% of predicted labels were deleted, 42.7% were edited (61.9% requiring caption changes, 42.2% requiring temporal boundary adjustments), and 7.6% of final labels were added manually.
2. **Stage 2: Independent review.** A separate annotator reviews all labels, making further adjustments to ~12% of labels.

A live read-only demo of the annotation interface is available at [sync-video-label.vercel.app](https://sync-video-label.vercel.app/). See the [demo video](https://www.youtube.com/watch?v=PKedELJ4XT0) for a walkthrough.

### Annotators

The annotation team consisted of 5 graduate students (ages 21–31). All annotators were experienced gamers: 60% play 3–5 times per week, and 60% have 8+ years of gaming experience. Roles were distributed as 4 labelers and 2 evaluators, with one participant serving in both capacities for cross-stage consistency.

### Label Statistics

A total of 2,709 true labels were annotated across 2,219 seconds of footage, yielding a decision density of **ρ ≈ 1.22 labels/second**, or roughly one decision-relevant event per second.

| Label Type | Count | Share |
|------------|------:|------:|
| Self-Action (SA) | 658 | 24.3% |
| Self-State (SS) | 729 | 26.9% |
| Other-Action (OA) | 160 | 5.9% |
| Other-State (OS) | 190 | 7.0% |
| World-Event (WE) | 417 | 15.4% |
| World-Object (WO) | 555 | 20.5% |
| **Total** | **2,709** | **100%** |

## Related Datasets

- **[wangyz1999/X-EGO-CS](https://huggingface.co/datasets/wangyz1999/X-EGO-CS)**: Cross-ego synchronized gameplay data from Counter-Strike 2, used as the source for the `CounterStrike2` split in this benchmark.
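
As a closing illustration of the column layout described in the Column Reference, the sketch below turns one QA row into a shuffled multiple-choice question and parses `video_indices`. It is a minimal example under stated assumptions, not part of the official evaluation code: the helper names `parse_video_indices` and `build_choices` are hypothetical, the example row is invented, and missing distractors are assumed to arrive as `None`, an empty string, or a NaN float.

```python
import random


def parse_video_indices(value):
    """Turn a comma-separated string such as "0,1,2" into [0, 1, 2]."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_choices(row, seed=0):
    """Shuffle the correct option together with the available distractors.

    Returns the lettered option strings and the letter of the correct one.
    """
    options = [row["correct_option"]]
    for i in (1, 2, 3):
        text = row.get(f"distractor_{i}")
        # Skip missing distractors (None, empty string, or a NaN float).
        if isinstance(text, str) and text:
            options.append(text)

    random.Random(seed).shuffle(options)
    letters = "ABCD"
    answer = letters[options.index(row["correct_option"])]
    lettered = [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    return lettered, answer


# Hypothetical row for illustration; the values are not taken from qa.csv.
example_row = {
    "question": "What did the player do after reloading?",
    "correct_option": "Threw a grenade",
    "distractor_1": "Opened a supply crate",
    "distractor_2": "Switched weapons",
    "distractor_3": None,
    "video_indices": "0,1",
}

choices, answer = build_choices(example_row)
print(example_row["question"])
print("\n".join(choices))
print("Answer:", answer)
print("Videos:", parse_video_indices(example_row["video_indices"]))
```

Fixing the shuffle seed keeps the option order reproducible across runs, which matters when comparing models on the same prompts.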

## Citation

**If our research is helpful to you, please cite our paper:**

```bibtex
@article{wang2026gameplayqa,
  title         = {GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents},
  author        = {Wang, Yunzhe and Xu, Runhui and Zheng, Kexin and Zhang, Tianyi and Kogundi, Jayavibhav Niranjan and Hans, Soham and Ustun, Volkan},
  year          = {2026},
  eprint        = {2603.24329},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2603.24329}
}
```

Provider:

wangyz1999