
GameFactory-Dataset

ModelScope community | Updated 2025-11-27 | Indexed 2025-09-06
Download link:
https://modelscope.cn/datasets/KwaiVGI/GameFactory-Dataset
Resource description:
<div align="center">
<h1>GameFactory: Creating New Games with Generative Interactive Videos</h1>
<p>
<a href="https://yujiwen.github.io/gamefactory">[Project page]</a>
<a href="https://arxiv.org/abs/2501.08325">[ArXiv]</a>
<a href="https://huggingface.co/datasets/KwaiVGI/GameFactory-Dataset">[Dataset]</a>
</p>
</div>

<div align="center">

**[Jiwen Yu<sup>1*&dagger;</sup>](https://yujiwen.github.io/), [Yiran Qin<sup>1*</sup>](https://github.com/IranQin), <br>
[Xintao Wang<sup>2&ddagger;</sup>](https://xinntao.github.io/), [Pengfei Wan<sup>2</sup>](https://scholar.google.com/citations?user=P6MraaYAAAAJ&hl=en), [Di Zhang<sup>2</sup>](https://openreview.net/profile?id=~Di_ZHANG3), [Xihui Liu<sup>1&ddagger;</sup>](https://xh-liu.github.io/)** <br>
<sup>1</sup>The University of Hong Kong <sup>2</sup>Kuaishou Technology <br>
&dagger;: Intern at KwaiVGI, Kuaishou Technology, *: Equal Contribution, &ddagger;: Corresponding Authors

</div>

## 🚀 GF-Minecraft Dataset

### 1. Dataset Introduction

The [**GF-Minecraft Dataset**](https://huggingface.co/datasets/KwaiVGI/GameFactory-Dataset) is designed to meet three key requirements for action-controllable video generation:

1. **Customizable actions** for cost-effective, large-scale data collection.
2. **Unbiased action sequences** to ensure diverse and low-probability action combinations.
3. **Diverse scenes** with textual descriptions to capture scene-specific physical dynamics.

We use [**Minecraft**](https://minedojo.org/) as the data collection platform due to its comprehensive API, diverse open-world environment, and extensive action space. By executing predefined, randomized action sequences, we collected **70 hours of gameplay video** with action annotations.

To enhance diversity, we preconfigured three biomes (forest, plains, desert), three weather conditions (clear, rain, thunder), and six times of day (e.g., sunrise, noon, midnight), resulting in more than **2,000 video clips**.
Each clip contains **2,000 frames** and is paired with textual descriptions generated by the multimodal language model [**MiniCPM-V**](https://github.com/OpenBMB/MiniCPM-V) (examples shown below). This dataset provides a strong foundation for training action-controllable and generalizable game video generation models.

### 2. File Structure

The **GF-Minecraft Dataset** can be downloaded from [**HuggingFace**](https://huggingface.co/datasets/KwaiVGI/GameFactory-Dataset). Upon download, the dataset will be organized as follows:

```
GF-Minecraft
├── data_2003
│   ├── part_aa
│   ├── part_ab
│   ├── part_ac
│   ├── part_ad
│   ├── part_ae
│   └── part_af
└── data_269.zip
```

To prepare the dataset for use, navigate to the `data_2003` folder and merge the parts into a single zip file using the following command:

```bash
cat part_* > data_2003.zip
```

After extracting `data_2003.zip` and `data_269.zip`, the dataset will be organized as follows:

```
GF-Minecraft
├── data_2003
│   ├── annotation.csv
│   ├── metadata
│   │   ├── seed_1_part_1.json
│   │   ├── seed_2_part_2.json
│   │   ├── seed_3_part_3.json
│   │   └── ...
│   └── video
│       ├── seed_1_part_1.mp4
│       ├── seed_2_part_2.mp4
│       ├── seed_3_part_3.mp4
│       └── ...
└── data_269
    ├── annotation.csv
    ├── metadata
    │   ├── seed_1_part_1.json
    │   ├── seed_2_part_2.json
    │   ├── seed_3_part_3.json
    │   └── ...
    └── video
        ├── seed_1_part_1.mp4
        ├── seed_2_part_2.mp4
        ├── seed_3_part_3.mp4
        └── ...
```

We have also placed a file `sample-10.zip` ([link](https://huggingface.co/datasets/KwaiVGI/GameFactory-Dataset/blob/main/GF-Minecraft/sample-10.zip)) in the `GF-Minecraft/` directory, which contains 5 video files and their corresponding metadata from both `data_2003/` and `data_269/` folders. This can be used for quick reference of the file format.

#### Directory Details

1. **`annotation.csv`**: A CSV file containing the textual descriptions for all video clips.
   Each row corresponds to a video clip and includes the following columns:
   - **Original video name**: The name of the original video from which the clip is extracted.
   - **Start frame index**: The starting frame of the clip within the original video.
   - **End frame index**: The ending frame of the clip within the original video.
   - **Prompt**: The textual description associated with the clip.
2. **`metadata/`**: A folder containing JSON files with detailed metadata for each video clip.
3. **`video/`**: A folder containing the video files in `.mp4` format. The filenames (e.g., `seed_1_part_1.mp4`) correspond to their associated metadata and annotation records.

#### Explanation of Dataset Parts

- **`data_2003/`**: Contains the first part of the dataset, including both mouse movement actions and keyboard actions.
- **`data_269/`**: Contains the second part of the dataset, similarly structured to `data_2003/`, but includes only keyboard actions.

### 3. JSON File Details

#### Example JSON:

```json
{
    "biome": "plains",
    "initial_weather": "rain",
    "start_time": "Sunset",
    "actions": {
        "0": {
            "ws": 2,
            "ad": 0,
            "scs": 3,
            "pitch": 0.0,
            "yaw": 0.0,
            "pitch_delta": 0.0,
            "yaw_delta": 0.0,
            "pos": [-228.5, 75.0, 246.4]
        },
        "1": {
            "ws": 2,
            "ad": 1,
            "scs": 3,
            "pitch": 0.0,
            "yaw": 0.0,
            "pitch_delta": 0.0,
            "yaw_delta": 0.0,
            "pos": [-228.43, 75.0, 246.3]
        }
    }
}
```

Each JSON file in the `metadata/` folder provides detailed metadata for a corresponding video clip. The **most important information in the JSON file is the `actions` field**, which describes the sequence of actions executed during the video. Below are the key details:

- **actions**: A dictionary indexed by timestamps (e.g., `"0"`, `"1"`, etc.) representing the sequence of actions. Each video contains **2,000 frames**, and the actions for frames `1` to `1,999` correspond to the information in entries `"1"` to `"1999"` in the `actions` dictionary.
  The information in the `"0"` entry can be ignored as it does not correspond to any frame in the video. Each action entry includes:
  - **`ws`**: Encodes forward (`1`), backward (`2`), or no movement (`0`) along the W/S axis.
  - **`ad`**: Encodes left (`1`), right (`2`), or no movement (`0`) along the A/D axis.
  - **`scs`**: Represents special control states, including jumping (space key, `1`), sneaking (shift key, `2`), sprinting (ctrl key, `3`), or no action (`0`).
  - **`pitch`**: The vertical angle of the camera.
  - **`yaw`**: The horizontal angle of the camera.
  - **`pitch_delta`** and **`yaw_delta`**: Changes in pitch and yaw between consecutive frames. These values need to be multiplied by `15` to convert them into degrees.
  - **`pos`**: A 3D coordinate `[x, y, z]` representing the agent's position in the game world.

Other fields in the JSON file provide context for the actions:

- **biome**: Specifies the biome type where the video was recorded (`plains`, `forest`, or `desert`).
- **initial_weather**: Describes the weather condition at the start of the video (`clear`, `rain`, or `thunder`).
- **start_time**: Indicates the time of day at the start of the video (`"Starting of a day"`, `"Noon, sun is at its peak"`, `"Sunset"`, `"Beginning of night"`, `"Midnight, moon is at its peak"`, `"Beginning of sunrise"`).

### 4. Useful scripts

#### Invalid Jump and Collision Detection

The `detection.py` script processes all JSON files in the specified `metadata` directory to detect and mark collisions and invalid jumps. The updated JSON files are saved in a new `metadata-detection` directory.

Run the script with the following command:

```bash
python detection.py --dir_name Your_Directory_Root
```

Ensure the directory specified in `--dir_name` contains the following subdirectories:

- `video/`: Contains the video files.
- `metadata/`: Contains the JSON files to be processed.

#### Why Detect Invalid Jumps and Collisions?
**Invalid Jumps**: During data collection, the agent sometimes receives a jump action for several consecutive frames. However, once the agent is in the air, the jump action becomes ineffective; this is what we call an "invalid jump." By detecting and removing these invalid jump actions in the metadata, we simplify the learning process for the model by ensuring it only processes valid and meaningful actions.

**Collisions**: Collision detection provides additional information about the agent's interaction with the environment. Collisions, such as the agent hitting a wall or an obstacle, can be treated as a unique action signal. Incorporating this information into the metadata helps the model better understand environmental constraints and improves its ability to learn action dynamics. Of course, it is also possible to not provide this information and let the network learn it by itself.

#### Action Visualization

The provided script `visualize.py` allows users to annotate input videos with action information and save the output as an annotated video. Simply run the script directly to execute the visualization process:

```bash
python visualize.py
```

The script uses a predefined action format, where actions are described as a list of entries. Each entry includes:

- A frame range for which the action is active.
- A string encoding the specific action details.
- Optionally, a list of specific frames where the space key (jump) is pressed.

For example, `[[25, "0 0 0 0 0 0 0 0 0.5"], [77, "1 0 0 0 0 0 0 0 0"], "15 30 50"]`:

- `[25, "0 0 0 0 0 0 0 0 0.5"]` indicates an action lasting until frame 25 with specific movement and control states.
- `[77, "1 0 0 0 0 0 0 0 0"]` specifies a new action starting from frame 26 and lasting until frame 77.
- `"15 30 50"` lists the frames where the space key (jump) is pressed, such as frames 15, 30, and 50.

The action string consists of `"w s a d shift ctrl collision delta_pitch delta_yaw"`.
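
To make the action format above concrete, here is a minimal sketch that expands such a list into one record per frame. It is not the code used by `visualize.py`: the function name `expand_actions`, the `first_frame` parameter, and the assumption that the first segment starts at frame 1 with inclusive end frames are illustrative choices only.

```python
from typing import Dict, List, Union

# Field order taken from the action string described above.
FIELDS = ["w", "s", "a", "d", "shift", "ctrl", "collision", "delta_pitch", "delta_yaw"]


def expand_actions(spec: List[Union[list, str]], first_frame: int = 1) -> Dict[int, dict]:
    """Expand [[end_frame, "action string"], ..., "jump frames"] into per-frame records.

    Assumption (not confirmed by the source): the first segment starts at
    `first_frame`, and each end frame is inclusive.
    """
    jump_frames = set()
    segments = spec
    # The optional trailing string lists the frames where the space key is pressed.
    if spec and isinstance(spec[-1], str):
        jump_frames = {int(token) for token in spec[-1].split()}
        segments = spec[:-1]

    per_frame: Dict[int, dict] = {}
    start = first_frame
    for end_frame, action_string in segments:
        values = [float(token) for token in action_string.split()]
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
        action = dict(zip(FIELDS, values))
        for frame in range(start, end_frame + 1):
            per_frame[frame] = {**action, "jump": frame in jump_frames}
        # The next segment begins right after this one ends.
        start = end_frame + 1
    return per_frame


example = [[25, "0 0 0 0 0 0 0 0 0.5"], [77, "1 0 0 0 0 0 0 0 0"], "15 30 50"]
frames = expand_actions(example)
print(len(frames), frames[25]["delta_yaw"], frames[26]["w"], frames[15]["jump"])
```

Keeping the jump frames separate from the segments mirrors the format itself, where jumps are listed as individual frames rather than as part of the nine-field action string.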

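
The `actions` encoding described in section 3 (JSON File Details) could be decoded along the following lines. This is a sketch under stated assumptions, not an official loader: the helper name `decode_actions` and the output keys are hypothetical, and the file path in the usage comment is only an example of the documented layout.

```python
import json

# Integer codes as documented for the `ws`, `ad`, and `scs` fields.
WS = {0: "none", 1: "forward", 2: "backward"}
AD = {0: "none", 1: "left", 2: "right"}
SCS = {0: "none", 1: "jump", 2: "sneak", 3: "sprint"}
DELTA_TO_DEGREES = 15.0  # pitch_delta / yaw_delta are multiplied by 15 to get degrees


def decode_actions(metadata: dict) -> list:
    """Return one readable record per video frame, skipping the unused "0" entry."""
    actions = metadata["actions"]
    decoded = []
    # Keys are strings, so sort numerically rather than lexicographically.
    for key in sorted(actions, key=int):
        if key == "0":
            continue  # entry "0" does not correspond to any frame
        entry = actions[key]
        decoded.append({
            "frame": int(key),
            "ws": WS[entry["ws"]],
            "ad": AD[entry["ad"]],
            "scs": SCS[entry["scs"]],
            "pitch_delta_deg": entry["pitch_delta"] * DELTA_TO_DEGREES,
            "yaw_delta_deg": entry["yaw_delta"] * DELTA_TO_DEGREES,
            "pos": tuple(entry["pos"]),
        })
    return decoded


# The example JSON from section 3, inlined so the sketch runs without the dataset.
sample = json.loads("""
{
  "biome": "plains", "initial_weather": "rain", "start_time": "Sunset",
  "actions": {
    "0": {"ws": 2, "ad": 0, "scs": 3, "pitch": 0.0, "yaw": 0.0,
          "pitch_delta": 0.0, "yaw_delta": 0.0, "pos": [-228.5, 75.0, 246.4]},
    "1": {"ws": 2, "ad": 1, "scs": 3, "pitch": 0.0, "yaw": 0.0,
          "pitch_delta": 0.0, "yaw_delta": 0.0, "pos": [-228.43, 75.0, 246.3]}
  }
}
""")
decoded = decode_actions(sample)
print(decoded[0])

# With the real dataset, one might instead load a file such as:
#   with open("GF-Minecraft/data_2003/metadata/seed_1_part_1.json") as f:
#       decoded = decode_actions(json.load(f))
```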
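
The merge step from section 2 (`cat part_* > data_2003.zip`) relies on a Unix shell. On systems without `cat`, an equivalent could be sketched in Python as below. The function name `merge_parts` and its parameters are hypothetical; the only behaviour taken from the source is concatenating the `part_*` files, in name order, into a single zip file.

```python
import shutil
from pathlib import Path


def merge_parts(parts_dir, output_name="data_2003.zip", pattern="part_*") -> Path:
    """Concatenate split archive parts (part_aa, part_ab, ...) into one file."""
    parts_dir = Path(parts_dir)
    output_path = parts_dir / output_name
    # Sorting by name reproduces the order a shell glob would use.
    parts = sorted(
        p for p in parts_dir.glob(pattern) if p.is_file() and p != output_path
    )
    if not parts:
        raise FileNotFoundError(f"no files matching {pattern!r} in {parts_dir}")
    with open(output_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as chunk:
                # Stream each part so large files are never fully loaded into memory.
                shutil.copyfileobj(chunk, merged)
    return output_path


# Example usage (the path is an assumption based on the layout shown in section 2):
#   merge_parts("GF-Minecraft/data_2003")
```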
Provider:
maas
Created:
2025-09-03