stepfun-ai/GEBench
Source: Hugging Face. Updated 2026-02-25; indexed 2026-05-10.
Download link: https://hf-mirror.com/datasets/stepfun-ai/GEBench
Description:
---
language:
- en
- zh
license: apache-2.0
size_categories:
- n<1K
task_categories:
- text-to-image
- image-text-to-image
- image-to-image
pretty_name: GEBench
tags:
- GUI
- benchmark
- temporal-coherence
- interaction
- image-generation
arxiv: 2602.09007
viewer: false
---
# GEBench: Comprehensive Benchmark for Evaluating Dynamic Interaction and Temporal Coherence in GUI Generation
<div align="center">
[Paper](https://arxiv.org/pdf/2602.09007)
[Project Page](https://stepfun-ai.github.io/GEBench/)
[Dataset](https://huggingface.co/datasets/stepfun-ai/GEBench)
[License](LICENSE)
</div>

## Overview
Recent advances in image generation models enable the prediction of future Graphical User Interface (GUI) states from user instructions. However, existing benchmarks focus primarily on general-domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored.
To address this gap, we introduce **GEBench**, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. **GEBench** comprises **700** carefully curated samples spanning five task categories, covering both **single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization**. To support systematic evaluation, we propose **GE-Score**, a five-dimensional metric that assesses **Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality**.
Extensive evaluation indicates that current models perform well on single-step transitions but struggle with temporal coherence and spatial grounding over longer interaction sequences. Moreover, our findings identify icon interpretation, text rendering, and localization precision as key bottlenecks, and suggest promising directions for future research toward high-fidelity generative GUI environments.
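The five GE-Score dimensions named above can be held in a small record, as in the minimal sketch below. The class name `GEScoreRecord`, the 0-100 scale, and the unweighted mean in `overall()` are assumptions made for illustration; this card does not state how the dimensions are scaled or aggregated.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class GEScoreRecord:
    """Hypothetical container for one sample's scores on the five GE-Score dimensions."""

    goal_achievement: float
    interaction_logic: float
    content_consistency: float
    ui_plausibility: float
    visual_quality: float

    def overall(self) -> float:
        # ASSUMPTION: unweighted mean of the five dimensions.
        # The dataset card does not specify the aggregation rule.
        return mean(getattr(self, f.name) for f in fields(self))


# Illustrative values only, not benchmark results.
record = GEScoreRecord(80.0, 70.0, 60.0, 90.0, 50.0)
print(record.overall())
```

Keeping each dimension as a named field makes it easy to report per-dimension breakdowns alongside any aggregate.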
## 📂 Dataset Structure
The data is organized into five types reflecting different evaluation scenarios:
```
data/
├── 01_single_step/ # Type 1: Single-step interactions
├── 02_mutli_step/ # Type 2: Multi-step interaction trajectories
├── 03_trajectory_text_fictionalapp/ # Type 3: Trajectories for fictional applications
├── 04_trajectory_text_realapp/ # Type 4: Trajectories for real-world applications
└── 05_grounding_data/ # Type 5: Grounding point localization data
```
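As a rough sketch of how this layout might be inspected locally, the snippet below downloads a snapshot with `huggingface_hub.snapshot_download` and counts the files under each type directory. The helper names `download_gebench` and `count_files_per_type` are hypothetical, and the snippet assumes only that the repository contains the `data/` folder shown above; the file formats inside each directory are not described in this card.

```python
from pathlib import Path


def download_gebench(local_dir: str = "GEBench") -> Path:
    """Fetch a dataset snapshot (needs `pip install huggingface_hub` and network access)."""
    # Imported lazily so the counting helper below also works offline.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id="stepfun-ai/GEBench",
        repo_type="dataset",
        local_dir=local_dir,
    )
    return Path(path)


def count_files_per_type(data_root: Path) -> dict[str, int]:
    """Count regular files under each top-level directory of `data_root`."""
    type_dirs = sorted(p for p in data_root.iterdir() if p.is_dir())
    return {
        type_dir.name: sum(1 for f in type_dir.rglob("*") if f.is_file())
        for type_dir in type_dirs
    }


# Example (requires network access):
# root = download_gebench()
# print(count_files_per_type(root / "data"))
```

The counter reads directory names from disk instead of hard-coding them, so it reports whatever type folders the snapshot actually contains.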
## Main Results
### Chinese Subset Results
<table>
<tr style="background-color: #f0f0f0;">
<th>Model</th>
<th>Single-Step</th>
<th>Multi-Step</th>
<th>Fiction-App</th>
<th>Real-App</th>
<th>Grounding</th>
<th>GE-Score</th>
</tr>
<tr>
<td><strong>Nano Banana pro</strong></td>
<td style="background-color: #FFB81C; color: black;"><strong>84.50</strong></td>
<td style="background-color: #FFB81C; color: black;"><strong>68.65</strong></td>
<td style="background-color: #FFB81C; color: black;"><strong>65.75</strong></td>
<td style="background-color: #F5DEB3;">64.35</td>
<td style="background-color: #FFB81C; color: black;"><strong>64.83</strong></td>
<td style="background-color: #FFB81C; color: black;"><strong>69.62</strong></td>
</tr>
<tr>
<td>Nano Banana</td>
<td>64.36</td>
<td>34.16</td>
<td style="background-color: #F5DEB3;">64.82</td>
<td style="background-color: #FFB81C; color: black;"><strong>65.89</strong></td>
<td>54.48</td>
<td>56.74</td>
</tr>
<tr>
<td><strong>GPT-image-1.5</strong></td>
<td style="background-color: #F5DEB3;">83.79</td>
<td style="background-color: #F5DEB3;">56.97</td>
<td>60.11</td>
<td>55.65</td>
<td>53.33</td>
<td style="background-color: #F5DEB3;">63.22</td>
</tr>
<tr>
<td>GPT-image-1.0</td>
<td>64.72</td>
<td>49.20</td>
<td>57.31</td>
<td>59.04</td>
<td>31.68</td>
<td>52.39</td>
</tr>
<tr>
<td>Seedream 4.5</td>
<td>63.64</td>
<td>53.11</td>
<td>56.48</td>
<td>53.44</td>
<td>52.90</td>
<td>55.91</td>
</tr>
<tr>
<td>Seedream 4.0</td>
<td>62.04</td>
<td>48.64</td>
<td>49.28</td>
<td>50.93</td>
<td>53.53</td>
<td>52.88</td>
</tr>
<tr>
<td>Wan 2.6</td>
<td>64.20</td>
<td>50.11</td>
<td>52.72</td>
<td>50.40</td>
<td style="background-color: #F5DEB3;">59.58</td>
<td>55.40</td>
</tr>
<tr>
<td>Flux-2-pro</td>
<td>68.83</td>
<td>55.07</td>
<td>58.13</td>
<td>55.41</td>
<td>50.24</td>
<td>57.54</td>
</tr>
<tr>
<td>Bagel</td>
<td>34.84</td>
<td>13.45</td>
<td>27.36</td>
<td>33.52</td>
<td>35.10</td>
<td>28.85</td>
</tr>
<tr>
<td>UniWorld-V2</td>
<td>55.33</td>
<td>24.95</td>
<td>32.03</td>
<td>21.39</td>
<td>49.60</td>
<td>36.66</td>
</tr>
<tr>
<td>Qwen-Image-Edit</td>
<td>41.12</td>
<td>26.79</td>
<td>23.78</td>
<td>26.10</td>
<td>50.80</td>
<td>33.72</td>
</tr>
<tr>
<td>Longcat-Image</td>
<td>48.76</td>
<td>12.75</td>
<td>30.03</td>
<td>30.00</td>
<td>51.02</td>
<td>34.51</td>
</tr>
</table>
### English Subset Results
<table>
<tr style="background-color: #f0f0f0;">
<th>Model</th>
<th>Single-Step</th>
<th>Multi-Step</th>
<th>Fiction-App</th>
<th>Real-App</th>
<th>Grounding</th>
<th>GE-Score</th>
</tr>
<tr>
<td><strong>Nano Banana pro</strong></td>
<td style="background-color: #FFB81C; color: black;"><strong>84.32</strong></td>
<td style="background-color: #FFB81C; color: black;"><strong>69.51</strong></td>
<td>46.33</td>
<td>47.20</td>
<td style="background-color: #FFB81C; color: black;"><strong>58.64</strong></td>
<td style="background-color: #F5DEB3;">61.20</td>
</tr>
<tr>
<td>Nano Banana</td>
<td>64.80</td>
<td>50.75</td>
<td>48.88</td>
<td>47.12</td>
<td>49.04</td>
<td>52.12</td>
</tr>
<tr>
<td><strong>GPT-image-1.5</strong></td>
<td style="background-color: #F5DEB3;">80.80</td>
<td>58.87</td>
<td style="background-color: #FFB81C; color: black;"><strong>63.68</strong></td>
<td style="background-color: #FFB81C; color: black;"><strong>58.93</strong></td>
<td>49.23</td>
<td style="background-color: #FFB81C; color: black;"><strong>63.16</strong></td>
</tr>
<tr>
<td>GPT-image-1.0</td>
<td>60.92</td>
<td style="background-color: #F5DEB3;">64.33</td>
<td style="background-color: #F5DEB3;">58.94</td>
<td style="background-color: #F5DEB3;">56.16</td>
<td>37.84</td>
<td>55.64</td>
</tr>
<tr>
<td>Seedream 4.5</td>
<td>49.49</td>
<td>45.30</td>
<td>53.81</td>
<td>51.80</td>
<td>49.63</td>
<td>50.01</td>
</tr>
<tr>
<td>Seedream 4.0</td>
<td>53.28</td>
<td>37.57</td>
<td>47.92</td>
<td>49.36</td>
<td>44.17</td>
<td>46.46</td>
</tr>
<tr>
<td>Wan 2.6</td>
<td>60.17</td>
<td>44.36</td>
<td>49.55</td>
<td>44.80</td>
<td>53.36</td>
<td>50.45</td>
</tr>
<tr>
<td>Flux-2-pro</td>
<td>61.00</td>
<td>52.17</td>
<td>49.92</td>
<td>47.16</td>
<td>45.67</td>
<td>51.18</td>
</tr>
<tr>
<td>Bagel</td>
<td>32.91</td>
<td>8.61</td>
<td>26.08</td>
<td>35.12</td>
<td>37.30</td>
<td>28.00</td>
</tr>
<tr>
<td>UniWorld-V2</td>
<td>42.68</td>
<td>14.14</td>
<td>30.08</td>
<td>26.83</td>
<td>47.04</td>
<td>32.15</td>
</tr>
<tr>
<td>Qwen-Image-Edit</td>
<td>40.12</td>
<td>18.61</td>
<td>25.80</td>
<td>25.95</td>
<td style="background-color: #F5DEB3;">54.55</td>
<td>33.01</td>
</tr>
<tr>
<td>Longcat-Image</td>
<td>36.69</td>
<td>8.44</td>
<td>37.30</td>
<td>36.83</td>
<td>47.12</td>
<td>33.28</td>
</tr>
</table>
**Legend:** <span style="background-color: #FFB81C; padding: 2px 6px;">Orange (🥇 Top 1)</span> and <span style="background-color: #F5DEB3; padding: 2px 6px;">Champagne (🥈 Top 2)</span> mark the best and second-best score in each column.
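The highlighting can be reproduced for any column with a simple ranking, sketched below. The helper `top_k` is a hypothetical name; the scores are the Single-Step column of the Chinese subset, copied from the table above.

```python
def top_k(scores: dict[str, float], k: int = 2) -> list[tuple[str, float]]:
    """Return the k highest-scoring (model, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Single-Step column, Chinese subset (values copied from the results table).
single_step_zh = {
    "Nano Banana pro": 84.50,
    "Nano Banana": 64.36,
    "GPT-image-1.5": 83.79,
    "GPT-image-1.0": 64.72,
    "Seedream 4.5": 63.64,
    "Seedream 4.0": 62.04,
    "Wan 2.6": 64.20,
    "Flux-2-pro": 68.83,
    "Bagel": 34.84,
    "UniWorld-V2": 55.33,
    "Qwen-Image-Edit": 41.12,
    "Longcat-Image": 48.76,
}

best, runner_up = top_k(single_step_zh)
print(f"Top 1: {best[0]} ({best[1]:.2f})")
print(f"Top 2: {runner_up[0]} ({runner_up[1]:.2f})")
```

For this column the ranking matches the table's highlighting: Nano Banana pro first, GPT-image-1.5 second.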
## Citation
If you find GEBench useful for your research, please cite:
```bibtex
@article{li2026gebench,
title={GEBench: Benchmarking Image Generation Models as GUI Environments},
author={Haodong Li and Jingwei Wu and Quan Sun and Guopeng Li and Juanxi Tian and Huanyu Zhang and Yanlin Lai and Ruichuan An and Hongbo Peng and Yuhong Dai and Chenxi Li and Chunmei Qing and Jia Wang and Ziyang Meng and Zheng Ge and Xiangyu Zhang and Daxin Jiang},
journal={arXiv preprint arXiv:2602.09007},
year={2026}
}
```
Provided by: stepfun-ai