ScreenSpot
Source: ModelScope community (updated 2025-12-05, indexed 2025-11-03)
Download link: https://modelscope.cn/datasets/rootsautomation/ScreenSpot
# Dataset Card for ScreenSpot
GUI Grounding Benchmark: ScreenSpot.
Created by researchers at Nanjing University and Shanghai AI Laboratory to evaluate large multimodal models (LMMs) on GUI grounding: locating an element on a screen given a text-based instruction.
## Dataset Details
### Dataset Description
ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1200 instructions from iOS, Android, macOS, Windows and Web environments, along with annotated element types (Text or Icon/Widget).
See details and more examples in the paper.
- **Curated by:** NJU, Shanghai AI Lab
- **Language(s) (NLP):** EN
- **License:** Apache 2.0
### Dataset Sources
- **Repository:** [GitHub](https://github.com/njucckevin/SeeClick)
- **Paper:** [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)
## Uses
This is a benchmark dataset and is not intended for training. It is used for zero-shot evaluation of a multimodal model's ability to ground a text instruction to a local region of a screen.
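
The card does not define a scoring rule. One common way to score grounding on a benchmark like this is click accuracy: a prediction counts as correct when the predicted point falls inside the target element's bounding box. The sketch below assumes the model outputs an `(x, y)` point in the same coordinate units as `bbox`; the function names `point_in_bbox` and `click_accuracy` are illustrative and not part of the dataset or the SeeClick code.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Sequence[float]  # (x1, y1, x2, y2), as stored in the `bbox` field


def point_in_bbox(point: Point, bbox: Box) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def click_accuracy(predictions: Sequence[Point], bboxes: Sequence[Box]) -> float:
    """Fraction of predicted points that land inside their target box.

    Assumption: predictions[i] corresponds to bboxes[i].
    """
    if len(predictions) != len(bboxes):
        raise ValueError("predictions and bboxes must have the same length")
    if not predictions:
        return 0.0
    hits = sum(point_in_bbox(p, b) for p, b in zip(predictions, bboxes))
    return hits / len(predictions)


# Made-up values for illustration only: the first point hits, the second misses.
example_score = click_accuracy(
    [(5, 5), (50, 50)],
    [(0, 0, 10, 10), (0, 0, 10, 10)],
)
```

Treating points on the box edge as hits is a design choice made here for simplicity; a stricter evaluation could exclude the boundary.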
## Dataset Structure
Each test sample contains:
- `image`: Raw pixels of the screenshot
- `file_name`: the interface screenshot filename
- `instruction`: human instruction to prompt localization
- `bbox`: the bounding box of the target element corresponding to the instruction. The original dataset stored this as a 4-tuple of (top-left x, top-left y, width, height); here it is converted to (top-left x, top-left y, bottom-right x, bottom-right y) for compatibility with other datasets.
- `data_type`: "icon"/"text", indicates the type of the target element
- `data_source`: interface platform, including iOS, Android, macOS, Windows and Web (Gitlab, Shop, Forum and Tool)
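
The `bbox` conversion described above is simple arithmetic. A minimal sketch, assuming plain numeric tuples, is shown below. The helper names `xywh_to_xyxy` and `xyxy_to_xywh` are hypothetical and do not come from the dataset or the SeeClick repository, and the sample values are made up.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]


def xywh_to_xyxy(bbox: Sequence[float]) -> Box:
    """(top-left x, top-left y, width, height) -> (x1, y1, x2, y2)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def xyxy_to_xywh(bbox: Sequence[float]) -> Box:
    """(x1, y1, x2, y2) -> (top-left x, top-left y, width, height)."""
    x1, y1, x2, y2 = bbox
    return (x1, y1, x2 - x1, y2 - y1)


# Made-up box in the original (x, y, width, height) layout.
original = (10, 20, 30, 40)
converted = xywh_to_xyxy(original)   # layout used by the `bbox` field here
restored = xyxy_to_xywh(converted)   # back to the original layout
```

The inverse helper is useful when comparing against tools or annotations that still expect the width/height layout.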
## Dataset Creation
### Curation Rationale
This dataset was created to benchmark multimodal models on screens.
Specifically, to assess a model's ability to translate text into a local reference within the image.
### Source Data
Screenshot data spans desktop screens (Windows, macOS), mobile screens (iPhone, iPad, Android), and web screens.
#### Data Collection and Processing
Screenshots were selected by annotators based on their typical daily usage of their devices.
After collecting a screen, annotators provided annotations for important clickable regions.
Finally, annotators wrote an instruction to prompt a model to interact with a particular annotated element.
#### Who are the source data producers?
PhD and Master's students in Computer Science at NJU.
All are proficient in the usage of both mobile and desktop devices.
## Citation
**BibTeX:**
```
@misc{cheng2024seeclick,
title={SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents},
author={Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu},
year={2024},
eprint={2401.10935},
archivePrefix={arXiv},
primaryClass={cs.HC}
}
```
Provider: maas
Created: 2025-10-14



