ScreenSpot

Name: ScreenSpot
Creator: maas
Published: 2025-12-05 16:54:45
License: no description provided

ModelScope community; updated 2025-12-05; indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/rootsautomation/ScreenSpot


Resource description:

# Dataset Card for ScreenSpot

GUI Grounding Benchmark: ScreenSpot.

Created by researchers at Nanjing University and Shanghai AI Laboratory to evaluate large multimodal models (LMMs) on GUI grounding: locating an element on a screen given a text-based instruction.

## Dataset Details

### Dataset Description

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1200 instructions from iOS, Android, macOS, Windows and Web environments, along with annotated element types (Text or Icon/Widget). See details and more examples in the paper.

- **Curated by:** NJU, Shanghai AI Lab
- **Language(s) (NLP):** EN
- **License:** Apache 2.0

### Dataset Sources

- **Repository:** [GitHub](https://github.com/njucckevin/SeeClick)
- **Paper:** [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)

## Uses

This is a benchmarking dataset. It is not used for training. It is intended for zero-shot evaluation of a multimodal model's ability to ground a text instruction to a local region of a screen.

## Dataset Structure

Each test sample contains:

- `image`: raw pixels of the screenshot
- `file_name`: the interface screenshot filename
- `instruction`: human instruction to prompt localization
- `bbox`: the bounding box of the target element corresponding to the instruction. The original dataset stores this as a 4-tuple of (top-left x, top-left y, width, height); here it is converted to (top-left x, top-left y, bottom-right x, bottom-right y) for compatibility with other datasets.
- `data_type`: "icon"/"text", indicates the type of the target element
- `data_source`: interface platform, including iOS, Android, macOS, Windows and Web (Gitlab, Shop, Forum and Tool)

## Dataset Creation

### Curation Rationale

This dataset was created to benchmark multimodal models on screens, specifically to assess a model's ability to translate text into a local reference within the image.
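The `bbox` conversion described under Dataset Structure, and one way a predicted location might be scored against it, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the helper names `xywh_to_xyxy`, `point_in_box` and `click_accuracy` are hypothetical, and the "predicted point falls inside the target box" rule is an assumed scoring criterion that this card does not itself specify. The functions are unit-agnostic, so they work the same whether coordinates are pixels or normalized values.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def xywh_to_xyxy(box: Sequence[float]) -> Box:
    """Convert (top-left x, top-left y, width, height)
    to (top-left x, top-left y, bottom-right x, bottom-right y)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if an (x, y) point lies inside an (x1, y1, x2, y2) box.
    Points on the box edge count as inside (an assumption of this sketch)."""
    px, py = point
    x1, y1, x2, y2 = box
    return x1 <= px <= x2 and y1 <= py <= y2


def click_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predicted points that land inside their target boxes."""
    if not boxes:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(points, boxes))
    return hits / len(boxes)


# Hypothetical values for illustration only; not taken from the dataset.
target = xywh_to_xyxy((10, 20, 30, 40))
predictions = [(25, 30), (5, 30)]
score = click_accuracy(predictions, [target, target])
```

Splitting by the `data_type` and `data_source` fields before calling `click_accuracy` would give per-element-type and per-platform scores.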
### Source Data

Screenshot data spanning desktop screens (Windows, macOS), mobile screens (iPhone, iPad, Android), and web screens.

#### Data Collection and Processing

Screenshots were selected by annotators based on their typical daily usage of their devices. After collecting a screen, annotators annotated its important clickable regions. Finally, annotators wrote an instruction prompting a model to interact with a particular annotated element.

#### Who are the source data producers?

PhD and Master's students in Computer Science at NJU. All are proficient in the use of both mobile and desktop devices.

## Citation

**BibTeX:**

```
@misc{cheng2024seeclick,
  title={SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents},
  author={Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu},
  year={2024},
  eprint={2401.10935},
  archivePrefix={arXiv},
  primaryClass={cs.HC}
}
```


Provider: maas

Created: 2025-10-14
