
ScreenSpot

ModelScope Community | Updated 2025-12-05 | Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/rootsautomation/ScreenSpot
Resource description:
# Dataset Card for ScreenSpot

GUI Grounding Benchmark: ScreenSpot.

Created by researchers at Nanjing University and Shanghai AI Laboratory to evaluate large multimodal models (LMMs) on GUI grounding tasks: locating an element on a screen given a text-based instruction.

## Dataset Details

### Dataset Description

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1200 instructions from iOS, Android, macOS, Windows and Web environments, along with annotated element types (Text or Icon/Widget). See details and more examples in the paper.

- **Curated by:** NJU, Shanghai AI Lab
- **Language(s) (NLP):** EN
- **License:** Apache 2.0

### Dataset Sources

- **Repository:** [GitHub](https://github.com/njucckevin/SeeClick)
- **Paper:** [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)

## Uses

This is a benchmarking dataset. It is not used for training; it is used for zero-shot evaluation of a multimodal model's ability to ground an instruction to a local region of a screen.

## Dataset Structure

Each test sample contains:

- `image`: raw pixels of the screenshot
- `file_name`: the interface screenshot filename
- `instruction`: human instruction to prompt localization
- `bbox`: the bounding box of the target element corresponding to the instruction. The original dataset stores this as a 4-tuple (top-left x, top-left y, width, height); this version converts it to (top-left x, top-left y, bottom-right x, bottom-right y) for compatibility with other datasets.
- `data_type`: "icon"/"text", indicates the type of the target element
- `data_source`: interface platform, including iOS, Android, macOS, Windows and Web (Gitlab, Shop, Forum and Tool)

## Dataset Creation

### Curation Rationale

This dataset was created to benchmark multimodal models on screens, specifically to assess a model's ability to translate text into a local reference within the image.
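To make the `bbox` convention under Dataset Structure concrete, here is a minimal Python sketch of the (x, y, width, height) to (x1, y1, x2, y2) conversion, together with one plausible way to score a grounding prediction: checking whether a predicted click point falls inside the target box. The function names `xywh_to_xyxy`, `point_in_box` and `click_accuracy` are hypothetical helpers written for this illustration, not part of the dataset or the SeeClick repository, and the scoring rule is an assumption rather than the card's stated protocol. The sketch only assumes that points and boxes share one coordinate system (pixels or normalized values).

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]


def xywh_to_xyxy(box: Sequence[float]) -> Box:
    """Convert (top-left x, top-left y, width, height) to
    (top-left x, top-left y, bottom-right x, bottom-right y)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def point_in_box(point: Tuple[float, float], box: Sequence[float]) -> bool:
    """Return True if `point` lies inside an (x1, y1, x2, y2) box, edges included."""
    px, py = point
    x1, y1, x2, y2 = box
    return x1 <= px <= x2 and y1 <= py <= y2


def click_accuracy(points: Sequence[Tuple[float, float]],
                   boxes: Sequence[Sequence[float]]) -> float:
    """Fraction of predicted points that land inside their target boxes.

    Assumption: a prediction counts as correct when the point is inside the box.
    """
    if not boxes:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(points, boxes))
    return hits / len(boxes)


# Illustrative values only, not taken from the dataset.
target = xywh_to_xyxy([10, 20, 30, 40])
print(target)
print(click_accuracy([(25, 30), (0, 0)], [target, target]))
```

Since the `bbox` field in this version is already in (x1, y1, x2, y2) form, the conversion helper would only be needed when working from the original annotations.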
### Source Data

Screenshot data spanning desktop screens (Windows, macOS), mobile screens (iPhone, iPad, Android), and web screens.

#### Data Collection and Processing

Screenshots were selected by annotators based on their typical daily usage of their devices. After collecting a screen, annotators annotated its important clickable regions. Finally, annotators wrote an instruction prompting a model to interact with a particular annotated element.

#### Who are the source data producers?

PhD and Master's students in Computer Science at NJU. All are proficient in the usage of both mobile and desktop devices.

## Citation

**BibTeX:**

```
@misc{cheng2024seeclick,
  title={SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents},
  author={Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu},
  year={2024},
  eprint={2401.10935},
  archivePrefix={arXiv},
  primaryClass={cs.HC}
}
```

Provider:
maas
Created:
2025-10-14