WebClick

Name: WebClick
Creator: maas
Published: 2026-01-06 16:34:39
License: No description available

ModelScope Community. Updated: 2026-01-06. Indexed: 2025-06-07.

Download link:

https://modelscope.cn/datasets/Hcompany/WebClick


Resource description:

# WebClick: A Multimodal Localization Benchmark for Web-Navigation Models

We introduce WebClick, a high-quality benchmark dataset for evaluating the navigation and localization capabilities of multimodal models and agents in Web environments. WebClick features 1,639 English-language web screenshots from over 100 websites, paired with precisely annotated natural-language instructions and pixel-level click targets, in the same format as the widely used Screenspot benchmark.

## Design Goals and Use Case

WebClick is designed to measure and advance the ability of AI systems to understand web interfaces, interpret user instructions, and take accurate actions within digital environments. The dataset contains three distinct groups of web screenshots that capture a range of real-world navigation scenarios, from agent-based web retrieval to human tasks like online shopping and calendar management.

On a more technical level, this benchmark is intended for assessing multimodal models on their ability to navigate web interfaces, evaluating AI agents' understanding of UI elements and their functions, and testing models' abilities to ground natural-language instructions to specific interactive elements.

Project page: https://www.surferh.com

## Dataset Structure

The dataset contains 1,639 samples divided into three key groups:

1. **`agentbrowse` (36%)**: Pages encountered by the SurferH agent while solving web retrieval tasks from [WebVoyager](https://arxiv.org/abs/2401.13919)
2. **`humanbrowse` (31.8%)**: Pages and elements interacted with by humans performing everyday tasks (e-shopping, trip planning, personal organization)
3. **`calendars` (32.2%)**: A specialized subset focusing on calendar interfaces, a known challenge for UI understanding models

Each sample consists of:

- **`image`**: A screenshot of a web page
- **`instruction`**: A natural-language instruction describing the desired action
- **`bbox`**: Coordinates of the bounding box (relative to the image dimensions) that identify the correct click target, such as an input field or a button
- **`bucket`**: The group this row belongs to, one of `agentbrowse`, `humanbrowse`, or `calendars`

The dataset includes several challenging scenarios:

- Disambiguation between similar elements (e.g., "the login button in the middle", "the login button in the top-right")
- Cases where OCR is insufficient because the visible text isn't the interactive element
- Navigation requiring understanding of relative spatial relationships between information and interaction points

## Dataset Creation: High-Quality Annotations and NLP Instructions

A key strength of this benchmark is its meticulous annotation: all bounding boxes correspond precisely to HTML element boundaries, ensuring rigorous evaluation of model performance. Each screenshot is paired with natural-language instructions that simulate realistic navigation requests, requiring models not only to understand UI elements but also to interpret contextual relationships between visual elements.

### Curation Rationale

WebClick focuses on realism by capturing authentic interactions: actions taken by humans and agents. The records of WebClick are English-language, desktop-size screenshots of 100+ websites. Each record points to an element outlined by a rectangular bounding box and an intent corresponding to it. In particular, the dataset focuses on providing bounding boxes and intents that are not ambiguous, which makes the evaluation of a VLM on this data more trustworthy.

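To make the sample format concrete, here is a minimal sketch of how a predicted click could be scored against a sample's `bbox` and then aggregated per `bucket`. It rests on assumptions that the card does not confirm: that `bbox` is ordered `(x_min, y_min, x_max, y_max)` with each value normalized to `[0, 1]`, and that a prediction counts as correct when the click point falls inside the box. The helper names `click_hits_bbox` and `accuracy_by_bucket` are hypothetical and not part of any WebClick tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

# Assumed layout (not confirmed by the dataset card):
# (x_min, y_min, x_max, y_max), each value relative to image size, in [0, 1].
BBox = Sequence[float]


def click_hits_bbox(
    click_xy: Tuple[float, float],
    image_size: Tuple[int, int],
    bbox: BBox,
) -> bool:
    """Return True if a pixel-space click lies inside a relative bounding box."""
    width, height = image_size
    x, y = click_xy
    x_min, y_min, x_max, y_max = bbox
    # Scale the relative box to pixels before comparing with the click.
    inside_x = x_min * width <= x <= x_max * width
    inside_y = y_min * height <= y <= y_max * height
    return inside_x and inside_y


def accuracy_by_bucket(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (bucket, hit) pairs into a per-bucket fraction of hits."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for bucket, hit in results:
        totals[bucket] += 1
        hits[bucket] += int(hit)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


if __name__ == "__main__":
    # Hypothetical sample: a 1280x720 screenshot from the `calendars` bucket.
    bbox = (0.25, 0.5, 0.5, 0.75)
    hit = click_hits_bbox((400, 450), (1280, 720), bbox)
    print(hit)
    print(accuracy_by_bucket([("calendars", hit), ("calendars", False)]))
```

Reporting the score per bucket rather than as a single number keeps the three groups comparable, since a model may behave very differently on `calendars` than on `agentbrowse` or `humanbrowse`. If the released `bbox` values turn out to be in pixels rather than relative units, the scaling step should be dropped.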
### Challenging Examples for UI Element Selection

With this new benchmark, H Company aims to unlock new capabilities in VLMs and stimulate the progress of web agents.

[comment]: # (Link to presentation with images https://docs.google.com/presentation/d/1NQGq75Ao_r-4GF8WCyK0BRPCdvkjzxIE2xP9ttV5UcM/edit#slide=id.g358e1dac3df_0_60)

Our dataset includes examples that go beyond standard object detection or OCR, requiring genuine **UI understanding** and **instruction-based visual reasoning**. These examples highlight failure points in current models and test capabilities critical for real-world interaction with user interfaces, demonstrating H Company's commitment to creating targeted benchmarks around challenging areas.

### Key Challenges Captured in the Benchmark

- **UI Understanding**
  Tasks require comprehension of common UI conventions (e.g., icons, labels, layout). For instance, identifying the correct user settings button may involve recognizing a gear icon, and adding a specific product to a cart might require interpreting both imagery and adjacent labels. State-of-the-art models often fail at such tasks due to a lack of contextual or semantic UI awareness.

- **Instruction-Based Disambiguation**
  Some instructions describe objects based on spatial position, appearance, or intent (e.g., "middle of the page", "green button"). Solving these tasks requires combining the textual instruction with visual reasoning, a challenge most models do not yet handle robustly.

- **Calendar Navigation**
  Even frontier models struggle to interact with calendar widgets. Understanding which dates are available (e.g., not grayed out or marked unavailable) is a frequent failure case, demonstrating gaps in dynamic UI interpretation.

- **Format and Location Sensitivity**
  Instructions that rely on regional formats, such as time ("18:45") or date representations, test the model's resilience to location-specific variations. Models trained on culturally homogeneous data often perform poorly here.

### Example Tasks

| **Category** | **Instruction** | **Image** |
|------------------------|------------------------------------------------|-----------|
| UI Understanding | Access user account settings | ![Access user account settings](./examples/Access%20user%20account%20settings.png) |
| UI Understanding | Add Insignia cable to cart | ![Add Insignia cable to cart](./examples/Add%20Insignia%20cable%20to%20cart.png) |
| UI Understanding | Pick the first available date | ![Pick the first available date](./examples/Pick%20the%20first%20available%20date.png) |
| Format Understanding | Choose 18:45 | ![Choose 18:45](./examples/Choose%2018_45.png) |
| UI Disambiguation | Green button to create a travel alert | ![Green Button to create a travel alert](./examples/Green%20Button%20to%20create%20a%20travel%20alert.png) |
| UI Disambiguation | Log in button (middle of the page) | ![log in button (middle of the page)](./examples/log%20in%20button%20(middle%20of%20the%20page).png) |
| UI Disambiguation | Select fifth image in gallery | ![Select fifth image in gallery](./examples/Select%20fifth%20image%20in%20gallery.png) |
| Calendar Understanding | Select Aug 7th | ![Select aug 7th](./examples/Select%20aug%207th.png) |

# Results of Popular Models

To put our benchmark into context, we evaluate a set of popular pre-trained models on WebClick alongside the Screenspot [1] and Screenspot V2 [2] benchmarks. In the table below, every model scores lower on WebClick than on either Screenspot benchmark, which suggests that WebClick is the more challenging task. We also find that WebClick provides a better signal for downstream performance in agentic applications of the models.

| **Model** | **WebClick (ours)** | Screenspot | Screenspot V2 |
|-------------------------------|----------------------------|------------|---------------|
| osunlp/UGround-V1-2B [3] | 71.69% | 77.12% | 79.31% |
| osunlp/UGround-V1-7B [3] | 82.37% | 85.69% | 84.26% |
| Qwen/Qwen2.5-VL-3B-Instruct [4] | 71.15% | 82.78% | 84.34% |
| Qwen/Qwen2.5-VL-7B-Instruct [4] | 74.37% | 85.53% | 88.04% |
| ByteDance-Seed/UI-TARS-2B-SFT [5] | 64.23% | 66.82% | 69.39% |
| ByteDance-Seed/UI-TARS-7B-DPO [5] | 80.67% | 84.20% | 86.70% |
| Holo1-3B | 81.50% | 86.01% | 87.33% |
| Holo1-7B | 84.03% | 87.42% | 89.85% |

### Annotations

Annotations were created by UI experts with specialized knowledge of web interfaces. Each screenshot was paired with a natural language instruction describing an intended action, and a bounding box precisely matching HTML element boundaries. All labels were hand-written or hand-reviewed. Instructions were rewritten when needed to only contain non-ambiguous intents rather than visual descriptions. Screenshots were manually reviewed to avoid any personal information, with any identifiable data removed or anonymized.

### Licence

- **Curated by:** H Company
- **Language:** English
- **License:** Apache 2.0

### Dataset Sources

- **Paper:** https://arxiv.org/abs/2506.02865

## Citation

[1] SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug. 2024.

[2] OS-ATLAS: A Foundation Action Model for Generalist GUI Agents. Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao. arXiv preprint arXiv:2410.23218 (2024).

[3] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su. The Thirteenth International Conference on Learning Representations (2025).

[4] Qwen2.5-VL Technical Report. Qwen Team. arXiv preprint arXiv:2502.13923 (2025).

[5] UI-TARS: Pioneering Automated GUI Interaction with Native Agents. Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi. arXiv preprint arXiv:2501.12326 (2025).

**BibTeX:**

```
@dataset{hcompany2025uinavigate,
  author = {H Company Research Team},
  title = {WebClick: A Multimodal Localization Benchmark for Web-Navigation Models},
  year = {2025},
  publisher = {H Company},
}

@misc{andreux2025surferhmeetsholo1costefficient,
  title={Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights},
  author={Mathieu Andreux and Breno Baldas Skuk and Hamza Benchekroun and Emilien Biré and Antoine Bonnet and Riaz Bordie and Matthias Brunel and Pierre-Louis Cedoz and Antoine Chassang and Mickaël Chen and Alexandra D. Constantinou and Antoine d'Andigné and Hubert de La Jonquière and Aurélien Delfosse and Ludovic Denoyer and Alexis Deprez and Augustin Derupti and Michael Eickenberg and Mathïs Federico and Charles Kantor and Xavier Koegler and Yann Labbé and Matthew C. H. Lee and Erwan Le Jumeau de Kergaradec and Amir Mahla and Avshalom Manevich and Adrien Maret and Charles Masson and Rafaël Maurin and Arturo Mena and Philippe Modard and Axel Moyal and Axel Nguyen Kerbel and Julien Revelle and Mats L. Richter and María Santos and Laurent Sifre and Maxime Theillard and Marc Thibault and Louis Thiry and Léo Tronchon and Nicolas Usunier and Tony Wu},
  year={2025},
  eprint={2506.02865},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.02865},
}
```

## Dataset Card Contact

research@hcompany.ai


Provider:

maas

Created:

2025-06-04
