inclusionAI/VenusBench-GD
Source: Hugging Face. Updated: 2025-12-20. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/inclusionAI/VenusBench-GD
Description:
VenusBench-GD is a comprehensive, multi-platform GUI grounding benchmark designed to address the limitations of existing benchmarks in data volume, domain coverage, and platform diversity. It combines large-scale, cross-platform data coverage with a high-quality data construction pipeline, and it supports hierarchical evaluation aimed at real-world applications. The dataset comprises six distinct subtasks, divided into basic and advanced categories, that evaluate models from complementary perspectives. Experimental results show that general-purpose multimodal models now match or surpass specialized GUI models on basic tasks. Advanced tasks still favor specialized models, although those models exhibit significant overfitting and poor robustness. These findings underscore the need for comprehensive, multi-tiered evaluation frameworks such as VenusBench-GD to guide future progress in GUI agent development.
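As a rough illustration of what hierarchical evaluation over basic and advanced subtasks could look like, the sketch below scores predicted click points against target bounding boxes and reports accuracy per subtask and per tier. Everything in it is an assumption made for illustration: the record fields (`tier`, `subtask`, `pred`, `box`), the placeholder subtask names, and the point-in-box hit criterion are hypothetical and are not taken from the VenusBench-GD schema or its official evaluation protocol.

```python
from collections import defaultdict


def point_in_box(point, box):
    """Return True if (x, y) lies inside box = (left, top, right, bottom)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def hierarchical_accuracy(records):
    """Compute accuracy per subtask and per tier.

    Each record is a dict with hypothetical keys:
      tier    -- "basic" or "advanced"
      subtask -- placeholder subtask name
      pred    -- predicted click point (x, y)
      box     -- ground-truth box (left, top, right, bottom)
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        hit = point_in_box(record["pred"], record["box"])
        # Count the same sample once at the subtask level and once at the tier level.
        for key in (("subtask", record["subtask"]), ("tier", record["tier"])):
            totals[key] += 1
            hits[key] += int(hit)
    return {key: hits[key] / totals[key] for key in totals}


# Toy records with made-up values, only to exercise the functions.
records = [
    {"tier": "basic", "subtask": "subtask_a", "pred": (50, 40), "box": (10, 10, 100, 60)},
    {"tier": "basic", "subtask": "subtask_a", "pred": (5, 5), "box": (10, 10, 100, 60)},
    {"tier": "advanced", "subtask": "subtask_b", "pred": (200, 300), "box": (180, 290, 260, 330)},
]
scores = hierarchical_accuracy(records)
for (level, name), accuracy in sorted(scores.items()):
    print(f"{level:8s} {name:10s} {accuracy:.2f}")
```

Reporting both levels side by side makes it easier to see when a model scores well on a tier overall while failing one of the subtasks inside it.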
Provider:
inclusionAI



