Online-Mind2Web
Online-Mind2Web Benchmark Dataset Overview
Basic Dataset Information
- Name: Online-Mind2Web Benchmark
- Developers: a research team from The Ohio State University and the University of California, Berkeley
- Related links:
  - Blog: https://tiancixue.notion.site/An-Illusion-of-Progress-Assessing-the-Current-State-of-Web-Agents-1ac6cd2b9aac80719cd6f68374aaf4b4?pvs=4
  - Leaderboard: https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderboard
  - Data: https://huggingface.co/datasets/osunlp/Online-Mind2Web
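Tasks and Content
- Number of tasks: 300 diverse tasks
- Number of websites: 136 popular websites
- Domains covered: real-world user tasks across multiple domains, including clothing, food, housing, and transportation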
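Evaluation Method
- Automatic evaluator: an automatic evaluation method based on LLM-as-a-Judge
- Key point identification: identifies the key points required to complete the task from the instruction and task description
- Key screenshot identification: selects the important screenshots from the agent's trajectory
- Outcome judgment: outputs a verdict based on the task description, key points, key screenshots, and action history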
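The three evaluation stages above can be sketched as a small pipeline. This is a minimal illustration only, not the benchmark's actual evaluator: the `LLM` callable, the `Step` record, the prompt wording, the 1-5 relevance scale with its threshold, and the "success"/"failure" reply format are all assumptions made for the example. A real judge would also receive the screenshot images themselves rather than identifiers.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical judge interface: takes a prompt string and returns the model's reply.
LLM = Callable[[str], str]


@dataclass
class Step:
    action: str      # textual description of the agent's action
    screenshot: str  # identifier (e.g. file path) of the screenshot after the action


def identify_key_points(task: str, llm: LLM) -> List[str]:
    """Stage 1: ask the judge for the key points needed to complete the task."""
    prompt = f"List the key points required to complete this task, one per line.\nTask: {task}"
    points = [line.strip().lstrip("-").strip() for line in llm(prompt).splitlines()]
    return [p for p in points if p]


def select_key_screenshots(task: str, key_points: List[str], trajectory: List[Step],
                           llm: LLM, threshold: int = 3) -> List[str]:
    """Stage 2: keep screenshots the judge rates at or above the (assumed) threshold."""
    kept = []
    for step in trajectory:
        prompt = (
            "Rate from 1 to 5 how relevant this screenshot is to the key points. "
            "Reply with a single integer.\n"
            f"Task: {task}\nKey points: {'; '.join(key_points)}\n"
            f"Screenshot: {step.screenshot}"
        )
        try:
            score = int(llm(prompt).strip())
        except ValueError:
            score = 0  # treat unparsable replies as irrelevant
        if score >= threshold:
            kept.append(step.screenshot)
    return kept


def judge_outcome(task: str, key_points: List[str], key_screenshots: List[str],
                  trajectory: List[Step], llm: LLM) -> bool:
    """Stage 3: final verdict from task, key points, key screenshots, and action history."""
    prompt = (
        "Decide whether the agent completed the task. Reply 'success' or 'failure'.\n"
        f"Task: {task}\nKey points: {'; '.join(key_points)}\n"
        f"Key screenshots: {', '.join(key_screenshots)}\n"
        f"Action history: {' -> '.join(s.action for s in trajectory)}"
    )
    return llm(prompt).strip().lower().startswith("success")


def evaluate(task: str, trajectory: List[Step], llm: LLM) -> bool:
    """Run the three stages in order and return the final verdict."""
    key_points = identify_key_points(task, llm)
    key_screenshots = select_key_screenshots(task, key_points, trajectory, llm)
    return judge_outcome(task, key_points, key_screenshots, trajectory, llm)
```

Passing the judge in as a plain callable keeps the sketch independent of any particular model API, so a stub function can stand in for it during testing.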
Environment Setup
- Python version: 3.11
- Install dependencies:

      conda create -n Online_Mind2Web python=3.11
      conda activate Online_Mind2Web
      pip install -r requirements.txt
Evaluation
- Evaluation script:

      bash ./script/eval.sh
Citation

@article{xue2025webagents,
  title = "An Illusion of Progress? Assessing the Current State of Web Agents",
  author = "Xue, Tianci and Qi, Weijian and Shi, Tianneng and Song, Chan Hee and Gou, Boyu and Song, Dawn and Sun, Huan and Su, Yu",
  journal = "OSU NLP Blog",
  year = "2025",
  month = "Mar",
  url = "https://tiancixue.notion.site/An-Illusion-of-Progress-Assessing-the-Current-State-of-Web-Agents-1ac6cd2b9aac80719cd6f68374aaf4b4"
}

@inproceedings{deng2023mind2web,
  author = {Deng, Xiang and Gu, Yu and Zheng, Boyuan and Chen, Shijie and Stevens, Sam and Wang, Boshi and Sun, Huan and Su, Yu},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
  pages = {28091--28114},
  publisher = {Curran Associates, Inc.},
  title = {Mind2Web: Towards a Generalist Agent for the Web},
  url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks.pdf},
  volume = {36},
  year = {2023}
}




