bytedance-research/Web-Bench
Source: Hugging Face | Updated: 2025-05-19 | Indexed: 2025-05-31
Download link:
https://hf-mirror.com/datasets/bytedance-research/Web-Bench
Description:
Web-Bench is a benchmark for evaluating how well large language models perform in real-world Web development. It contains 50 projects designed by engineers with 5-10 years of experience, and each project consists of 20 tasks with sequential dependencies. The tasks implement a project's features in order, simulating a real development workflow. Web-Bench aims to cover the foundational elements of Web development: Web Standards and Web Frameworks. Because of their scale and complexity, each project is a significant challenge for engineers: a senior engineer needs 4-8 hours on average to complete one. On the provided benchmark agent (Web-Agent), the current state-of-the-art model (Claude 3.7 Sonnet) achieves a pass rate of only 25.1%.
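To make the effect of sequential dependencies concrete, here is a minimal Python sketch of how a pass rate could be tallied over such projects. It is an illustration only and not the benchmark's official scoring code: the rule that a project's run stops at its first failing task is an assumption, and the function names and sample data are hypothetical.

```python
from typing import Dict, List

TASKS_PER_PROJECT = 20  # each project has 20 tasks, per the description


def passed_before_first_failure(results: List[bool]) -> int:
    """Count tasks passed before the first failure.

    Assumption (not confirmed by the listing): because later tasks build on
    earlier ones, tasks after the first failure are not credited.
    """
    count = 0
    for ok in results:
        if not ok:
            break
        count += 1
    return count


def pass_rate(project_results: Dict[str, List[bool]]) -> float:
    """Fraction of all tasks credited as passed across the given projects."""
    total = len(project_results) * TASKS_PER_PROJECT
    passed = sum(passed_before_first_failure(r) for r in project_results.values())
    return passed / total if total else 0.0


# Hypothetical results for two projects (not real benchmark data).
sample = {
    "project-a": [True] * 5 + [False] * 15,
    "project-b": [True] * 10 + [False] + [True] * 9,
}
rate = pass_rate(sample)
```

Under this assumed rule, the nine late passes in "project-b" are not credited, which is one way sequential dependencies can keep overall pass rates low.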
Provider:
bytedance-research



