Source: ModelScope community (updated 2026-01-02, indexed 2025-08-30)
Download link:
https://modelscope.cn/datasets/bytedance-research/Web-Bench
# Web-Bench
English | [中文 README](README.zh_CN.md)
## 📖 Overview
**Web-Bench** is a benchmark designed to evaluate the performance of LLMs in real-world Web development. It contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development workflows. Web-Bench was designed to cover the foundational elements of Web development: Web Standards and Web Frameworks. The projects were designed by engineers with 5-10 years of experience, and their scale and complexity make each one a significant challenge: on average, a single project takes a senior engineer 4–8 hours to complete. On our benchmark agent (Web-Agent), the SOTA model (Claude 3.7 Sonnet) achieves only 25.1% Pass@1.
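To make the effect of sequential dependencies concrete, here is a minimal Python sketch of one plausible way to score such a benchmark. It assumes that a task only counts as passed if every earlier task in the same project also passed; this scoring rule, the function names, and the sample results are illustrative assumptions, not the official Web-Bench evaluation code.

```python
from typing import Dict, List


def passed_prefix(results: List[bool]) -> int:
    """Count the tasks passed before the first failure in one project."""
    count = 0
    for ok in results:
        if not ok:
            break  # later tasks depend on this one, so stop counting here
        count += 1
    return count


def pass_rate(projects: Dict[str, List[bool]]) -> float:
    """Fraction of all tasks passed, crediting only the unbroken prefix."""
    total = sum(len(r) for r in projects.values())
    passed = sum(passed_prefix(r) for r in projects.values())
    return passed / total if total else 0.0


# Hypothetical results for two 20-task projects (not real benchmark data).
example = {
    "project-a": [True] * 5 + [False] * 15,
    "project-b": [True] * 10 + [False] + [True] * 9,
}
print(pass_rate(example))  # 15 of 40 tasks credited -> 0.375
```

Under this assumption, the nine later passes in `project-b` earn no credit, which shows why an early failure in a dependent task chain is costly.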
The distribution of the experimental data aligns well with the current code generation capabilities of mainstream LLMs.
<img width="500" alt="pass@1" src="./docs/assets/pass-1.png" />
HumanEval and MBPP have approached saturation, and APPS and EvalPlus are approaching it. The SOTA for Web-Bench is 25.1%, which is lower than the SOTA on the SWE-bench Full and Verified sets. For a benchmark, a lower SOTA is better because it indicates more remaining headroom.
<img width="500" alt="SOTAs" src="./docs/assets/sotas.png" />
## Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks
The dataset was presented in the paper [Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks](https://huggingface.co/papers/2505.07473).
## 🏅 Leaderboard
[Leaderboard](https://huggingface.co/spaces/bytedance-research/Web-Bench-Leaderboard)
## Dataset Structure
An example of a Web-Bench datum is as follows:
```
id: (str) Task id: init | task-n
project: (str) Name of the project the task belongs to
description: (str) Detailed task description
date: (str) Task publish date, used to filter out contaminated models
level: (str) Task level: easy | moderate | challenging
```
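As an illustration of how these fields might be used, the following Python sketch models one datum as a dataclass and keeps only tasks published after a model's training cutoff. The class and function names, the ISO `YYYY-MM-DD` date format, and the sample records are assumptions made for this example; the source does not specify the date format.

```python
import datetime as dt
from dataclasses import dataclass
from typing import List

LEVELS = ("easy", "moderate", "challenging")


@dataclass
class Task:
    id: str           # "init" or "task-n"
    project: str      # project name
    description: str  # task details
    date: str         # publish date; ISO YYYY-MM-DD is assumed here
    level: str        # one of LEVELS

    def __post_init__(self) -> None:
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level!r}")


def published_after(tasks: List[Task], cutoff: dt.date) -> List[Task]:
    """Keep tasks published strictly after a model's training cutoff."""
    return [t for t in tasks if dt.date.fromisoformat(t.date) > cutoff]


# Hypothetical records for illustration only.
tasks = [
    Task("init", "demo", "Set up the page skeleton", "2024-01-15", "easy"),
    Task("task-1", "demo", "Add a navigation bar", "2025-03-01", "moderate"),
]
fresh = published_after(tasks, dt.date(2024, 6, 30))
print([t.id for t in fresh])  # ['task-1']
```

Filtering by publish date in this way is one simple approach to reducing the risk that a model has seen a task during training.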
## 📘 Usage
[GitHub](https://github.com/bytedance/web-bench)
Provider: maas
Created: 2025-08-25



