WebRenderBench

Name: WebRenderBench
Creator: maas
Published: 2025-12-05 17:46:59
License: No description available

Source: ModelScope community. Updated 2025-12-05; indexed 2025-10-11.

Download link:

https://modelscope.cn/datasets/lpc1290/WebRenderBench


Resource description:

<img src="./docs/assets/logo.svg" alt="Logo" width="120" />

<a href="https://github.com/PKU-DAIR"> <img alt="Static Badge" src="https://img.shields.io/badge/%C2%A9-PKU--DAIR-%230e529d?labelColor=%23003985"> </a>

## **WebRenderBench: Enhancing Web Interface Generation through Layout-Style Consistency and Reinforcement Learning**

[Paper](https://arxiv.org/pdf/2510.04097) | [中文](./docs/Chinese.md)

## **🔍 Overview**

**WebRenderBench** is a large-scale benchmark designed to advance **WebUI-to-Code** research for multimodal large language models (MLLMs) through evaluation on real-world webpages. It provides:

* **45,100** real webpages collected from public portal websites
* **High diversity and complexity**, covering a wide range of industries and design styles
* **Novel evaluation metrics** that quantify **layout and style consistency** based on rendered pages
* The **ALISA reinforcement learning framework**, which uses the new metrics as reward signals to optimize generation quality

---

## **🚀 Key Features**

### **Beyond the Limitations of Traditional Benchmarks**

WebRenderBench addresses the core issues of existing WebUI-to-Code benchmarks in data quality and evaluation methodology:

| Aspect | Traditional Benchmarks | Advantages of WebRenderBench |
| :------------------------- | :------------------------- | :------------------------- |
| **Data Quality** | Small-scale, structurally simple, or LLM-synthesized data with limited diversity | Large-scale, real-world, and structurally complex webpages that pose harder challenges |
| **Evaluation Reliability** | Relies on visual APIs (high cost) or code-structure comparison (fails to handle code asymmetry) | Objectively and efficiently evaluates layout and style consistency based on rendered results |
| **Training Effectiveness** | Difficult to optimize on crawled data with asymmetric code structures | Proposed metrics can be used directly as RL reward signals to enhance model optimization |

---

### **Dataset Characteristics**

<img src="./docs/assets/framework.svg" alt="WebRenderBench and ALISA Framework" width="80%" />

Figure 1: Dataset construction pipeline and the ALISA framework

Our dataset is constructed through a systematic process to ensure both **high quality** and **diversity**:

1. **Data Collection**: URLs are obtained from open enterprise portal datasets. A high-concurrency crawler captures 210K webpages along with their static resources.
2. **Data Processing**: MHTML pages are converted into HTML files, and cross-domain resources are processed so that each page can be rendered locally and captured as a full-page screenshot.
3. **Data Cleaning**: Pages with abnormal sizes, rendering errors, or missing styles are filtered out. Multimodal QA models further remove low-quality samples with large blank areas or overlapping elements, yielding 110K valid pages.
4. **Data Categorization**: Pages are categorized by industry and complexity (measured via *Group Count*) to ensure a balanced distribution across difficulty levels and domains.

Finally, we construct a dataset of **45.1K** samples, evenly split into training and test sets.

---

## **🌟 Evaluation Framework**

We propose a novel evaluation protocol based on **rendered webpages**, quantifying model performance along two key dimensions: **layout** and **style consistency**.

---

### **RDA (Relative Layout Difference of Associated Elements)**

**Purpose:** Measures relative layout differences between matched elements.

* **Element Association:** Matches corresponding elements between generated and target pages using text similarity (LCS) and geometric distance.
* **Positional Deviation:** The page is divided into a 3×3 grid, and associated elements are compared by grid cell: if they fall in different cells, the score is 0; otherwise, a deviation-based score is computed.
* **Uniqueness Weighting:** Each element is weighted by its uniqueness (inverse group size), giving higher importance to distinctive components.

---

### **GDA (Group-wise Difference in Element Counts)**

**Purpose:** Measures group-level alignment of axis-aligned elements.

* **Grouping:** Elements aligned on the same horizontal or vertical axis are treated as one group.
* **Count Comparison:** Compares whether corresponding groups in the generated and target pages contain the same number of elements.
* **Uniqueness Weighting:** Weighted by element uniqueness to emphasize key structural alignment.

---

### **SDA (Style Difference of Associated Elements)**

**Purpose:** Evaluates fine-grained style differences between associated elements.

* **Multi-Dimensional Style Extraction:** Measures differences in foreground color, background color, font size, and border radius.
* **Weighted Averaging:** Computes a weighted mean of style similarity scores across all associated elements to obtain an overall style score.

---

## **⚙️ Installation Guide**

### **Core Dependencies**

Coming Soon

---

## **📊 Benchmark Workflow**

### **Directory Structure**

```
|- docs/                    # Documentation
|- scripts/                 # Evaluation scripts
|- web_render_test.jsonl    # Test set metadata
|- web_render_train.jsonl   # Training set metadata
|- test_webpages.zip        # Test set webpages
|- train_webpages.zip       # Training set webpages
|- test_screenshots.zip     # Test set screenshots
|- train_screenshots.zip    # Training set screenshots
```

---

### **Implementation Steps**

1. **Data Preparation**
   * Download the WebRenderBench dataset and extract the webpage and screenshot archives.
   * Each pair consists of a real webpage (HTML + resources) and its rendered screenshot.
2. **Model Inference**
   * Run inference using engines such as **vLLM** or **LLM Deploy**, and save results to the designated directory.
3. **Evaluation**
   * Run `scripts/1_get_evaluation.py`.
   * The script launches a web server to render both generated and target HTML.
   * WebDriver extracts DOM information and computes **RDA**, **GDA**, and **SDA** scores.
   * Results are saved under `save_results/`.
   * Final scores are aggregated via `scripts/2_compute_alisa_scores.py`.
4. **ALISA Training (Optional)**
   * Use `models/train_rl.py` for reinforcement learning fine-tuning. *(Coming Soon)*
   * The computed evaluation scores serve as reward signals to optimize policy models via methods such as **GRPO**.

---

## **📈 Model Performance Insights**

We evaluate **17 multimodal large language models** of varying scales and architectures (both open- and closed-source).

* **Combined Scores of RDA, GDA, and SDA (%)**

![Inference Results](./docs/assets/inference_results.png)

**Key Findings:**

* Overall, larger models achieve higher consistency. **GPT-4.1-mini** and **Qwen-VL-Plus** perform best among closed-source models.
* While most models perform reasonably on simple pages (*Group Count* < 50), **RDA scores drop sharply** as page complexity increases; precise layout alignment remains a major challenge.
* After reinforcement learning via the **ALISA framework**, **Qwen2.5-VL-7B** shows substantial improvements across all complexity levels, even surpassing **GPT-4.1-mini** on simpler cases.

---

## **📅 Future Work**

* [ ] Release pretrained models fine-tuned with the ALISA framework
* [ ] Expand dataset coverage to more industries and dynamic interaction patterns
* [ ] Open-source the complete toolchain for data collection, cleaning, and evaluation

---

## **📜 License**

The **WebRenderBench dataset** is released for **research purposes only**. All accompanying code will be published under the **Apache License 2.0**.

All webpages in the dataset are collected from publicly accessible enterprise portals. To protect privacy, all personal and sensitive information has been removed or modified.
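**Illustrative sketch (evaluation scores as RL rewards).** The optional ALISA step in the workflow uses the computed RDA, GDA, and SDA scores as reward signals for methods such as GRPO. The snippet below is a minimal sketch of that idea, not the released training code (`models/train_rl.py` is still marked as coming soon). It assumes the three scores lie in [0, 1] and are mixed with equal weights, and it applies the group-relative reward normalization commonly associated with GRPO. The function names, the weights, and the choice of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def combined_reward(rda, gda, sda, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Collapse the three consistency scores into one scalar reward.

    Equal weights are an assumption; the source does not say how the
    scores are mixed.
    """
    w_rda, w_gda, w_sda = weights
    return w_rda * rda + w_gda * gda + w_sda * sda


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of samples for the same input.

    Each reward is centred on the group mean and scaled by the group's
    (population) standard deviation; eps avoids division by zero when
    every sample in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical (RDA, GDA, SDA) scores for four generations of one page.
    scores = [(0.9, 0.8, 0.7), (0.4, 0.5, 0.6), (0.2, 0.3, 0.1), (0.6, 0.6, 0.6)]
    rewards = [combined_reward(*s) for s in scores]
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.3f} advantage={adv:+.3f}")
```

Generations that score above their group's mean receive positive advantages, so under this scheme the policy is pushed toward outputs whose rendered layout and style are closer to the target page.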
---

## **📚 Citation**

If you use our dataset or framework in your research, please cite the following paper:

```bibtex
@article{webrenderbench2025,
  title={WebRenderBench: Enhancing Web Interface Generation through Layout-Style Consistency and Reinforcement Learning},
  author={Anonymous Author(s)},
  year={2025},
  journal={arXiv preprint},
}
```
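**Illustrative sketch (RDA-style scoring).** The RDA description in the evaluation framework combines three ingredients: LCS-based text similarity for associating elements, a 3×3 grid check on position, and uniqueness weighting by inverse group size. The snippet below is a minimal sketch of those ingredients under stated assumptions, not the benchmark's `scripts/1_get_evaluation.py`. The `Element` class, the `2 * LCS / (len(a) + len(b))` normalization, and the linear in-cell penalty are hypothetical choices, and element pairs are assumed to be already associated (the geometric-distance part of matching is omitted).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Element:
    text: str
    cx: float            # centre x as a fraction of page width, 0..1
    cy: float            # centre y as a fraction of page height, 0..1
    group_size: int = 1  # number of elements in this element's group


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def text_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 2 * LCS / (len(a) + len(b))."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def grid_cell(e: Element) -> tuple[int, int]:
    """(row, column) of the 3x3 grid cell that contains the centre."""
    return min(int(e.cy * 3), 2), min(int(e.cx * 3), 2)


def position_score(gen: Element, tgt: Element) -> float:
    """0 for different cells; otherwise an assumed linear offset penalty."""
    if grid_cell(gen) != grid_cell(tgt):
        return 0.0
    # Within one cell, each axis offset is below 1/3, so scale by 3.
    deviation = max(abs(gen.cx - tgt.cx), abs(gen.cy - tgt.cy)) * 3
    return 1.0 - deviation


def rda_like_score(pairs: list[tuple[Element, Element]]) -> float:
    """Uniqueness-weighted mean of position scores over (gen, tgt) pairs."""
    weights = [1.0 / tgt.group_size for _, tgt in pairs]
    total = sum(weights)
    if total == 0:
        return 0.0
    scored = sum(w * position_score(g, t) for w, (g, t) in zip(weights, pairs))
    return scored / total
```

For example, a unique header that lands in the correct cell contributes with weight 1, while one of three repeated list items contributes with weight 1/3, so misplacing a distinctive component costs more than misplacing one of many similar elements.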


Provider: maas

Created: 2025-10-07
