
html-eval

ModelScope community, updated 2026-01-07, indexed 2025-11-22
Download link:
https://modelscope.cn/datasets/nex-agi/html-eval
Resource description:
# HTML Generation Evaluation

## Overview

To evaluate our model's HTML generation performance, we constructed an evaluation dataset covering diverse web development scenarios. The dataset includes:

1. **Web Development**: Landing pages, e-commerce sites, responsive layouts, and financial dashboards
2. **3D Scene Design**: WebGL scenes, Three.js applications, and interactive 3D models
3. **Game Design**: Browser-based games, interactive gameplay, and physics-based games
4. **Physics Simulation**: Physics engines, particle systems, and molecular dynamics
5. **Data Visualization**: Charts, graphs, and interactive data displays

## Dataset Composition

The evaluation dataset spans **5 distinct categories** with a total of **45 test cases**, distributed as follows:

| Category | Description | Test Cases |
|----------|-------------|------------|
| **二维网站界面设计 (Web Development)** | Landing pages, e-commerce sites, responsive layouts, financial dashboards | 9 |
| **游戏设计 (Game Design)** | Browser-based games, interactive gameplay, infinite runners, 3D games | 10 |
| **物理模拟 (Physics Simulation)** | Physics engines, particle systems, molecular dynamics, motion simulations | 10 |
| **3D场景设计 (3D Scene Design)** | WebGL scenes, Three.js applications, interactive 3D models | 9 |
| **数据可视化 (Data Visualization)** | Charts, graphs, interactive data displays, dashboard components | 7 |

## Model Performance

Our model (Nex-N1) is competitive across the evaluated scenarios: in pairwise comparisons it wins more often than it loses against four of the five baseline models, the exception being Claude-Sonnet-4.5:

<div align="center">
<img src="./html_result.png" width="70%">
</div>

### Performance Summary

| Model Comparison | Nex-N1 Win | Tie | Nex-N1 Lose |
|-----------------|-----------|-----|------------|
| **vs Claude-Sonnet-4.5** | 26.5% | 16.3% | 57.1% |
| **vs GPT-5** | 53.1% | 16.3% | 30.6% |
| **vs Kimi-K2-Thinking** | 55.1% | 6.1% | 38.8% |
| **vs GLM-4.6** | 61.2% | 8.2% | 30.6% |
| **vs MiniMax-M2** | 69.4% | 8.2% | 22.4% |

## Repository Structure

### Data Files

- **`html_evaluation.xlsx`**: Complete dataset with task IDs, categories, and instructions for all test cases
- **`html/`**: Generated HTML files from all evaluated models
- **`assets/html_result.png`**: Visualization of benchmark results

### Evaluation Workflow

Each evaluation task is identified by a unique ID in the format `HTML-{number}` (e.g., `HTML-001`, `HTML-002`). The evaluation process follows these steps:

1. **Task Identification**: Read task details from `html_evaluation.xlsx` using the task ID
2. **Model Execution**: Each model generates a complete HTML page from the natural language prompt
3. **HTML Generation**: Models produce single-file HTML with all CSS and JavaScript embedded
4. **Human Evaluation**: Outputs are compared pairwise between Nex-N1 and each competitor model
5. **Result Classification**: Each comparison is classified as Win, Tie, or Lose based on overall quality

#### Dataset Structure (`html_evaluation.xlsx`):

Each test case contains:

- **id**: Unique identifier (HTML-001 to HTML-045)
- **测试能力** (Test Capability): Category of the test
- **指令** (Instruction): Natural language prompt in English or Chinese

#### Generated HTML Files:

**Naming Convention**: `HTML-{ID}_{model}.html`

Example: `HTML-001_nex-n1.html`
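
The naming convention above can be parsed mechanically when collecting the generated files for comparison. The following is a minimal sketch, not part of the repository: the regular expression assumes the ID is numeric and that everything between the first underscore and the `.html` extension is the model name, and the helper names `parse_generated_filename` and `group_by_task` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern for the documented convention `HTML-{ID}_{model}.html`.
# Assumption: the ID is numeric, and the model name is everything
# between the first underscore and the `.html` extension.
FILENAME_RE = re.compile(r"^(HTML-\d+)_(.+)\.html$")


class GeneratedFile(NamedTuple):
    task_id: str
    model: str


def parse_generated_filename(name: str) -> Optional[GeneratedFile]:
    """Return the task ID and model name, or None if the name does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return GeneratedFile(task_id=match.group(1), model=match.group(2))


def group_by_task(names):
    """Map each task ID to the sorted list of models that have a file for it."""
    grouped = {}
    for name in names:
        parsed = parse_generated_filename(name)
        if parsed is None:
            continue  # skip files that do not follow the convention
        grouped.setdefault(parsed.task_id, []).append(parsed.model)
    return {task: sorted(models) for task, models in grouped.items()}
```

For the documented example, `parse_generated_filename("HTML-001_nex-n1.html")` yields the task ID `HTML-001` and the model name `nex-n1`. Grouping by task ID makes it easy to check that every model produced an output before two outputs are compared.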
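
Win, Tie, and Lose percentages like those in the performance summary can be derived from a list of per-task pairwise labels. The sketch below is an illustration under stated assumptions: the label strings `"win"`, `"tie"`, and `"lose"` and the function name `outcome_rates` are hypothetical, and the sample labels are invented rather than taken from the dataset.

```python
from collections import Counter

# Assumed label vocabulary; the source only names the three outcome classes.
OUTCOMES = ("win", "tie", "lose")


def outcome_rates(labels):
    """Return the percentage of each outcome, rounded to one decimal place."""
    counts = Counter(labels)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons to summarise")
    return {outcome: round(100 * counts[outcome] / total, 1) for outcome in OUTCOMES}


# Invented example: ten comparisons against one hypothetical competitor.
sample = ["win"] * 5 + ["tie"] * 2 + ["lose"] * 3
rates = outcome_rates(sample)
print(rates)  # {'win': 50.0, 'tie': 20.0, 'lose': 30.0}
```

Rounding each rate independently to one decimal place means a row may sum to slightly more or less than 100%.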
Provider:
maas
Created:
2025-11-19