html-eval
Source: ModelScope community. Updated 2026-01-07; indexed 2025-11-22.
Download link: https://modelscope.cn/datasets/nex-agi/html-eval
Description:
# HTML Generation Evaluation
## Overview
To evaluate our model's HTML generation performance, we have constructed a comprehensive evaluation dataset covering diverse web development scenarios. This dataset includes:
1. **Web Development**: Landing pages, e-commerce sites, responsive layouts, and financial dashboards
2. **3D Scene Design**: WebGL scenes, Three.js applications, and interactive 3D models
3. **Game Design**: Browser-based games, interactive gameplay, and physics-based games
4. **Physics Simulation**: Physics engines, particle systems, and molecular dynamics
5. **Data Visualization**: Charts, graphs, and interactive data displays
## Dataset Composition
Our evaluation dataset spans **5 distinct categories** with a total of **45 test cases**. The distribution across categories is as follows:
| Category | Description | Test Cases |
|----------|-------------|------------|
| **二维网站界面设计 (Web Development)** | Landing pages, e-commerce sites, responsive layouts, financial dashboards | 9 |
| **游戏设计 (Game Design)** | Browser-based games, interactive gameplay, infinite runners, 3D games | 10 |
| **物理模拟 (Physics Simulation)** | Physics engines, particle systems, molecular dynamics, motion simulations | 10 |
| **3D场景设计 (3D Scene Design)** | WebGL scenes, Three.js applications, interactive 3D models | 9 |
| **数据可视化 (Data Visualization)** | Charts, graphs, interactive data displays, dashboard components | 7 |
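
As a quick arithmetic check on the table above, the per-category counts can be summed and converted to shares of the dataset. This is a minimal sketch: the dictionary simply transcribes the table, and the variable names are illustrative rather than part of the dataset.

```python
# Test-case counts transcribed from the composition table above.
category_counts = {
    "Web Development": 9,
    "Game Design": 10,
    "Physics Simulation": 10,
    "3D Scene Design": 9,
    "Data Visualization": 7,
}

# Total number of test cases across all categories.
total = sum(category_counts.values())

# Share of the dataset contributed by each category, in percent.
shares = {name: round(100 * n / total, 1) for name, n in category_counts.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The total matches the 45 test cases stated in the text.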
## Model Performance
In pairwise human comparisons, our model (Nex-N1) wins more often than it loses against four of the five comparison models (GPT-5, Kimi-K2-Thinking, GLM-4.6, and MiniMax-M2) and loses more often than it wins against Claude-Sonnet-4.5:
<div align="center">
<img src="./html_result.png" width="70%">
</div>
### Performance Summary
| Model Comparison | Nex-N1 Win | Tie | Nex-N1 Lose |
|-----------------|-----------|-----|------------|
| **vs Claude-Sonnet-4.5** | 26.5% | 16.3% | 57.1% |
| **vs GPT-5** | 53.1% | 16.3% | 30.6% |
| **vs Kimi-K2-Thinking** | 55.1% | 6.1% | 38.8% |
| **vs GLM-4.6** | 61.2% | 8.2% | 30.6% |
| **vs MiniMax-M2** | 69.4% | 8.2% | 22.4% |
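
One way to read the table is to confirm that each row accounts for roughly 100% of comparisons and to compute a net margin (win rate minus lose rate). The sketch below transcribes the table; `net_margin` is an illustrative metric introduced here, not one reported by the authors.

```python
# (win %, tie %, lose %) for Nex-N1 against each model, transcribed from the table.
results = {
    "Claude-Sonnet-4.5": (26.5, 16.3, 57.1),
    "GPT-5": (53.1, 16.3, 30.6),
    "Kimi-K2-Thinking": (55.1, 6.1, 38.8),
    "GLM-4.6": (61.2, 8.2, 30.6),
    "MiniMax-M2": (69.4, 8.2, 22.4),
}


def net_margin(win, tie, lose):
    """Win rate minus lose rate, in percentage points (illustrative metric)."""
    return round(win - lose, 1)


# Each row should sum to about 100; small deviations come from rounding.
row_sums = {model: round(sum(row), 1) for model, row in results.items()}
margins = {model: net_margin(*row) for model, row in results.items()}

for model in results:
    print(f"{model}: sum={row_sums[model]}, net margin={margins[model]}")
```

Under this metric the margin is negative only against Claude-Sonnet-4.5.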
## Repository Structure
### Data Files
- **`html_evaluation.xlsx`**: Complete dataset with task IDs, categories, and instructions for all test cases
- **`html/`**: Generated HTML files from all evaluated models
- **`assets/html_result.png`**: Visualization of benchmark results
### Evaluation Workflow
Each evaluation task is identified by a unique ID in the format `HTML-{number}` (e.g., `HTML-001`, `HTML-002`).
The evaluation process follows these steps:
1. **Task Identification**: Read task details from `html_evaluation.xlsx` using the task ID
2. **Model Execution**: Each model generates a complete HTML page from the natural language prompt
3. **HTML Generation**: Models produce single-file HTML with all CSS and JavaScript embedded
4. **Human Evaluation**: Outputs are compared pairwise between Nex-N1 and each competitor model
5. **Result Classification**: Each comparison is classified as Win, Tie, or Lose based on overall quality
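
The last two steps can be sketched as a small aggregation: given one Win/Tie/Lose label per pairwise comparison, compute the percentage of each label. This is a hedged illustration only. The label strings, the `summarize` function, and the sample data are assumptions, and the source does not describe the tooling the authors used.

```python
from collections import Counter

# Assumed label vocabulary for pairwise outcomes from Nex-N1's perspective.
LABELS = ("win", "tie", "lose")


def summarize(outcomes):
    """Return the percentage of each label among a list of pairwise outcomes."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    unknown = set(outcomes) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(outcomes)
    n = len(outcomes)
    return {label: round(100 * counts[label] / n, 1) for label in LABELS}


# Hypothetical judgments against one competitor, for illustration only.
example = ["win", "win", "tie", "lose"]
print(summarize(example))
```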
#### Dataset Structure (`html_evaluation.xlsx`):
Each test case contains:
- **id**: Unique identifier (HTML-001 to HTML-045)
- **测试能力** (Test Capability): Category of the test
- **指令** (Instruction): Natural language prompt in English or Chinese
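
A row with these three fields could be sanity-checked as sketched below. The column names come from the list above. The three-digit zero padding is inferred from the example IDs, and `validate_row` and the sample row are hypothetical. Loading the spreadsheet itself is left out because the source does not say which reader to use.

```python
import re

# ID format inferred from the examples HTML-001 .. HTML-045 (an assumption).
ID_PATTERN = re.compile(r"HTML-(\d{3})")
REQUIRED_COLUMNS = ("id", "测试能力", "指令")


def validate_row(row, max_id=45):
    """Return a list of problems found in one test-case row (empty if none)."""
    problems = []
    # Every required column must be present and non-blank.
    for col in REQUIRED_COLUMNS:
        if not str(row.get(col, "")).strip():
            problems.append(f"missing or empty column: {col}")
    # The id must match the pattern and fall inside the documented range.
    match = ID_PATTERN.fullmatch(str(row.get("id", "")))
    if match is None:
        problems.append("id does not match HTML-NNN")
    elif not 1 <= int(match.group(1)) <= max_id:
        problems.append(f"id outside HTML-001..HTML-{max_id:03d}")
    return problems


# Hypothetical row, for illustration only.
sample = {"id": "HTML-001", "测试能力": "游戏设计", "指令": "Build a browser game."}
print(validate_row(sample))
```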
#### Generated HTML Files:
**Naming Convention**: `HTML-{ID}_{model}.html`
Example: `HTML-001_nex-n1.html`
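
The naming convention can be parsed or produced with a short helper. This is a sketch based only on the example filename above. It assumes the task ID never contains an underscore, and the function names are illustrative.

```python
import re

# Task ID, an underscore, the model name, then the .html extension.
FILENAME_PATTERN = re.compile(r"(HTML-\d+)_(.+)\.html")


def parse_output_filename(filename):
    """Split a generated file's name into (task_id, model)."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return match.group(1), match.group(2)


def build_output_filename(task_id, model):
    """Build a filename of the form HTML-{ID}_{model}.html."""
    return f"{task_id}_{model}.html"


print(parse_output_filename("HTML-001_nex-n1.html"))
```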
Provider: maas
Created: 2025-11-19



