Web2Code
Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-03-22.
Download link:
https://modelscope.cn/datasets/MBZUAI/Web2Code
Resource description:
# Dataset Details
Our Web2Code instruction-tuning dataset construction and instruction generation process involves four key components:
1. Creation of new webpage image-code pair data: We generated high-quality HTML webpage-code pairs following the CodeAlpaca prompt using GPT-3.5 and converted them into instruction-following data.
2. Refinement of existing webpage code generation data: We transformed existing datasets into an instruction-following data format similar to LLaVA data, so they can be used to train multimodal large language models (MLLMs).
3. Creation of new text question-answer pair data: We generated a new question-answer pair dataset for webpage understanding from the GPT-3.5-generated data in (1).
4. Refinement of existing webpage understanding data: We refined the WebSRC question-answer data using GPT-4 to improve its quality.

More details are available in [[Web2Code](https://arxiv.org/abs/2406.20098)].
**Resources**: [[Paper](https://arxiv.org/abs/2406.20098)] [[Project Page](https://mbzuai-llm.github.io/webpage2code/)] [[Web2Code Dataset](https://huggingface.co/datasets/MBZUAI/Web2Code)][[Croissant](https://huggingface.co/api/datasets/the-Lin/Web2Code/croissant)]
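Components (1) and (2) both end by wrapping a webpage image-code pair in an instruction-following record. The sketch below shows one plausible way to do that wrapping, using the record layout documented under Data Fields. The function name `to_instruction_record`, the default prompt, the example path, and the ID scheme are illustrative assumptions, not the authors' actual pipeline.

```python
import json
import uuid

# Assumed default instruction; the real dataset uses varied phrasings.
DEFAULT_PROMPT = (
    "Use the webpage screenshot to generate HTML code "
    "as a replication of its structure."
)


def to_instruction_record(image_path, html_code, prompt=DEFAULT_PROMPT, record_id=None):
    """Wrap one webpage image-code pair in a LLaVA-style conversation record."""
    return {
        # Hypothetical ID scheme: an upper-case UUID when no ID is supplied.
        "id": record_id or str(uuid.uuid4()).upper(),
        "image": image_path,
        "conversations": [
            # The "<image>" placeholder marks where the screenshot is attached.
            {"from": "human", "value": "<image>\n" + prompt},
            {"from": "gpt", "value": html_code},
        ],
    }


record = to_instruction_record(
    "pix2code/example.png",  # hypothetical relative image path
    "<html>\n<body>\n<h1>Hello</h1>\n</body>\n</html>",
    record_id="EXAMPLE-0001",
)
print(json.dumps(record, indent=2))
```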
## Image Folder Structure
```
Web2Code_image
├── games
│ ├── 01
│ ├── ...
│ └── 09
├── jobs
│ ├── 03
│ ├── ...
│ └── 13
...
```
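Given the layout above (category folders such as `games` and `jobs`, each holding numbered subfolders), a short script can check a local copy by counting image files per category. This is a minimal sketch: the root path `Web2Code_image` is taken from the tree above, while the set of image extensions is an assumption.

```python
from collections import Counter
from pathlib import Path

# Assumed image extensions; adjust if the local copy uses other formats.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_images_per_category(root):
    """Count image files under each top-level category folder of `root`."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            # The first path component below the root is the category, e.g. "games".
            category = path.relative_to(root).parts[0]
            counts[category] += 1
    return counts


root = Path("Web2Code_image")
if root.is_dir():  # only report when a local copy is present
    for category, n in sorted(count_images_per_category(root).items()):
        print(f"{category}: {n} images")
```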
## Data Fields
```
{
'id': '99720969-917D-4843-BB69-D09AF953F258',
'image': 'pix2code/99720969-917D-4843-BB69-D09AF953F258.png',
'conversations': [
{'from': 'human', 'value': '<image>\nUse the webpage screenshot to generate HTML code as a replication of its structure. Manifest the code following Bootstrap layout.'},
{'from': 'gpt', 'value': '<html>\n<header>\n<meta charset="utf-8"/>\n<meta content="width=device-width, initial-scale=1" name="viewport"/>\n<link crossorigin="anonymous" ...'}
]
}
```
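To make the record layout concrete, the sketch below pulls the instruction and the target HTML out of one record and resolves its image path. The record mirrors the example above with the HTML shortened to a placeholder. The helper name `extract_pairs` and the image root `Web2Code_image` are assumptions for illustration.

```python
from pathlib import Path

IMAGE_TOKEN = "<image>"              # placeholder seen in the human turn
IMAGE_ROOT = Path("Web2Code_image")  # assumed local image root

record = {
    "id": "99720969-917D-4843-BB69-D09AF953F258",
    "image": "pix2code/99720969-917D-4843-BB69-D09AF953F258.png",
    "conversations": [
        {"from": "human", "value": "<image>\nUse the webpage screenshot to generate HTML code as a replication of its structure. Manifest the code following Bootstrap layout."},
        # Shortened placeholder; the real value holds the full HTML document.
        {"from": "gpt", "value": "<html>\n<header>\n</header>\n</html>"},
    ],
}


def extract_pairs(record):
    """Return (instruction, response) pairs from alternating human/gpt turns."""
    pairs = []
    pending = None
    for turn in record["conversations"]:
        if turn["from"] == "human":
            # Drop the image placeholder and surrounding whitespace.
            pending = turn["value"].replace(IMAGE_TOKEN, "").strip()
        elif turn["from"] == "gpt" and pending is not None:
            pairs.append((pending, turn["value"]))
            pending = None
    return pairs


image_path = IMAGE_ROOT / record["image"]
pairs = extract_pairs(record)
print(image_path)
print(pairs[0][0])
```

Pairing turns in order, rather than indexing positions 0 and 1 directly, keeps the helper usable if a record holds more than one question-answer exchange.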
## Statistics
<table>
<tr>
<th></th> <th>Data entries</th> <th>Images</th>
</tr>
<tr>
<th>train</th> <td>827934</td> <td>815293</td>
</tr>
<tr>
<th>eval</th> <td>5990</td> <td>1198</td>
</tr>
</table>
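The counts in the table above show that the splits differ in how many data entries share an image. The arithmetic below computes the average for each split. These are averages only; the table does not say how entries are distributed across individual images.

```python
# Counts copied from the statistics table.
stats = {
    "train": {"entries": 827934, "images": 815293},
    "eval": {"entries": 5990, "images": 1198},
}

# Average number of data entries per image in each split.
ratios = {split: s["entries"] / s["images"] for split, s in stats.items()}

for split, ratio in ratios.items():
    print(f"{split}: {ratio:.3f} entries per image")
# train: 1.016 entries per image
# eval: 5.000 entries per image
```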
## License
**Usage and License Notices**: The data is intended and licensed for research use only. The dataset is CC BY 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
Provider:
maas
Created:
2025-03-17



