
Web2Code

Source: ModelScope community. Updated 2025-12-05; indexed 2025-03-22.
Download link:
https://modelscope.cn/datasets/MBZUAI/Web2Code
Resource overview:
# Dataset Details

Our Web2Code instruction-tuning dataset construction and instruction generation process involves four key components:

(1) Creation of new webpage image-code pair data: We generated high-quality HTML webpage-code pairs following the CodeAlpaca prompt using GPT-3.5 and converted them into instruction-following data.

(2) Refinement of existing webpage code generation data: We transformed existing datasets into an instruction-following data format similar to LLaVA data, so they can be used to train MLLMs.

(3) Creation of new text question-answer pair data: We generated a new question-answer pair dataset for webpage understanding, using the GPT-3.5-generated data from (1).

(4) Refinement of existing webpage understanding data: We refined the WebSRC question-answer data with GPT-4 to improve its quality.

More details are available in [[Web2Code](https://arxiv.org/abs/2406.20098)].

**Resources**: [[Paper](https://arxiv.org/abs/2406.20098)] [[Project Page](https://mbzuai-llm.github.io/webpage2code/)] [[Web2Code Dataset](https://huggingface.co/datasets/MBZUAI/Web2Code)] [[Croissant](https://huggingface.co/api/datasets/the-Lin/Web2Code/croissant)]

## Image Folder Structure

```
Web2Code_image
├── games
│   ├── 01
│   ├── ...
│   └── 09
├── jobs
│   ├── 03
│   ├── ...
│   └── 13
...
```

## Data Fields

```
{
  'id': '99720969-917D-4843-BB69-D09AF953F258',
  'image': 'pix2code/99720969-917D-4843-BB69-D09AF953F258.png',
  'conversations': [
    {'from': 'human', 'value': '<image>\nUse the webpage screenshot to generate HTML code as a replication of its structure. Manifest the code following Bootstrap layout.'},
    {'from': 'gpt', 'value': '<html>\n<header>\n<meta charset="utf-8"/>\n<meta content="width=device-width, initial-scale=1" name="viewport"/>\n<link crossorigin="anonymous" ...'}
  ]
}
```

## Statistics

<table>
  <tr> <th></th> <th>Data entries</th> <th>Images</th> </tr>
  <tr> <th>train</th> <td>827934</td> <td>815293</td> </tr>
  <tr> <th>eval</th> <td>5990</td> <td>1198</td> </tr>
</table>

## License

![Data License](https://img.shields.io/badge/Data%20License-CC%20By%204.0-red.svg)

**Usage and License Notices**: The data is intended and licensed for research use only. The dataset is CC BY 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.
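
## Reading a Record (illustrative sketch)

The record layout shown under Data Fields can be consumed roughly as follows. This is a minimal sketch, not an official loader: the helper names (`to_pairs`, `strip_image_token`, `image_path`) and the `IMAGE_ROOT` location are hypothetical, and the assumption that the `image` field is a path relative to the image root is ours. The `gpt` value is truncated in the card, so a short placeholder stands in for it here.

```python
from pathlib import Path

# Assumption: images are unpacked locally under this folder.
IMAGE_ROOT = Path("Web2Code_image")

# A record shaped like the Data Fields example. The 'gpt' value below is a
# shortened placeholder, because the card elides the full HTML.
record = {
    "id": "99720969-917D-4843-BB69-D09AF953F258",
    "image": "pix2code/99720969-917D-4843-BB69-D09AF953F258.png",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nUse the webpage screenshot to generate HTML "
                     "code as a replication of its structure. Manifest the "
                     "code following Bootstrap layout.",
        },
        {"from": "gpt", "value": "<html>\n<header>\n<meta charset=\"utf-8\"/>"},
    ],
}


def to_pairs(rec):
    """Pair each 'human' turn with the 'gpt' turn that directly follows it."""
    turns = rec["conversations"]
    return [
        (prev["value"], cur["value"])
        for prev, cur in zip(turns, turns[1:])
        if prev["from"] == "human" and cur["from"] == "gpt"
    ]


def strip_image_token(prompt):
    """Remove the '<image>' placeholder and surrounding whitespace."""
    return prompt.replace("<image>", "").strip()


def image_path(rec, root=IMAGE_ROOT):
    """Resolve the record's image field against the assumed image root."""
    return root / rec["image"]


if __name__ == "__main__":
    for instruction, response in to_pairs(record):
        print("Instruction:", strip_image_token(instruction))
        print("Response starts with:", response.splitlines()[0])
    print("Image file:", image_path(record).as_posix())
```

Keeping the turn pairing separate from path resolution means the same helpers should work for both the code-generation and the question-answer records, provided they share this `conversations` structure.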

Provided by:
maas
Created:
2025-03-17