haidark1/WebSightDescribed

Name: haidark1/WebSightDescribed
Creator: haidark1
Published: 2024-06-04 12:50:56
License: 暂无描述

Hugging Face2024-06-04 更新2024-06-22 收录

下载链接：

https://hf-mirror.com/datasets/haidark1/WebSightDescribed

下载链接

链接失效反馈

官方服务：

资源简介：

--- language: - en license: cc-by-4.0 size_categories: - 100K<n<1M pretty_name: WebSightDescribed dataset_info: - config_name: v0.1 features: - name: image dtype: image - name: html dtype: string - name: nl_description dtype: string - name: id dtype: string splits: - name: train num_bytes: 45056592 num_examples: 526781 - name: valid num_bytes: 394432 num_examples: 4733 - name: test num_bytes: 16496 num_examples: 200 download_size: 144861710051 dataset_size: 368943620718.125 configs: - config_name: v0.1 data_files: - split: train path: wsd_data/train/data-* - split: valid path: wsd_data/valid/data-* - split: test path: wsd_data/test/data-* tags: - code - synthetic --- # Dataset Card for WebSightDescribed ## Dataset Description WebSightDescribed is a subset of [WebSight v0.1](https://huggingface.co/datasets/HuggingFaceM4/WebSight), augmenting the dataset with synthetically generated natural language descriptions of the websites. This dataset serves as a valuable resource for the task of generating html code from a natural language description. <details> <summary>Details for WebSightDescribed</summary> ## Data Fields An example of a sample appears as follows: ``` { 'image': PIL.Image, 'id': int, 'html': '<html>\n<style>\n{css}</style>\n{body}\n</html>', 'description': 'a natural language description of the UI' } ``` where `css` is the CSS code, and `body` is the body of the HTML code. In other words, the CSS code is embedded directly within the HTML code, facilitating the straightforward training of a model. The `id` field corresponds to the row number from [WebSight v0.1](https://huggingface.co/datasets/HuggingFaceM4/WebSight). ## Data Splits There are three splits, `train`, `valid`, and `test`, that contains 531,714 images, descriptions, and codes. ## Dataset Creation In addition to the steps used to create [WebSight v0.1](https://huggingface.co/datasets/HuggingFaceM4/WebSight), we used gpt=3.5-turbo to generate natural language descriptions of the UI represented by the html code. The following system prompt was used: ``` You are an AI assistant that specializes in HTML code. You are able to read HTML code and visualize the rendering of the HTML on a standard browser. When asked to write descriptions of HTML code, you describe how the user interface looks rendered in a standard browser (like Google Chrome). The user will provide you with HTML code and you will respond describing exactly how the code looks if rendered in a browser. Describe the colors exactly. Repeat ALL the text in the HTML code in your description. This is important - in your description do NOT omit any text rendered by the HTML code. Finally write your description like a customer describing a UI for a developer. Avoid any and all pleasantries, write the description like a straightforward description of the UI. ``` The html code was provided as the one and only user message and the response was saved as the natural language description. </details> ## Terms of Use By using the dataset, you agree to comply with the original licenses of the source content as well as the dataset license (CC-BY-4.0). Additionally, if you use this dataset to train a Machine Learning model, you agree to disclose your use of the dataset when releasing the model or an ML application using the model. ### Licensing Information License CC-BY-4.0. ### Citation Information If you are using this dataset, please cite this dataset and the original WebSight [technical report](https://arxiv.org/abs/2403.09029) ``` @misc{khan2024described, title={WebSightDescribed: Natural language description to UI}, author={Haidar Khan}, year={2024}, url={https://huggingface.co/datasets/haidark1/WebSightDescribed} } @misc{laurençon2024unlocking, title={Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset}, author={Hugo Laurençon and Léo Tronchon and Victor Sanh}, year={2024}, eprint={2403.09029}, archivePrefix={arXiv}, primaryClass={cs.HC} } ```

--- 语言: - 英语许可证: cc-by-4.0 规模类别: - 100K<n<1M 展示名称: WebSightDescribed 数据集信息: - 配置名称: v0.1 特征: - 字段名: 图像（image）数据类型: 图像 - 字段名: HTML代码（html）数据类型: 字符串 - 字段名: 自然语言描述（nl_description）数据类型: 字符串 - 字段名: 标识符（id）数据类型: 字符串划分: - 划分名称: 训练集（train）字节数: 45056592 样本数: 526781 - 划分名称: 验证集（valid）字节数: 394432 样本数: 4733 - 划分名称: 测试集（test）字节数: 16496 样本数: 200 下载大小: 144861710051 数据集总大小: 368943620718.125 配置项: - 配置名称: v0.1 数据文件: - 划分: 训练集（train）路径: wsd_data/train/data-* - 划分: 验证集（valid）路径: wsd_data/valid/data-* - 划分: 测试集（test）路径: wsd_data/test/data-* 标签: - 代码（code） - 合成数据（synthetic） --- # WebSightDescribed 数据集卡片 ## 数据集概述 WebSightDescribed 是 [WebSight v0.1](https://huggingface.co/datasets/HuggingFaceM4/WebSight) 的子集，为该数据集增补了针对各类网站的合成生成式自然语言描述。本数据集可作为「基于自然语言描述生成HTML代码」任务的优质研究资源。 <details> <summary>WebSightDescribed 详细说明</summary> ## 数据字段单条样本的示例格式如下： { 'image': PIL.Image, 'id': int, 'html': '<html> <style> {css}</style> {body} </html>', 'description': 'a natural language description of the UI' } 其中`css`为级联样式表（CSS）代码，`body`为HTML代码的主体部分。换言之，CSS代码直接内嵌于HTML代码中，可方便模型进行直接训练。`id`字段对应原 [WebSight v0.1](https://huggingface.co/datasets/HuggingFaceM4/WebSight) 数据集的行号。 ## 数据划分本数据集包含训练集、验证集与测试集三个划分，共计531,714条图像、描述与代码样本。 ## 数据集构建除构建 [WebSight v0.1](https://huggingface.co/datasets/HuggingFaceM4/WebSight) 所采用的步骤外，我们还使用GPT-3.5-turbo大语言模型（Large Language Model, LLM）生成了基于HTML代码所表征的用户界面（User Interface, UI）的自然语言描述。本次实验使用了如下系统提示词：您是一名专精于HTML代码的AI助手，能够阅读HTML代码并可视化该代码在标准浏览器中的渲染效果。当被要求描述HTML代码时，请描述该HTML代码在标准浏览器（如谷歌浏览器（Google Chrome））中渲染后的用户界面外观，准确描述配色细节，并在描述中完整复现HTML代码中的所有文本内容——这一点至关重要，请勿遗漏任何由HTML代码渲染的文本。最后，请以客户向开发人员描述用户界面的口吻撰写描述，避免任何客套话语，以直白简洁的方式描述界面。将HTML代码作为唯一的用户输入消息，模型的响应即被保存为自然语言描述。 </details> ## 使用条款使用本数据集即表示您同意遵守源内容的原始许可协议以及本数据集的许可协议（CC-BY-4.0）。此外，若您使用本数据集训练机器学习（Machine Learning, ML）模型，在发布该模型或基于该模型开发的机器学习应用时，需披露本数据集的使用情况。 ### 许可信息本数据集采用CC-BY-4.0许可协议。 ### 引用说明若您使用本数据集，请引用本数据集以及原始WebSight的[技术报告](https://arxiv.org/abs/2403.09029)： @misc{khan2024described, title={WebSightDescribed: Natural language description to UI}, author={Haidar Khan}, year={2024}, url={https://huggingface.co/datasets/haidark1/WebSightDescribed} } @misc{laurençon2024unlocking, title={Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset}, author={Hugo Laurençon and Léo Tronchon and Victor Sanh}, year={2024}, eprint={2403.09029}, archivePrefix={arXiv}, primaryClass={cs.HC} }

提供机构：

haidark1

原始信息汇总

数据集卡片 for WebSightDescribed

数据集描述

WebSightDescribed 是 WebSight v0.1 的一个子集，增加了对网站的合成自然语言描述。

该数据集对于从自然语言描述生成 HTML 代码的任务是一个宝贵的资源。

数据字段

一个样本示例如下： json { "image": "PIL.Image", "id": "int", "html": "<html> <style> {css}</style> {body} </html>", "description": "a natural language description of the UI" }

其中 css 是 CSS 代码，body 是 HTML 代码的主体。换句话说，CSS 代码直接嵌入在 HTML 代码中，便于模型的直接训练。id 字段对应于 WebSight v0.1 中的行号。

数据分割

数据集分为三个部分：train、valid 和 test，包含 531,714 张图片、描述和代码。

数据集创建

除了创建 WebSight v0.1 的步骤外，我们使用 gpt-3.5-turbo 生成 HTML 代码所代表的用户界面的自然语言描述。以下是使用的系统提示：

You are an AI assistant that specializes in HTML code. You are able to read HTML code and visualize the rendering of the HTML on a standard browser. When asked to write descriptions of HTML code, you describe how the user interface looks rendered in a standard browser (like Google Chrome). The user will provide you with HTML code and you will respond describing exactly how the code looks if rendered in a browser. Describe the colors exactly. Repeat ALL the text in the HTML code in your description. This is important - in your description do NOT omit any text rendered by the HTML code. Finally write your description like a customer describing a UI for a developer. Avoid any and all pleasantries, write the description like a straightforward description of the UI.

HTML 代码作为唯一的用户消息提供，响应保存为自然语言描述。

使用条款

使用该数据集时，您同意遵守源内容的原始许可证以及数据集许可证（CC-BY-4.0）。此外，如果您使用此数据集训练机器学习模型，您同意在发布模型或使用该模型的 ML 应用程序时披露您对该数据集的使用。

许可信息

许可证：CC-BY-4.0。

引用信息

如果您使用此数据集，请引用此数据集和原始 WebSight 技术报告：

bibtex @misc{khan2024described, title={WebSightDescribed: Natural language description to UI}, author={Haidar Khan}, year={2024}, url={https://huggingface.co/datasets/haidark1/WebSightDescribed} }

@misc{laurençon2024unlocking, title={Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset}, author={Hugo Laurençon and Léo Tronchon and Victor Sanh}, year={2024}, eprint={2403.09029}, archivePrefix={arXiv}, primaryClass={cs.HC} }

搜集汇总

数据集介绍

构建方式

在网页界面生成领域，WebSightDescribed数据集作为WebSight v0.1的子集，通过系统化增强策略构建而成。其核心方法在于利用GPT-3.5-turbo模型，对原始HTML代码所呈现的用户界面进行自然语言描述生成。构建过程中，研究者采用特定系统提示，要求模型以标准浏览器渲染视角，精确描述界面色彩与文本内容，确保生成的描述全面覆盖UI视觉元素。这一过程将CSS代码直接嵌入HTML结构，形成包含图像、代码与描述的三元组样本，最终构建出包含53万余条数据的大规模语料库。

使用方法

在实践应用中，该数据集主要服务于从自然语言描述生成HTML代码的研究任务。使用者可通过HuggingFace平台直接加载数据集，获取包含图像、代码与描述字段的标准格式数据。研究人员可基于训练集样本，构建端到端的视觉-语言-代码生成模型，利用验证集进行超参数调优，最终通过测试集评估模型性能。数据集支持直接应用于多模态Transformer架构的训练，为网页界面自动生成、代码辅助开发等前沿研究方向提供了可靠的实验基础。

背景与挑战

背景概述

在人工智能与前端开发交叉领域，自动化代码生成技术正逐步革新传统网页构建流程。WebSightDescribed数据集由Haidar Khan于2024年基于WebSight v0.1数据集构建，其核心研究聚焦于通过自然语言描述驱动网页界面代码的自动生成。该数据集依托HuggingFaceM4机构的前期工作，通过合成方法为网页截图与HTML代码对增添了丰富的语言描述，旨在推动多模态模型在理解视觉界面与生成结构化代码方面的能力演进。作为连接自然语言处理与计算机视觉的桥梁，该资源显著促进了智能网页开发工具的研究，为自动化界面设计提供了关键数据支撑。

当前挑战

该数据集致力于解决从自然语言描述到网页代码生成的复杂转换问题，其核心挑战在于如何精准捕捉语言指令中的视觉与布局语义，并映射为符合Web标准的HTML与CSS代码。构建过程中的挑战尤为突出，包括确保合成描述与真实界面渲染的一致性，避免描述遗漏界面中的文本或样式细节，以及维持生成描述在风格上的客观性与完整性。此外，数据规模的扩展与质量把控亦需平衡，以支撑模型对多样化网页元素与交互逻辑的泛化学习。

常用场景

经典使用场景

在网页设计与前端开发领域，WebSightDescribed数据集为自然语言到HTML代码的生成任务提供了关键资源。其经典使用场景集中于训练端到端的深度学习模型，通过输入描述用户界面的自然语言文本，直接输出对应的HTML与CSS代码。这种场景常应用于自动化网页构建的原型验证，模型能够依据文本描述生成视觉上准确的网页布局，从而简化前端开发的初始设计流程。

解决学术问题

该数据集有效解决了多模态人工智能中文本到结构化代码转换的学术难题。它弥合了自然语言理解与程序生成之间的语义鸿沟，为研究代码合成、视觉语言对齐以及人机交互界面自动化设计提供了基准。其意义在于推动了可解释的代码生成模型发展，降低了网页开发的技术门槛，并为跨模态表示学习提供了丰富的训练样本。

实际应用

在实际应用中，WebSightDescribed数据集支持快速原型工具和低代码平台的开发。设计师或产品经理可通过自然语言描述网页外观，系统自动生成可运行的HTML代码，大幅提升界面迭代效率。此外，该数据集还能辅助教育工具，帮助初学者直观理解网页结构与代码对应关系，或集成于智能助手，实现语音或文本驱动的网页创建。

数据集最近研究