WebSight
Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-02.
Download link:
https://modelscope.cn/datasets/HuggingFaceM4/WebSight
Description:
# Dataset Card for WebSight
## Dataset Description
WebSight is a large synthetic dataset of HTML/CSS code for synthetically generated English websites, each paired with a corresponding screenshot.
This dataset is a valuable resource for tasks such as generating UI code from a screenshot.
It comes in two versions:
- v0.1: Websites are coded with HTML + CSS. They do not include real images.
- v0.2: Websites are coded with HTML + Tailwind CSS. They do include real images.
Compared to v0.1, version v0.2 brings the following improvements:
- Websites include real images (related to the context of the website)
- Uses Tailwind CSS instead of traditional CSS
- Contains twice as many examples
- Contains more tables
- Higher-resolution screenshots
- Adds a column containing the LLM-generated idea used to create each website
<details>
<summary>Details for WebSight-v0.1 (HTML + CSS)</summary>
## Data Fields
An example of a sample appears as follows:
```
{
'images': PIL.Image,
'text': '<html>\n<style>\n{css}</style>\n{body}\n</html>',
}
```
where `css` is the CSS code, and `body` is the body of the HTML code.
In other words, the CSS is embedded directly in the HTML, so each sample is a single self-contained string that a model can be trained on directly.
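Because the layout of `text` is fixed, a sample can be split back into its CSS and body parts when they are needed separately. The following is a minimal sketch, assuming the field follows the template above exactly; `split_v01_text` is a hypothetical helper, not part of any dataset tooling.

```python
import re
from typing import Tuple

# Mirrors the documented layout: '<html>\n<style>\n{css}</style>\n{body}\n</html>'
_V01_PATTERN = re.compile(
    r"\A<html>\n<style>\n(?P<css>.*?)</style>\n(?P<body>.*)\n</html>\Z",
    re.DOTALL,
)


def split_v01_text(text: str) -> Tuple[str, str]:
    """Return (css, body) from a v0.1 `text` field.

    ASSUMPTION: the sample follows the documented template exactly;
    anything else raises ValueError instead of being guessed at.
    """
    match = _V01_PATTERN.match(text)
    if match is None:
        raise ValueError("text does not follow the v0.1 layout")
    return match.group("css"), match.group("body")
```

The CSS group is matched lazily so that it stops at the first `</style>`, while the body group is greedy so that it runs to the final `</html>`.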
## Data Splits
There is only one split, `train`, which contains 822,987 image-code pairs.
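With more than 800,000 samples, it can be convenient to stream the split rather than download it in full. The sketch below assumes the Hugging Face `datasets` library is installed, that the repository id `HuggingFaceM4/WebSight` (taken from the download link) resolves on the hub being used, and that the v0.1 data is exposed under a configuration named `v0.1`. `stream_websight` and `take` are hypothetical helpers.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def take(samples: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Materialize only the first n samples of a (possibly streaming) dataset."""
    return list(islice(samples, n))


def stream_websight(config: str = "v0.1") -> Iterator[Dict[str, Any]]:
    """Lazily iterate over the `train` split without downloading all of it."""
    # Imported here so that `take` stays usable without `datasets` installed.
    from datasets import load_dataset

    # ASSUMPTION: repository id and configuration name as described above.
    dataset = load_dataset(
        "HuggingFaceM4/WebSight", config, split="train", streaming=True
    )
    return iter(dataset)
```

Under those assumptions, `take(stream_websight(), 3)` would fetch just the first three samples.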
## Dataset Creation
This dataset was created using [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) to generate random website ideas with the prompt:
```
Generate diverse website layout ideas for different companies, each with a unique design element.
Examples include: a car company site with a left column, a webpage footer with a centered logo.
Explore variations in colors, positions, and company fields.
Don't give any explanations or recognition that you have understood the request,
just give the list of 10 ideas, with a line break between each.
```
The ideas were then passed to [Deepseek-Coder-33b-Instruct](https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct) with the prompt:
```
Create a very SIMPLE and SHORT website with the following elements: {idea}
Be creative with the design, size, position of the elements, columns, etc...
Don't give any explanation, just the content of the HTML code `index.html` starting with `<!DOCTYPE html>`,
followed by the CSS code `styles.css` starting with `/* Global Styles */`.
Write real and short sentences for the paragraphs, don't use Lorem ipsum.
When you want to display an image, don't use <img> in the HTML, always display a colored rectangle instead.
```
Following these steps, the HTML and CSS code was extracted from the Deepseek-Coder outputs and formatted into the structure `'<html>\n<style>\n{css}</style>\n{body}\n</html>'`.
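The extraction code itself is not reproduced in this card. As a rough sketch of what such a step could look like, the function below relies only on the two markers the prompt asks for (`<!DOCTYPE html>` and `/* Global Styles */`). The `<body>` regex, the handling of Markdown fences, and the name `format_sample` are assumptions made for illustration.

```python
import re
from typing import Optional

HTML_MARKER = "<!DOCTYPE html>"
CSS_MARKER = "/* Global Styles */"
_BODY_RE = re.compile(r"<body\b.*?</body>", re.DOTALL | re.IGNORECASE)


def format_sample(raw_output: str) -> Optional[str]:
    """Turn one raw model completion into the documented v0.1 `text` layout.

    Returns None when the completion does not contain both markers in the
    expected order, or has no <body> element.
    """
    html_start = raw_output.find(HTML_MARKER)
    css_start = raw_output.find(CSS_MARKER)
    if html_start == -1 or css_start == -1 or css_start < html_start:
        return None

    html_part = raw_output[html_start:css_start]
    # ASSUMPTION: a model may wrap its answer in a Markdown fence, so
    # anything from a closing ``` onwards is dropped from the CSS.
    css = raw_output[css_start:].split("```", 1)[0].strip() + "\n"

    body_match = _BODY_RE.search(html_part)
    if body_match is None:
        return None
    body = body_match.group(0)

    return f"<html>\n<style>\n{css}</style>\n{body}\n</html>"
```

Returning `None` for malformed completions lets a caller filter them out rather than store a partially formatted sample.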
</details>
<details>
<summary>Details for WebSight-v0.2 (HTML + Tailwind CSS)</summary>
## Data Fields
An example of a sample appears as follows:
```
{
'images': PIL.Image,
'text': '<html>\n<link href="https://cdn.jsdelivr.net/npm/tailwindcss@2.2.19/dist/tailwind.min.css" rel="stylesheet">\n{body}\n</html>',
}
```
where `body` is the body of the HTML code. It contains the Tailwind CSS code directly, which makes training a model straightforward.
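Since Tailwind expresses styling through utility classes on the elements themselves, one way to inspect a v0.2 sample is to count the classes it uses. This is a minimal sketch using only the Python standard library; `tailwind_class_counts` is a hypothetical helper, and it treats every whitespace-separated token in a `class` attribute as one class.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class _ClassCollector(HTMLParser):
    """Collects every token found in `class` attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.classes: Counter = Counter()

    def handle_starttag(
        self, tag: str, attrs: List[Tuple[str, Optional[str]]]
    ) -> None:
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def tailwind_class_counts(text: str) -> Counter:
    """Count utility classes in the `text` field of a v0.2 sample."""
    collector = _ClassCollector()
    collector.feed(text)
    collector.close()
    return collector.classes
```

Such counts could be used, for example, to compare how often layout utilities appear across samples.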
## Data Splits
There is only one split, `train`, that contains TO DO images and codes.
## Dataset Creation
TO DO.
For now, the creation of the dataset is documented in the technical report.
</details>
## Terms of Use
By using the dataset, you agree to comply with the original licenses of the source content as well as the dataset license (CC-BY-4.0). Additionally, if you use this dataset to train a Machine Learning model, you agree to disclose your use of the dataset when releasing the model or an ML application using the model.
### Licensing Information
License: CC-BY-4.0.
### Citation Information
If you use this dataset, please cite our [technical report](https://arxiv.org/abs/2403.09029):
```
@misc{laurençon2024unlocking,
title={Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset},
author={Hugo Laurençon and Léo Tronchon and Victor Sanh},
year={2024},
eprint={2403.09029},
archivePrefix={arXiv},
primaryClass={cs.HC}
}
```
Provider: maas
Created: 2025-08-01



