
xcodemind/webcode2m_purified

Source: Hugging Face. Updated 2024-10-28. Indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/xcodemind/webcode2m_purified
Resource description:
---
license: cc-by-4.0
size_categories:
- 100B<n<1T
task_categories:
- image-to-text
pretty_name: WebCode2M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.parquet
tags:
- code
---

WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs

Features:

- `image`: the screenshot of the webpage.
- `bbox`: the layout information, i.e., the bounding boxes (Bbox) of all the elements in the webpage, which contain the size, position, and hierarchy information.
- `text`: the webpage code text, including HTML/CSS code.
- `scale`: the scale of the screenshot, in the format [width, height].
- `lang`: the main language of the text content displayed on the rendered page (excluding HTML/CSS code). It is generated by a widely used [model](https://huggingface.co/papluca/xlm-roberta-base-language-detection) on HuggingFace, which achieved very high accuracy on its evaluation set. Currently, it supports the following 20 languages: Arabic (ar), Bulgarian (bg), German (de), Modern Greek (el), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th), Turkish (tr), Urdu (ur), Vietnamese (vi), and Chinese (zh).
- `tokens`: the token counts of the CSS and HTML code, in the format [CSS length, HTML length]. The tokens are generated by the [GPT-2 tokenizer](https://huggingface.co/openai-community/gpt2).
- `score`: the score assigned by the neural scorer proposed in the paper.
- `hash`: the hash code of the image object.

**Warning**: This dataset is sourced from the internet and, despite filtering efforts, may still contain a small amount of inappropriate content, such as explicit material or violence. Users should exercise caution.
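The feature list above can be illustrated with a minimal Python sketch that filters records by language, total token count, and score. Several things here are assumptions rather than facts from the card: that `lang` stores the two-letter code (e.g. `"en"`), that `tokens` is a two-element list of integers, and that `score` is a number where higher is better. The helper names (`total_tokens`, `keep`, `filter_records`, `stream_train_split`) and the thresholds are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Iterable, Iterator


def total_tokens(record: Dict[str, Any]) -> int:
    """Sum the CSS and HTML token counts stored as [CSS length, HTML length]."""
    css_len, html_len = record["tokens"]
    return css_len + html_len


def keep(record: Dict[str, Any], lang: str = "en",
         max_tokens: int = 2048, min_score: float = 0.0) -> bool:
    """Return True if the record matches the language and both thresholds.

    Assumption: `lang` holds a code such as "en" and a higher `score` is better.
    The default thresholds are illustrative, not recommended values.
    """
    return (
        record["lang"] == lang
        and total_tokens(record) <= max_tokens
        and record["score"] >= min_score
    )


def filter_records(records: Iterable[Dict[str, Any]],
                   **criteria: Any) -> Iterator[Dict[str, Any]]:
    """Lazily yield only the records that pass `keep`."""
    for record in records:
        if keep(record, **criteria):
            yield record


def stream_train_split():
    """Open the train split in streaming mode (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the helpers above work offline
    return load_dataset("xcodemind/webcode2m_purified", split="train", streaming=True)


if __name__ == "__main__":
    # Synthetic records that mimic the documented fields (values are made up).
    samples = [
        {"lang": "en", "tokens": [300, 900], "score": 0.8, "scale": [1280, 720]},
        {"lang": "de", "tokens": [100, 200], "score": 0.9, "scale": [1280, 720]},
        {"lang": "en", "tokens": [2000, 3000], "score": 0.7, "scale": [1280, 720]},
    ]
    for rec in filter_records(samples, lang="en", max_tokens=2048, min_score=0.5):
        print(rec["lang"], total_tokens(rec), rec["score"])
```

Because the card lists the size category as `100B<n<1T`, streaming is likely more practical than a full download; under that assumption, `filter_records(stream_train_split(), lang="en")` would apply the same filter to the hosted data.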
Provider:
xcodemind