WebWorldData

Name: WebWorldData
Creator: maas
Published: 2026-05-14 17:30:13
License: No description available

ModelScope community. Updated 2026-05-14; indexed 2026-05-10.

Download link:

https://modelscope.cn/datasets/Qwen/WebWorldData


Resource description:

# WebWorldData 🌐

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/LICENSE-2.0)
[![GitHub](https://img.shields.io/badge/GitHub-WebWorld-4b32c3?logo=github)](https://github.com/QwenLM/WebWorld)
[![Dataset](https://img.shields.io/badge/HF%20Dataset-WebWorldData-yellow?logo=huggingface)](https://huggingface.co/datasets/Qwen/WebWorldData)
[![MS Dataset](https://img.shields.io/badge/ModelScope-Dataset-7B42BC)](https://modelscope.cn/datasets/Qwen/WebWorldData)
[![8B](https://img.shields.io/badge/Model-8B-green?logo=huggingface)](https://huggingface.co/Qwen/WebWorld-8B)
[![MS 8B](https://img.shields.io/badge/ModelScope-8B-7B42BC)](https://modelscope.cn/models/Qwen/WebWorld-8B)
[![14B](https://img.shields.io/badge/Model-14B-green?logo=huggingface)](https://huggingface.co/Qwen/WebWorld-14B)
[![MS 14B](https://img.shields.io/badge/ModelScope-14B-7B42BC)](https://modelscope.cn/models/Qwen/WebWorld-14B)
[![32B](https://img.shields.io/badge/Model-32B-green?logo=huggingface)](https://huggingface.co/Qwen/WebWorld-32B)
[![MS 32B](https://img.shields.io/badge/ModelScope-32B-7B42BC)](https://modelscope.cn/models/Qwen/WebWorld-32B)

## Overview

**WebWorldData** is a large-scale dataset of **1.06M web interaction trajectories** collected from the open web, designed for training browser world models. It is the training data behind the [WebWorld](https://github.com/QwenLM/WebWorld) model series.

Each trajectory consists of sequences of `(state, action, next_state)` transitions, where states are represented as A11y Trees extracted from real websites using Playwright.
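To make the transition structure concrete, the following is a minimal sketch of how a trajectory might be held in memory. The names `Transition` and `chain_transitions` are illustrative assumptions and are not part of the dataset or the WebWorld codebase; states are treated as plain strings holding a serialized A11y tree.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Transition:
    """One (state, action, next_state) step; states are serialized A11y trees."""
    state: str
    action: str
    next_state: str


def chain_transitions(initial_state: str, steps: List[Tuple[str, str]]) -> List[Transition]:
    """Build a trajectory from an initial state and (action, next_state) pairs.

    Each step starts from the state produced by the previous step, so the
    resulting list is a connected chain of transitions.
    """
    transitions = []
    state = initial_state
    for action, next_state in steps:
        transitions.append(Transition(state, action, next_state))
        state = next_state  # the next step begins where this one ended
    return transitions
```

Chaining this way keeps the invariant that `transitions[i].next_state == transitions[i + 1].state`, which is what makes a list of steps a trajectory rather than a bag of unrelated page pairs.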
## Dataset Statistics

| Statistic | Value |
|---|---|
| **Total Trajectories** | 1,059,348 |
| **Total Size** | ~50.9 GB |
| **Languages** | English, Chinese |
| **Max Context Length** | 30K tokens |
| **Max Trajectory Turns** | 30+ steps |
| **State Format** | A11y Tree (primary), HTML, XML, Markdown, Natural Language |
| **Source Websites** | 680K+ URLs from FineWeb, CCI 3.0, and curated lists |

## Data Sources

The dataset is collected through a **scalable hierarchical pipeline**:

| Source | Strategy | Scale | Description |
|---|---|---|---|
| **Level 1: Randomized Crawling** | Rule-based crawlers | 293K | Randomized exploration on websites from pre-training corpora (FineWeb, CCI 3.0), aligned with the base model's linguistic priors |
| **Level 2: Autonomous Exploration** | LLM-driven agents | 38K | Agents autonomously explore websites by generating their own objectives, producing long-horizon trajectories up to 30 steps |
| **Level 3: Task-Oriented Execution** | Synthetic tasks | 94K | Agents execute synthesized web tasks through seed extraction, diversification, and paraphrasing |
| **Open Source** | AgentTrek, etc. | 38K | Reformatted open-source agent trajectories |
| **Multi-Format** | Format conversion | 48K | Trajectories converted to HTML, XML, Markdown, Playwright formats |
| **Interaction** | General QA + chat | 548K | General instruction-following and QA data to preserve conversational abilities |

## Data Format

Each sample is a multi-turn conversation in JSONL format:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a web world model. I will provide you with an initial page state and a sequence of actions. For each action, predict the resulting page state.\nStrictly maintain the original format. Output only the full page state without explanations, code, or truncation."
    },
    {
      "role": "user",
      "content": "Initial Page State:\nRootWebArea 'Example Site'\n\t[1] banner ...\n\nFirst Action: 'click([32])'\n\nNext Page State:"
    },
    {
      "role": "assistant",
      "content": "RootWebArea 'Example Site - News'\n\t[1] banner ...\n\t[50] main ..."
    },
    {
      "role": "user",
      "content": "Continue the trajectory. Given the previous state, predict the next page state after this action.\n\nAction: 'fill([19], \"weather today\")'\n\nNext Page State:"
    },
    {
      "role": "assistant",
      "content": "RootWebArea 'Example Site - News'\n\t[1] banner ...\n\t[19] textbox ..., value='weather today' ..."
    }
  ]
}
```

## Domain Distribution

The dataset covers diverse web domains:

| Domain | Share |
|---|---|
| Technology | 15.2% |
| E-Commerce / Shopping | 13.8% |
| News & Media | 12.1% |
| Education | 10.5% |
| Entertainment | 9.3% |
| Lifestyle | 8.7% |
| Business & Finance | 7.9% |
| Government & Public Services | 6.4% |
| Health | 5.8% |
| Other | 10.3% |

## Action Space

Trajectories use a unified action space expressed as Python-style function calls:

| Category | Actions |
|---|---|
| **Element** | `click`, `fill`, `select_option`, `hover` |
| **Mouse** | `mouse_move`, `mouse_click`, `mouse_down`, `mouse_up` |
| **Keyboard** | `keyboard_press`, `keyboard_type` |
| **Browser** | `scroll`, `goto`, `go_back`, `go_forward`, `tab_new`, `tab_close`, `tab_focus` |
| **Meta** | `send_msg_to_user`, `noop`, `infeasible` |

Action distribution: Element interactions (83.4%), Browser & navigation (11.9%), Meta & control (2.4%), Keyboard (1.2%), Coordinate & mouse (1.1%).

## Filtering & Safety

The dataset undergoes rigorous multi-stage filtering:

1. **Rule-based filtering**: Website reachability checks, banned keyword filtering (pornography, gambling, violence), and pruning of no-op transitions from trajectories
2. **LLM-based URL filtering**: Each URL is scored on accessibility, content suitability, interactivity, and engineering quality
3. **Trajectory-level filtering**: Max 30K tokens, max 30 turns, keyword safety checks

All data is collected from publicly accessible webpages in compliance with `robots.txt` protocols.

## Usage

```python
from datasets import load_dataset

# Downloads the dataset from the Hugging Face Hub (about 50.9 GB in total).
dataset = load_dataset("Qwen/WebWorldData")
```

## Intended Use

- Training browser world models for web simulation
- Generating synthetic trajectories for web agent fine-tuning
- Research on world modeling, environment simulation, and agent learning

## Limitations

- Data is collected from publicly accessible webpages; residual PII may exist despite filtering
- Web content is inherently non-deterministic (ads, A/B tests, dynamic widgets), so some trajectories may not be perfectly reproducible
- Domain distribution reflects the composition of the FineWeb and CCI 3.0 pre-training corpora

## Associated Models

| Model | Link |
|---|---|
| WebWorld-8B | [🤗 HuggingFace](https://huggingface.co/Qwen/WebWorld-8B) |
| WebWorld-14B | [🤗 HuggingFace](https://huggingface.co/Qwen/WebWorld-14B) |
| WebWorld-32B | [🤗 HuggingFace](https://huggingface.co/Qwen/WebWorld-32B) |

## Citation

```bibtex
@misc{xiao2026webworldlargescaleworldmodel,
  title={WebWorld: A Large-Scale World Model for Web Agent Training},
  author={Zikai Xiao and Jianhong Tu and Chuhang Zou and Yuxin Zuo and Zhi Li and Peng Wang and Bowen Yu and Fei Huang and Junyang Lin and Zuozhu Liu},
  year={2026},
  eprint={2602.14721},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.14721},
}
```

## License

This dataset is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
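## Example: unpacking a sample

The conversation layout under Data Format and the Python-style calls under Action Space can be unpacked with a few lines of standard-library code. The sketch below is not taken from the WebWorld repository: the helper names `messages_to_transitions` and `parse_action` are hypothetical, and the regular expressions assume every sample follows the prompt wording of the single example shown on this card ("Initial Page State:", "First Action:", "Action:", "Next Page State:"). Samples in other state formats, or from the general QA portion, may not match.

```python
import ast
import re
from typing import Any, Dict, List, Tuple

# Assumed prompt layout, inferred from the one sample shown on this card.
_INITIAL_RE = re.compile(r"Initial Page State:\n(.*?)\n\nFirst Action:", re.DOTALL)
_ACTION_RE = re.compile(r"Action: '(.*)'\s*Next Page State:\s*$", re.DOTALL)


def messages_to_transitions(messages: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Turn one sample's messages into (state, action, next_state) triples."""
    turns = [m for m in messages if m["role"] != "system"]
    transitions = []
    state = None
    for user, assistant in zip(turns[0::2], turns[1::2]):
        if user["role"] != "user" or assistant["role"] != "assistant":
            raise ValueError("expected alternating user/assistant turns")
        if state is None:
            # Only the first user turn carries the initial page state.
            first = _INITIAL_RE.search(user["content"])
            if first is None:
                raise ValueError("first user turn has no initial page state")
            state = first.group(1)
        found = _ACTION_RE.search(user["content"])
        if found is None:
            raise ValueError("user turn has no quoted action")
        next_state = assistant["content"]
        transitions.append((state, found.group(1), next_state))
        state = next_state  # later turns continue from the predicted state
    return transitions


def parse_action(action: str) -> Tuple[str, List[Any]]:
    """Split an action such as click([32]) into its name and positional arguments.

    The string is parsed as a Python expression but never executed;
    keyword arguments, if any, are ignored in this sketch.
    """
    call = ast.parse(action, mode="eval").body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not a simple function call: {action!r}")
    return call.func.id, [ast.literal_eval(arg) for arg in call.args]


# A shortened version of the sample from the Data Format section.
sample = [
    {"role": "system", "content": "You are a web world model."},
    {"role": "user", "content": "Initial Page State:\nRootWebArea 'Example Site'\n\nFirst Action: 'click([32])'\n\nNext Page State:"},
    {"role": "assistant", "content": "RootWebArea 'Example Site - News'"},
    {"role": "user", "content": "Continue the trajectory.\n\nAction: 'fill([19], \"weather today\")'\n\nNext Page State:"},
    {"role": "assistant", "content": "RootWebArea 'Example Site - News'\n\t[19] textbox"},
]

for state, action, next_state in messages_to_transitions(sample):
    print(parse_action(action))
```

Parsing with `ast` rather than `eval` means action strings from the dataset are only inspected, never run, which is the safer choice for data scraped from the open web.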


Provider: maas

Created: 2026-04-14
