WebWorldData
Source: ModelScope community. Updated 2026-05-14; indexed 2026-05-10.
Download link: https://modelscope.cn/datasets/Qwen/WebWorldData
# WebWorldData 🌐
- [License](https://opensource.org/licenses/LICENSE-2.0)
- [GitHub: QwenLM/WebWorld](https://github.com/QwenLM/WebWorld)
- Dataset: [Hugging Face](https://huggingface.co/datasets/Qwen/WebWorldData), [ModelScope](https://modelscope.cn/datasets/Qwen/WebWorldData)
- WebWorld-8B: [Hugging Face](https://huggingface.co/Qwen/WebWorld-8B), [ModelScope](https://modelscope.cn/models/Qwen/WebWorld-8B)
- WebWorld-14B: [Hugging Face](https://huggingface.co/Qwen/WebWorld-14B), [ModelScope](https://modelscope.cn/models/Qwen/WebWorld-14B)
- WebWorld-32B: [Hugging Face](https://huggingface.co/Qwen/WebWorld-32B), [ModelScope](https://modelscope.cn/models/Qwen/WebWorld-32B)
## Overview
**WebWorldData** is a large-scale dataset of **1.06M web interaction trajectories** collected from the open web, designed for training browser world models. It is the training data behind the [WebWorld](https://github.com/QwenLM/WebWorld) model series.
Each trajectory is a sequence of `(state, action, next_state)` transitions, where each state is an accessibility (A11y) tree extracted from a real website using Playwright.
## Dataset Statistics
| Statistic | Value |
|---|---|
| **Total Trajectories** | 1,059,348 |
| **Total Size** | ~50.9 GB |
| **Languages** | English, Chinese |
| **Max Context Length** | 30K tokens |
| **Max Trajectory Turns** | 30+ steps |
| **State Format** | A11y Tree (primary), HTML, XML, Markdown, Natural Language |
| **Source Websites** | 680K+ URLs from FineWeb, CCI 3.0, and curated lists |
## Data Sources
The dataset is collected through a **scalable hierarchical pipeline**:
| Source | Strategy | Scale | Description |
|---|---|---|---|
| **Level 1: Randomized Crawling** | Rule-based crawlers | 293K | Randomized exploration on websites from pre-training corpora (FineWeb, CCI 3.0), aligned with the base model's linguistic priors |
| **Level 2: Autonomous Exploration** | LLM-driven agents | 38K | Agents autonomously explore websites by generating their own objectives, producing long-horizon trajectories up to 30 steps |
| **Level 3: Task-Oriented Execution** | Synthetic tasks | 94K | Agents execute synthesized web tasks through seed extraction, diversification, and paraphrasing |
| **Open Source** | AgentTrek, etc. | 38K | Reformatted open-source agent trajectories |
| **Multi-Format** | Format conversion | 48K | Trajectories converted to HTML, XML, Markdown, Playwright formats |
| **Interaction** | General QA + chat | 548K | General instruction-following and QA data to preserve conversational abilities |
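As a quick consistency check, the per-source scales in the table can be summed and compared with the reported total. The sketch below only re-enters the rounded figures from the table; the dictionary and variable names are illustrative and not part of the dataset.

```python
# Scale per source in thousands of samples, copied from the table above.
SOURCE_SCALE_K = {
    "Level 1: Randomized Crawling": 293,
    "Level 2: Autonomous Exploration": 38,
    "Level 3: Task-Oriented Execution": 94,
    "Open Source": 38,
    "Multi-Format": 48,
    "Interaction": 548,
}

# "Total Trajectories" from the Dataset Statistics table.
REPORTED_TOTAL = 1_059_348

total_k = sum(SOURCE_SCALE_K.values())
shares = {name: k / total_k for name, k in SOURCE_SCALE_K.items()}


def summarize() -> None:
    """Print each source with its rounded scale and share of the sum."""
    for name, k in SOURCE_SCALE_K.items():
        print(f"{name:<34} {k:>4}K  {shares[name]:6.1%}")
    print(f"{'Sum of rounded scales':<34} {total_k:>4}K")
    print(f"{'Reported total':<34} {REPORTED_TOTAL:,}")


summarize()
```

The rounded scales sum to 1,059K, which agrees with the reported 1,059,348 to within rounding. General interaction data accounts for slightly more than half of all samples.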
## Data Format
Each sample is a multi-turn conversation in JSONL format:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a web world model. I will provide you with an initial page state and a sequence of actions. For each action, predict the resulting page state.\nStrictly maintain the original format. Output only the full page state without explanations, code, or truncation."
    },
    {
      "role": "user",
      "content": "Initial Page State:\nRootWebArea 'Example Site'\n\t[1] banner ...\n\nFirst Action: 'click([32])'\n\nNext Page State:"
    },
    {
      "role": "assistant",
      "content": "RootWebArea 'Example Site - News'\n\t[1] banner ...\n\t[50] main ..."
    },
    {
      "role": "user",
      "content": "Continue the trajectory. Given the previous state, predict the next page state after this action.\n\nAction: 'fill([19], \"weather today\")'\n\nNext Page State:"
    },
    {
      "role": "assistant",
      "content": "RootWebArea 'Example Site - News'\n\t[1] banner ...\n\t[19] textbox ..., value='weather today' ..."
    }
  ]
}
```
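A conversation in this layout can be unrolled back into `(state, action, next_state)` transitions. The following is a minimal sketch that assumes every record follows the prompt wording of the sample above (`Initial Page State:`, `First Action:`, `Action:`, `Next Page State:`); the `Transition` class, the regular expressions, and `iter_transitions` are hypothetical helpers, and records with other wording would need different patterns.

```python
import json
import re
from typing import Iterator, NamedTuple


class Transition(NamedTuple):
    state: str
    action: str
    next_state: str


# Prompt layout assumed from the sample above; real records may differ.
_INITIAL_RE = re.compile(r"Initial Page State:\n(.*?)\n\nFirst Action:", re.DOTALL)
_ACTION_RE = re.compile(r"Action: '(.*)'\s*Next Page State:\s*$", re.DOTALL)


def iter_transitions(sample: dict) -> Iterator[Transition]:
    """Yield one Transition per (user action, assistant state) pair."""
    state = None
    action = None
    for message in sample["messages"]:
        role, content = message["role"], message["content"]
        if role == "user":
            initial = _INITIAL_RE.search(content)
            if initial:
                state = initial.group(1)
            found = _ACTION_RE.search(content)
            action = found.group(1) if found else None
        elif role == "assistant" and state is not None and action is not None:
            yield Transition(state, action, content)
            # The predicted page becomes the starting state of the next turn.
            state, action = content, None


# One JSONL line, abridged from the sample above.
line = json.dumps({"messages": [
    {"role": "system", "content": "You are a web world model."},
    {"role": "user", "content": "Initial Page State:\nRootWebArea 'Example Site'\n\t[1] banner ...\n\nFirst Action: 'click([32])'\n\nNext Page State:"},
    {"role": "assistant", "content": "RootWebArea 'Example Site - News'\n\t[1] banner ...\n\t[50] main ..."},
    {"role": "user", "content": "Continue the trajectory. Given the previous state, predict the next page state after this action.\n\nAction: 'fill([19], \"weather today\")'\n\nNext Page State:"},
    {"role": "assistant", "content": "RootWebArea 'Example Site - News'\n\t[1] banner ...\n\t[19] textbox ..., value='weather today' ..."},
]})

transitions = list(iter_transitions(json.loads(line)))
for t in transitions:
    print(t.action)
```

Each assistant message is used twice: as the `next_state` of one transition and as the `state` of the following one, which mirrors how the multi-turn prompt chains predictions.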
## Domain Distribution
The dataset covers diverse web domains:
| Domain | Share |
|---|---|
| Technology | 15.2% |
| E-Commerce / Shopping | 13.8% |
| News & Media | 12.1% |
| Education | 10.5% |
| Entertainment | 9.3% |
| Lifestyle | 8.7% |
| Business & Finance | 7.9% |
| Government & Public Services | 6.4% |
| Health | 5.8% |
| Other | 10.3% |
## Action Space
Trajectories use a unified action space, with each action written as a Python-style function call:
| Category | Actions |
|---|---|
| **Element** | `click`, `fill`, `select_option`, `hover` |
| **Mouse** | `mouse_move`, `mouse_click`, `mouse_down`, `mouse_up` |
| **Keyboard** | `keyboard_press`, `keyboard_type` |
| **Browser** | `scroll`, `goto`, `go_back`, `go_forward`, `tab_new`, `tab_close`, `tab_focus` |
| **Meta** | `send_msg_to_user`, `noop`, `infeasible` |
Action distribution: Element interactions (83.4%), Browser & navigation (11.9%), Meta & control (2.4%), Keyboard (1.2%), Coordinate & mouse (1.1%).
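Because actions are written as Python-style calls, they can be parsed with the standard `ast` module without executing anything. The sketch below is one possible approach: the action names and categories come from the table above, while `parse_action` and its return shape are illustrative. The card does not specify argument signatures, so the parser accepts any literal arguments.

```python
import ast

# Action names grouped by category, copied from the table above.
ACTION_CATEGORIES = {
    "Element": ["click", "fill", "select_option", "hover"],
    "Mouse": ["mouse_move", "mouse_click", "mouse_down", "mouse_up"],
    "Keyboard": ["keyboard_press", "keyboard_type"],
    "Browser": ["scroll", "goto", "go_back", "go_forward",
                "tab_new", "tab_close", "tab_focus"],
    "Meta": ["send_msg_to_user", "noop", "infeasible"],
}
CATEGORY_OF = {name: cat for cat, names in ACTION_CATEGORIES.items() for name in names}


def parse_action(text: str) -> tuple[str, list, dict]:
    """Split an action string into (name, positional args, keyword args)."""
    node = ast.parse(text.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple function call: {text!r}")
    name = node.func.id
    if name not in CATEGORY_OF:
        raise ValueError(f"unknown action: {name!r}")
    # literal_eval only accepts literals, so no code from the string is run.
    args = [ast.literal_eval(arg) for arg in node.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, args, kwargs


name, args, kwargs = parse_action('fill([19], "weather today")')
print(CATEGORY_OF[name], name, args, kwargs)
```

Note that an element reference such as `[32]` parses as a one-item Python list, so `click([32])` yields the positional arguments `[[32]]`.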
## Filtering & Safety
The dataset undergoes rigorous multi-stage filtering:
1. **Rule-based filtering**: Website reachability checks, banned keyword filtering (pornography, gambling, violence), trajectory pruning for no-op transitions
2. **LLM-based URL filtering**: Each URL scored across accessibility, content suitability, interactivity, and engineering quality
3. **Trajectory-level filtering**: Max 30K tokens, max 30 turns, keyword safety checks
All data is collected from publicly accessible webpages in compliance with `robots.txt` protocols.
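The trajectory-level limits can be illustrated with a small filter over samples in the `messages` format. This is a sketch under stated assumptions, not the pipeline the authors used: a "turn" is counted as one assistant message, tokens are approximated by whitespace splitting (a real pipeline would presumably use the model tokenizer), and the banned-keyword list is a placeholder.

```python
from typing import Callable, Sequence

MAX_TOKENS = 30_000  # "Max 30K tokens" from the list above
MAX_TURNS = 30       # "max 30 turns" from the list above

# Placeholder only; the actual keyword list is not published in this card.
BANNED_KEYWORDS = ("placeholder-banned-term",)


def approx_token_count(text: str) -> int:
    """Rough stand-in for a tokenizer: count whitespace-separated pieces."""
    return len(text.split())


def keep_trajectory(
    sample: dict,
    count_tokens: Callable[[str], int] = approx_token_count,
    banned: Sequence[str] = BANNED_KEYWORDS,
) -> bool:
    """Return True if the sample passes the turn, token, and keyword checks."""
    messages = sample["messages"]
    # Assumption: one assistant message equals one turn.
    turns = sum(1 for m in messages if m["role"] == "assistant")
    if turns > MAX_TURNS:
        return False
    text = "\n".join(m["content"] for m in messages)
    if count_tokens(text) > MAX_TOKENS:
        return False
    lowered = text.lower()
    return not any(keyword.lower() in lowered for keyword in banned)


def make_sample(num_turns: int, state: str = "RootWebArea 'Example Site'") -> dict:
    """Build a synthetic sample with the given number of turns (for the demo)."""
    messages = [{"role": "system", "content": "You are a web world model."}]
    for _ in range(num_turns):
        messages.append({"role": "user", "content": "Action: 'noop()'"})
        messages.append({"role": "assistant", "content": state})
    return {"messages": messages}


print(keep_trajectory(make_sample(3)))
print(keep_trajectory(make_sample(31)))
```

Passing `count_tokens` as a parameter keeps the sketch independent of any particular tokenizer; a tokenizer-backed function can be swapped in without changing the filter.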
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("Qwen/WebWorldData")
```
## Intended Use
- Training browser world models for web simulation
- Generating synthetic trajectories for web agent fine-tuning
- Research on world modeling, environment simulation, and agent learning
## Limitations
- Data is collected from publicly accessible webpages; residual PII may exist despite filtering
- Web content is inherently non-deterministic (ads, A/B tests, dynamic widgets), so some trajectories may not be perfectly reproducible
- Domain distribution reflects the composition of FineWeb and CCI 3.0 pre-training corpora
## Associated Models
| Model | Link |
|---|---|
| WebWorld-8B | [🤗 HuggingFace](https://huggingface.co/Qwen/WebWorld-8B) |
| WebWorld-14B | [🤗 HuggingFace](https://huggingface.co/Qwen/WebWorld-14B) |
| WebWorld-32B | [🤗 HuggingFace](https://huggingface.co/Qwen/WebWorld-32B) |
## Citation
```bibtex
@misc{xiao2026webworldlargescaleworldmodel,
title={WebWorld: A Large-Scale World Model for Web Agent Training},
author={Zikai Xiao and Jianhong Tu and Chuhang Zou and Yuxin Zuo and Zhi Li and Peng Wang and Bowen Yu and Fei Huang and Junyang Lin and Zuozhu Liu},
year={2026},
eprint={2602.14721},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.14721},
}
```
## License
This dataset is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
Provider: maas
Created: 2026-04-14



