WebExplorer-QA
Source: ModelScope community. Updated 2026-01-07; indexed 2025-11-29.
Download link:
https://modelscope.cn/datasets/hkust-nlp/WebExplorer-QA
# WebExplorer-QA Dataset
[Paper](https://huggingface.co/papers/2509.06501)
[arXiv](https://arxiv.org/abs/2509.06501)
[License](LICENSE)
[GitHub](https://github.com/hkust-nlp/WebExplorer)
## Dataset Description
WebExplorer-QA is a challenging web navigation dataset for training long-horizon web agents, introduced in the paper "WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents". The dataset is constructed through a novel two-stage approach: model-based exploration followed by iterative query evolution.
## Dataset Construction
### Stage 1: Model-Based Exploration
- Starting from seed entities collected from Wikipedia
- Iterative search and browsing actions to construct information spaces
- Initial QA pair generation requiring multi-website reasoning
### Stage 2: Iterative Query Evolution
- Long-to-short evolution by removing salient information
- Strategic obfuscation of dates, locations, and proper names
- 5 iterations of evolution to increase difficulty
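The control flow of Stage 2 can be illustrated with a minimal sketch. This is not the authors' implementation: in the paper the rewriting is model-driven, whereas here `obfuscate_one_year` is a hypothetical regex-based stand-in, and `evolve_query`, its parameters, and the sample query are all invented for illustration. Only the five-iteration loop mirrors the description above.

```python
import re
from typing import Callable, List


def evolve_query(query: str, step: Callable[[str], str], iterations: int = 5) -> List[str]:
    """Apply `step` repeatedly; return the original query plus every evolved version."""
    history = [query]
    for _ in range(iterations):
        history.append(step(history[-1]))
    return history


def obfuscate_one_year(query: str) -> str:
    """Toy stand-in for a model rewrite: blur the first explicit four-digit year."""
    return re.sub(r"\b(1[0-9]{3}|20[0-9]{2})\b", "a certain year", query, count=1)


if __name__ == "__main__":
    # Invented example query, not taken from the dataset.
    seed = "Which author born in 1950 published a novel in 1987?"
    for i, q in enumerate(evolve_query(seed, obfuscate_one_year)):
        print(i, q)
```

Passing the rewrite step as a callable keeps the loop independent of how a single evolution step is carried out, so a model-backed rewriter could be substituted for the toy function.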
## Data Format
Each sample contains:
```json
{
"query": "",
"answer": "",
"id": ""
}
```
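As a minimal sketch of how records in this format might be read and checked, the snippet below parses samples and verifies that each has the three fields shown above. The on-disk layout is not stated here, so the one-JSON-object-per-line (JSON Lines) storage is an assumption, and `parse_samples` and `REQUIRED_FIELDS` are hypothetical names.

```python
import json
from typing import Dict, Iterable, List

# Field names taken from the sample format above.
REQUIRED_FIELDS = ("query", "answer", "id")


def parse_samples(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse JSON Lines text into samples, rejecting records with missing fields."""
    samples = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        # Normalize every value to a string, matching the format shown above.
        samples.append({f: str(record[f]) for f in REQUIRED_FIELDS})
    return samples
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `parse_samples(open(path, encoding="utf-8"))`, where `path` points to wherever the released samples were downloaded.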
## 📝 Citation
If you find our work useful, please consider citing:
```bibtex
@misc{liu2025webexplorer,
title={WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents},
author={Junteng Liu and Yunji Li and Chi Zhang and Jingyang Li and Aili Chen and Ke Ji and Weiyu Cheng and Zijia Wu and Chengyu Du and Qidi Xu and Jiayuan Song and Zhengmao Zhu and Wenhu Chen and Pengyu Zhao and Junxian He},
year={2025},
eprint={2509.06501},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.06501},
}
```
**Note:** Due to company policy, only 100 high-quality samples of WebExplorer-QA are released for academic research and community testing. The full dataset is not publicly available at this time.
Provider: maas
Created: 2025-09-09



