WebShaper
Source: ModelScope community. Updated 2026-01-07; indexed 2025-07-26.
Download link:
https://modelscope.cn/datasets/iic/WebShaper
# WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

**Github:** https://github.com/Alibaba-NLP/WebAgent
**Paper:** https://arxiv.org/pdf/2507.15061
## TL;DR
WebShaper is a synthesized training dataset for information-seeking (IS) tasks. It is built on our proposed formalization of IS tasks and synthesized by our Expander Agent. WebShaper is designed to cover a broader range of task forms, reasoning structures, and knowledge.
## Description
The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities.
However, the scarcity of high-quality training data has limited the development of IS agents. Existing data synthesis approaches typically adopt an information-driven paradigm: they first collect web data and then generate questions from the retrieved content. This can lead to inconsistency between the information structure and the reasoning structure, as well as between a question and its answer.
To mitigate these inconsistencies, we propose WebShaper, a formalization-driven IS data synthesis framework that systematically formalizes IS tasks using set-theoretic constructs. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure through compositions of KP operations. During synthesis, we begin by creating seed tasks and then apply a multi-step expansion process. At each step, an agentic Expander makes the current formal question more complex, using retrieval and validation tools guided by our formalization.
We currently release 500 examples. More data is coming soon!
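To give a rough feel for how set operations can shape the reasoning structure of a question, here is a toy sketch. It is not the paper's definition of a Knowledge Projection: the `project` helper, the relation names, and the facts are all hypothetical, and we simply assume a projection can be modeled as "every entity related to some member of a target set". See the paper for the actual formalization.

```python
from typing import Set, Tuple

# Assumption: a relation is modeled as a set of (subject, object) pairs.
Relation = Set[Tuple[str, str]]


def project(relation: Relation, targets: Set[str]) -> Set[str]:
    """Return every subject related to at least one object in `targets`."""
    return {subject for (subject, obj) in relation if obj in targets}


# Hypothetical toy facts, invented for illustration only.
born_in: Relation = {("alice", "1990"), ("bob", "1991"), ("carol", "1990")}
plays_for: Relation = {("alice", "team_x"), ("bob", "team_x"), ("carol", "team_y")}

# "Who was born in 1990 AND plays for team_x?" -> intersection of two projections.
born_1990_on_team_x = project(born_in, {"1990"}) & project(plays_for, {"team_x"})

# "Who was born in 1990 OR 1991?" -> projection over a larger target set,
# which equals the union of the two single-year projections.
born_1990_or_1991 = project(born_in, {"1990", "1991"})
```

In this toy model, adding another intersected projection narrows the answer set, which is one way a question could be made more complex step by step.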
## How to use
Data fields:
● id: Unique ID of each example.
● question: Synthesized question in natural language.
● formalization: Formalization of the question in our list representation.
● answer: Answer to the question.
● urls: All URLs of the information retrieved and used for the question.
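The fields above can be checked with a small reader like the following. This is a minimal sketch that assumes the data is stored as JSON Lines with one object per line; the on-disk format, the `WebShaperRecord` class, and the helper names are assumptions rather than part of the official release.

```python
import json
from dataclasses import dataclass
from typing import Any, Iterable, List

# Field names come from the list above.
REQUIRED_FIELDS = ("id", "question", "formalization", "answer", "urls")


@dataclass
class WebShaperRecord:
    id: Any
    question: str
    formalization: Any  # list representation; its inner schema is not assumed here
    answer: str
    urls: List[str]


def parse_record(line: str) -> WebShaperRecord:
    """Parse one JSON line and fail loudly if a documented field is missing."""
    obj = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in obj]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return WebShaperRecord(**{name: obj[name] for name in REQUIRED_FIELDS})


def load_records(lines: Iterable[str]) -> List[WebShaperRecord]:
    """Parse every non-empty line, e.g. from an open file handle."""
    return [parse_record(line) for line in lines if line.strip()]
```

For a local file, `load_records(open(path, encoding="utf-8"))` would return one record per line, where `path` is wherever you saved the downloaded data.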
## 🚩 Citation
If this work is helpful, please kindly cite as:
```bibtex
@misc{tao2025webshaperagenticallydatasynthesizing,
title={WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization},
author={Zhengwei Tao and Jialong Wu and Wenbiao Yin and Junkai Zhang and Baixuan Li and Haiyang Shen and Kuan Li and Liwen Zhang and Xinyu Wang and Yong Jiang and Pengjun Xie and Fei Huang and Jingren Zhou},
year={2025},
eprint={2507.15061},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.15061},
}
```
Provider: maas
Created: 2025-07-16



