WebLeaper
Source: ModelScope community (魔搭社区) · Updated: 2026-01-04 · Indexed: 2026-01-03
Download link:
https://modelscope.cn/datasets/iic/WebLeaper
Overview:
## 📦 Dataset & Task Design
WebLeaper constructs information-seeking (IS) tasks from curated Wikipedia tables and cross-table unions:
* **Tree-Structured IS:**
  * Root (question entity) $\rightarrow$ 2nd layer (key entities) $\rightarrow$ 3rd layer (attributes/linked entities)
  * Each 2nd-layer node and its attributes form a subtree; tasks require retrieving both the final and the intermediate entities.
* **Variants**
  1. **`Basic`:** Build a single-source tree from one well-formed table, giving dense targets in a constrained context.
  2. **`Union`:** Detect maximal unions among trees that share relations (modeled as maximal biclique enumeration) to create multi-source synthesis questions.
  3. **`Reverse-Union`:** Provide attribute-level clues from which a hidden anchor entity must first be deduced, then pivot on one of its attributes (e.g., nationality) to launch a union-style search.
> **Result:** Tasks that reward efficient exploration, resist keyword shortcuts, and stabilize metric estimation as the target count ($n$) grows.
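
The three-layer layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `ISTree` and `Subtree`, the helper `build_basic_tree`, and the sample rows are hypothetical and do not come from the WebLeaper code base. The table is assumed to be a list of row dictionaries with one key column.

```python
from dataclasses import dataclass, field


@dataclass
class Subtree:
    """One 2nd-layer key entity plus its 3rd-layer attributes (hypothetical type)."""
    key_entity: str
    attributes: dict = field(default_factory=dict)


@dataclass
class ISTree:
    """Root question entity with one subtree per table row (hypothetical type)."""
    root: str
    subtrees: list = field(default_factory=list)

    def targets(self) -> set:
        """Collect intermediate (key) and final (attribute) entities."""
        found = set()
        for sub in self.subtrees:
            found.add(sub.key_entity)
            found.update(sub.attributes.values())
        return found


def build_basic_tree(title, rows, key_column):
    """Assumed construction: title -> root, key column -> 2nd layer, rest -> 3rd layer."""
    tree = ISTree(root=title)
    for row in rows:
        attrs = {col: val for col, val in row.items() if col != key_column}
        tree.subtrees.append(Subtree(key_entity=row[key_column], attributes=attrs))
    return tree


# Made-up rows standing in for a well-formed Wikipedia table.
rows = [
    {"laureate": "Author A", "nationality": "Country X", "year": "2001"},
    {"laureate": "Author B", "nationality": "Country Y", "year": "2002"},
]
tree = build_basic_tree("Prize A laureates", rows, key_column="laureate")
print(sorted(tree.targets()))
```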
-----
## 📐 Metrics: Measuring Coverage & Efficiency
* **Information-Seeking Rate (ISR):** Fraction of required entities retrieved, where $R$ is the set of required entities and $O$ is the set of entities the agent obtained.
$$
\mathrm{ISR} = \frac{|R\cap O|}{|R|}
$$
* **Information-Seeking Efficiency (ISE):** Target entities discovered per action step, where $n$ is the number of target entities and $T$ is the number of action steps.
$$
\mathrm{ISE} = \frac{n}{T}
$$
* **Stability:** As the number of targets $n$ increases, $\mathrm{Var}(\mathrm{ISE}) = \mathcal{O}(1/n)$, yielding reliable efficiency signals during training.
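
As a rough illustration of the two formulas above, the following Python sketch computes ISR and ISE for a single episode. The function names `isr` and `ise` and the sample entity sets are assumptions made for this example; $n$ is read here as the number of target entities and $T$ as the number of action steps the agent took.

```python
def isr(required, obtained):
    """ISR = |R ∩ O| / |R| for required set R and obtained set O."""
    required, obtained = set(required), set(obtained)
    if not required:
        raise ValueError("the required set R must be non-empty")
    return len(required & obtained) / len(required)


def ise(n_targets, steps):
    """ISE = n / T, with n target entities and T action steps (assumed reading)."""
    if steps <= 0:
        raise ValueError("the number of action steps T must be positive")
    return n_targets / steps


# Made-up episode: four required entities, two of them found, eight steps taken.
required = {"Author A", "Author B", "Country X", "Country Y"}
obtained = {"Author A", "Country X", "Unrelated Z"}

coverage = isr(required, obtained)          # 2 of 4 required entities -> 0.5
efficiency = ise(len(required), steps=8)    # 4 targets over 8 steps -> 0.5
print(coverage, efficiency)
```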
-----
## 🔍 Method Details
1. **`Basic` (Single-Source, Dense)**
   * Mine large, homogeneous Wikipedia tables.
   * Take the root from the table title; primary-key columns become 2nd-layer nodes and the remaining columns become 3rd-layer attributes.
   * Build compact, high-coverage tasks that maximize valid actions.
2. **`Union` (Multi-Source, Structured)**
   * Identify maximal unions between trees that share relation sets (e.g., `has_nationality`, `has_name`).
   * Synthesize questions that require intersection/union across sources (e.g., “authors who won both Prize A and Prize B”).
3. **`Reverse-Union` (Deduction $\rightarrow$ Expansion)**
   * Provide fuzzy attribute-level clues that force the anchor entity to be deduced (no direct keywords).
   * Use a pivot attribute (e.g., country) of the deduced anchor to launch a new `Union`-style search over other trees.
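
The `Union` step above models shared relations as maximal biclique enumeration over a bipartite graph of trees and relations. The sketch below is a brute-force illustration of that idea for small inputs, not the procedure used by the authors: the function `maximal_bicliques`, its parameters, and the tree and relation names (apart from `has_name` and `has_nationality`) are hypothetical. It intersects the relation sets of every group of trees, then closes the tree side so that each reported pair cannot be extended.

```python
from itertools import combinations


def maximal_bicliques(tree_relations, min_trees=2, min_relations=1):
    """Brute-force enumeration of maximal (trees, shared relations) pairs.

    Exponential in the number of trees, so only suitable for toy inputs.
    """
    names = sorted(tree_relations)
    found = set()
    for size in range(min_trees, len(names) + 1):
        for group in combinations(names, size):
            # Relations shared by every tree in the group.
            shared = set.intersection(*(set(tree_relations[t]) for t in group))
            if len(shared) < min_relations:
                continue
            # Close the tree side: add every tree that also has all shared relations.
            closed = frozenset(t for t in names if shared <= set(tree_relations[t]))
            found.add((closed, frozenset(shared)))
    return found


# Made-up trees, each described only by the relations it exposes.
tree_relations = {
    "nobel_literature": {"has_name", "has_nationality", "has_award_year"},
    "booker_prize": {"has_name", "has_nationality", "has_title"},
    "olympic_medalists": {"has_name", "has_nationality", "has_sport"},
    "rivers": {"has_name", "has_length"},
}

bicliques = maximal_bicliques(tree_relations)
for trees, relations in sorted(bicliques, key=lambda pair: len(pair[0])):
    print(sorted(trees), sorted(relations))
```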
-----
## 🚩 Citation
If you find this work helpful, please cite it as:
```bibtex
@misc{tao2025webleaper,
title={Webleaper: Empowering efficiency and efficacy in webagent via enabling info-rich seeking},
author={Tao, Zhengwei and Shen, Haiyang and Li, Baixuan and Yin, Wenbiao and Wu, Jialong and Li, Kuan and Zhang, Zhongwang and Yin, Huifeng and Ye, Rui and Zhang, Liwen and others},
year={2025},
eprint={2510.24697},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.24697},
}
```
Provider:
maas
Created:
2025-10-28



