WebLeaper

Name: WebLeaper
Creator: maas
Published: 2026-01-04 16:54:11
License: No description available

ModelScope community: updated 2026-01-04, indexed 2026-01-03

Download link:

https://modelscope.cn/datasets/iic/WebLeaper

Resource description:

## 📦 Dataset & Task Design

WebLeaper constructs information-seeking (IS) tasks from curated Wikipedia tables and cross-table unions:

* **Tree-Structured IS:**
    * Root (question entity) $\rightarrow$ 2nd layer (key entities) $\rightarrow$ 3rd layer (attributes/linked entities)
    * Each 2nd-layer node plus its attributes forms a subtree; tasks require retrieving both final and intermediate entities.
* **Variants**
    1. **`Basic`:** Build a single-source tree from one well-formed table; dense targets in a constrained context.
    2. **`Union`:** Detect maximal unions among trees that share relations (modeled as maximal biclique enumeration) to create multi-source synthesis questions.
    3. **`Reverse-Union`:** Provide attribute-level clues to deduce a hidden anchor entity first, then pivot on one of its attributes (e.g., nationality) to launch a union-style search.

> **Result:** Tasks that reward efficient exploration, resist keyword shortcuts, and stabilize metric estimation as the target count ($n$) grows.

-----

## 📐 Metrics: Measuring Coverage & Efficiency

* **Information-Seeking Rate (ISR):** Fraction of required entities retrieved, where $R$ is the set of required entities and $O$ is the set of entities the agent obtained.
  $$ \mathrm{ISR} = \frac{|R\cap O|}{|R|} $$
* **Information-Seeking Efficiency (ISE):** Target entities discovered per action step, where $n$ is the number of targets and $T$ is the number of action steps.
  $$ \mathrm{ISE} = \frac{n}{T} $$
* **Stability:** As the number of targets $n$ increases, $\mathrm{Var}(\mathrm{ISE}) = \mathcal{O}(1/n)$, yielding reliable efficiency signals during training.

-----

## 🔍 Method Details

1. **`Basic` (Single-Source, Dense)**
    * Mine large, homogeneous Wikipedia tables.
    * Root from table title; primary key columns $\rightarrow$ 2nd-layer; other columns $\rightarrow$ 3rd-layer attributes.
    * Build compact, high-coverage tasks that maximize valid actions.
2. **`Union` (Multi-Source, Structured)**
    * Identify maximal unions between trees sharing relation sets (e.g., `has_nationality`, `has_name`).
    * Synthesize questions that require intersection/union across sources (e.g., "authors who won both Prize A and Prize B").
3. **`Reverse-Union` (Deduction $\rightarrow$ Expansion)**
    * Provide fuzzed clues at the attribute level to force anchor deduction (no direct keywords).
    * Use a pivot attribute (e.g., country) from the deduced anchor to launch a new `Union`-style search over other trees.

-----

## 🚩 Citation

If this work is helpful, please kindly cite as:

```bibtex
@misc{tao2025webleaper,
  title={Webleaper: Empowering efficiency and efficacy in webagent via enabling info-rich seeking},
  author={Tao, Zhengwei and Shen, Haiyang and Li, Baixuan and Yin, Wenbiao and Wu, Jialong and Li, Kuan and Zhang, Zhongwang and Yin, Huifeng and Ye, Rui and Zhang, Liwen and others},
  year={2025},
  eprint={2510.24697},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.24697},
}
```
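The ISR and ISE formulas from the metrics section can be checked with a few lines of Python. This is a minimal sketch, not code from the WebLeaper release: the function names, the example entity sets, and the step count are all hypothetical, and it assumes that $n$ in ISE is the number of target entities and $T$ the number of action steps, as the definitions suggest.

```python
def information_seeking_rate(required, obtained):
    """ISR = |R ∩ O| / |R|: fraction of required entities that were retrieved."""
    required, obtained = set(required), set(obtained)
    if not required:
        raise ValueError("the required set R must be non-empty")
    return len(required & obtained) / len(required)


def information_seeking_efficiency(n_targets, n_steps):
    """ISE = n / T: target entities per action step."""
    if n_steps <= 0:
        raise ValueError("the step count T must be positive")
    return n_targets / n_steps


# Hypothetical example: four required entities, three retrieved (one irrelevant).
R = {"Entity A", "Entity B", "Entity C", "Entity D"}
O = {"Entity A", "Entity C", "Entity Z"}

isr = information_seeking_rate(R, O)              # 2 of 4 required entities found
ise = information_seeking_efficiency(len(R), 8)   # 4 targets over 8 assumed steps
```

Entities in `O` that are not in `R` (here `"Entity Z"`) do not raise ISR; they only cost steps, which is what ISE penalizes.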

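The `Basic` construction described in the method details (table title as root, primary-key column as the 2nd layer, remaining columns as 3rd-layer attributes) can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' pipeline: `build_basic_tree`, `target_entities`, the dictionary layout, and the placeholder table are all hypothetical, and a single primary-key column is assumed.

```python
def build_basic_tree(title, header, rows, key_column):
    """Turn one table into a three-layer tree (hypothetical layout).

    root      -> table title (question entity)
    2nd layer -> values of the primary-key column (key entities)
    3rd layer -> remaining columns of each row (attributes)
    """
    key_idx = header.index(key_column)
    tree = {"root": title, "children": {}}
    for row in rows:
        attributes = {
            col: val
            for i, (col, val) in enumerate(zip(header, row))
            if i != key_idx
        }
        tree["children"][row[key_idx]] = attributes
    return tree


def target_entities(tree):
    """Collect intermediate (2nd-layer) and final (3rd-layer) entities."""
    targets = set(tree["children"])
    for attributes in tree["children"].values():
        targets.update(attributes.values())
    return targets


# Placeholder table; the names and values are illustrative only.
header = ["author", "nationality", "year"]
rows = [
    ("Author A", "Country X", "1990"),
    ("Author B", "Country Y", "1995"),
    ("Author C", "Country X", "2001"),
]
tree = build_basic_tree("Winners of Prize A", header, rows, key_column="author")
targets = target_entities(tree)
```

Each entry of `tree["children"]` is one subtree in the sense used above, and `targets` plays the role of the required set $R$ when scoring an agent's answer.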

Provider: maas

Created: 2025-10-28
