
WebLeaper

ModelScope community · Updated 2026-01-04 · Indexed 2026-01-03
Download link:
https://modelscope.cn/datasets/iic/WebLeaper
Description:
## 📦 Dataset & Task Design

WebLeaper constructs information-seeking (IS) tasks from curated Wikipedia tables and cross-table unions:

* **Tree-Structured IS:**
  * Root (question entity) $\rightarrow$ 2nd layer (key entities) $\rightarrow$ 3rd layer (attributes/linked entities)
  * Each 2nd-layer node + its attributes forms a subtree; tasks require retrieving final and intermediate entities.
* **Variants**
  1. **`Basic`:** Build a single-source tree from one well-formed table; dense targets in a constrained context.
  2. **`Union`:** Detect maximal unions among trees that share relations (modeled as maximal biclique enumeration) to create multi-source synthesis questions.
  3. **`Reverse-Union`:** Provide attribute-level clues to deduce a hidden anchor entity first, then pivot (e.g., nationality) to launch a union-style search.

> **Result:** Tasks that reward efficient exploration, resist keyword shortcuts, and stabilize metric estimation as the target count ($n$) grows.

-----

## 📐 Metrics: Measuring Coverage & Efficiency

* **Information-Seeking Rate (ISR):** Fraction of required entities retrieved.
  $$ \mathrm{ISR} = \frac{|R\cap O|}{|R|} $$
* **Information-Seeking Efficiency (ISE):** Target entities discovered per action step.
  $$ \mathrm{ISE} = \frac{n}{T} $$
* **Stability:** As the number of targets $n$ increases, $\mathrm{Var}(\mathrm{ISE}) = \mathcal{O}(1/n)$, yielding reliable efficiency signals during training.

-----

## 🔍 Method Details

1. **`Basic` (Single-Source, Dense)**
   * Mine large, homogeneous Wikipedia tables.
   * Root from table title; primary key columns $\rightarrow$ 2nd-layer; other columns $\rightarrow$ 3rd-layer attributes.
   * Build compact, high-coverage tasks that maximize valid actions.
2. **`Union` (Multi-Source, Structured)**
   * Identify maximal unions between trees sharing relation sets (e.g., `has_nationality`, `has_name`).
   * Synthesize questions that require intersection/union across sources (e.g., “authors who won both Prize A and Prize B”).
3. **`Reverse-Union` (Deduction $\rightarrow$ Expansion)**
   * Provide fuzzed clues at the attribute level to force anchor deduction (no direct keywords).
   * Use a pivot attribute (e.g., country) from the deduced anchor to launch a new `Union`-style search over other trees.

-----

## 🚩 Citation

If this work is helpful, please kindly cite as:

```bibtex
@misc{tao2025webleaper,
  title={Webleaper: Empowering efficiency and efficacy in webagent via enabling info-rich seeking},
  author={Tao, Zhengwei and Shen, Haiyang and Li, Baixuan and Yin, Wenbiao and Wu, Jialong and Li, Kuan and Zhang, Zhongwang and Yin, Huifeng and Ye, Rui and Zhang, Liwen and others},
  year={2025},
  eprint={2510.24697},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.24697},
}
```
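The two metrics defined in the Metrics section can be illustrated with a minimal Python sketch. It assumes that $R$ is the set of required entities and $O$ the set of entities the agent actually retrieved (the card gives only the formula), and that $n$ in ISE counts the target entities found within $T$ action steps. The function names and the toy entities are hypothetical and are not part of the WebLeaper codebase.

```python
def information_seeking_rate(required, obtained):
    """ISR = |R ∩ O| / |R| (assumed: R = required set, O = retrieved set)."""
    required = set(required)
    if not required:
        raise ValueError("the required entity set R must be non-empty")
    return len(required & set(obtained)) / len(required)


def information_seeking_efficiency(num_targets_found, num_steps):
    """ISE = n / T: target entities discovered per action step."""
    if num_steps <= 0:
        raise ValueError("the number of action steps T must be positive")
    return num_targets_found / num_steps


# Toy data, invented purely for illustration.
required = {"author_a", "author_b", "author_c", "author_d"}
obtained = {"author_a", "author_c", "unrelated_entity"}

isr = information_seeking_rate(required, obtained)
ise = information_seeking_efficiency(len(required & obtained), num_steps=8)
print(f"ISR = {isr:.2f}, ISE = {ise:.2f}")
```

Entities retrieved outside $R$ (here `unrelated_entity`) do not raise ISR. The steps spent finding them still count toward $T$ and so lower ISE.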

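The `Basic` construction described in the Method Details (table title as root, primary-key column as 2nd layer, remaining columns as 3rd-layer attributes) can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation. The dictionary layout, the function names `build_basic_tree` and `collect_targets`, and the toy table are all assumptions.

```python
def build_basic_tree(title, rows, key_column):
    """Build a three-layer tree from one table (hypothetical layout).

    root      -> the table title (question entity)
    2nd layer -> values of the primary-key column (key entities)
    3rd layer -> the remaining columns of each row (attributes)
    """
    tree = {"root": title, "children": []}
    for row in rows:
        attributes = {col: val for col, val in row.items() if col != key_column}
        tree["children"].append({"entity": row[key_column], "attributes": attributes})
    return tree


def collect_targets(tree):
    """Gather intermediate (2nd-layer) and final (3rd-layer) entities."""
    targets = set()
    for child in tree["children"]:
        targets.add(child["entity"])
        targets.update(child["attributes"].values())
    return targets


# Toy table, invented purely for illustration.
rows = [
    {"laureate": "Author A", "nationality": "Country X", "year": "2001"},
    {"laureate": "Author B", "nationality": "Country Y", "year": "2005"},
]
tree = build_basic_tree("Laureates of Prize A", rows, key_column="laureate")
targets = collect_targets(tree)
print(sorted(targets))
```

Each 2nd-layer node together with its attribute dictionary corresponds to one subtree, so the size of `targets` grows with both the number of rows and the number of attribute columns.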
Provider:
maas
Created:
2025-10-28