InfoSeek

ModelScope community · updated 2026-05-17 · indexed 2025-09-13
Download link:
https://modelscope.cn/datasets/AI-ModelScope/InfoSeek
Resource description:
# InfoSeek: Open Data Synthesis For Deep Research

[Paper](https://huggingface.co/papers/2509.00375) | [Code](https://github.com/VectorSpaceLab/InfoSeek)

## Dataset Information

* **`data/InfoSeek.jsonl`**
  Contains the full research tree structures of *InfoSeek*. Each sample starts from a root node holding a research question, its corresponding entity, and process information for sub-questions (stored in `root`). The intermediate tree structure produced at each construction step is also kept (stored in `all_tree_list`). 52K samples in total.
* **`data/InfoSeekQA.jsonl`**
  A collection of QA pairs derived from *InfoSeek*. Each entry corresponds to the final question (`sample['root']['question']`) and its answer entity (`sample['root']['entity']`) in `InfoSeek.jsonl`.
* **`data/InfoSeek-Hard-18K.jsonl`**
  A challenging subset of *InfoSeek* (18K samples) that is better suited to end-to-end RL. It was identified using an LLM with a dedicated prompt for complex deep research.
* **`data/Trajectory-RFT-17K.jsonl`**
  Contains 17K reasoning trajectories generated through the workflow described in our paper. These can be used as training data for supervised fine-tuning (SFT).

## Abstract

Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research: tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning or knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blurring intermediate nodes into valid sub-problems, and converting these trees into natural language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via rejection sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. On the challenging BrowseComp-Plus benchmark, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration.

## 🔆 Overview

We propose **InfoSeek**, a scalable data synthesis framework for constructing structurally complex Deep Research tasks. InfoSeek designs a dual-agent system that recursively builds a *Research Tree* by mining entities and relations from large-scale text and blurring intermediate vertices so that they form valid sub-problems. The agent then transforms these trees into natural language questions whose solutions require traversing the entire hierarchy. Using the InfoSeek pipeline, we construct a high-quality, complexity-controllable, and intrinsically verifiable dataset.

### Example 1:

**Question:** What is a species of bird that was named by a person employed under his father between 1818 and 1824, whose wife was a British artist, and which has three subspecies and body length is generally no more than 6 inches?

**Answer:** Russet sparrow

<details>
<summary>Tree Structure</summary>

```
{
  "root": {
    "id": "A",
    "entity": "Russet sparrow",
    "question": "What is a species of bird that was named by a person employed under his father between 1818 and 1824, whose wife was a British artist, and which has three subspecies and body length is generally no more than 6 inches?",
    "claims": [
      { "target_id": "B", "claim": "A was named by B" },
      { "target_id": "C", "claim": "A has three subspecies" },
      { "target_id": "D", "claim": "A's body length is generally no more than 6 inches" }
    ],
    "children": [
      {
        "id": "B",
        "entity": "John Gould",
        "claims": [
          { "target_id": "E", "claim": "B was employed by his father between 1818 and 1824" },
          { "target_id": "F", "claim": "B's wife was F" }
        ],
        "children": [
          { "id": "E", "entity": "None", "claims": [], "children": [] },
          { "id": "F", "entity": "Elizabeth Gould", "claims": [], "children": [] }
        ]
      },
      { "id": "C", "entity": "None", "claims": [], "children": [] },
      { "id": "D", "entity": "None", "claims": [], "children": [] }
    ]
  }
}
```

```
(A: Russet sparrow)
 │
 │── [claim] "was named by" ──> (B: John Gould)
 │    │
 │    │── [claim] "was employed by his father (1818-1824)"
 │    │
 │    │── [claim] "wife was" ──> (F: Elizabeth Gould)
 │
 │── [claim] "has three subspecies"
 │
 │── [claim] "body length is generally no more than 6 inches"
```

</details>

### Example 2:

**Question:** What is a women's football team whose first goals in the 2. Bundesliga were scored by a player born in Korogocho, who was discovered and developed by the Mathare Youth Sports Association?

**Answer:** SV Werder Bremen (women)

<details>
<summary>Tree Structure</summary>

```
{
  "root": {
    "id": "A",
    "entity": "SV Werder Bremen (women)",
    "question": "What is a women's football team whose first goals in the 2. Bundesliga were scored by a player born in Korogocho, who was discovered and developed by the Mathare Youth Sports Association?",
    "claims": [
      { "target_id": "B", "claim": "A's first goals in the 2. Bundesliga were scored by B" }
    ],
    "children": [
      {
        "id": "B",
        "entity": "Doreen Nabwire",
        "claims": [
          { "target_id": "C", "claim": "B was discovered and developed by C" },
          { "target_id": "D", "claim": "B was born in D" }
        ],
        "children": [
          { "id": "C", "entity": "Mathare Youth Sports Association", "claims": [], "children": [] },
          { "id": "D", "entity": "Korogocho", "claims": [], "children": [] }
        ]
      }
    ]
  }
}
```

```
(A: SV Werder Bremen (women))
 │
 │── [claim] "first goals scored by" ──> (B: Doreen Nabwire)
      │
      │── [claim] "discovered and developed by" ──> (C: Mathare Youth Sports Association)
      │
      │── [claim] "was born in" ──> (D: Korogocho)
```

</details>

## 📊 Performance

Models trained with InfoSeek and our framework perform strongly on traditional multi-hop benchmarks:

<img src="https://github.com/VectorSpaceLab/InfoSeek/raw/main/assets/results.png" width="800">

Our 3B model shows competitive results on [BrowseComp-Plus](https://github.com/texttron/BrowseComp-Plus):

<img src="https://github.com/VectorSpaceLab/InfoSeek/raw/main/assets/browsecomp_plus.png" width="800">

## ❤️ Citing Us

If you find this repository or our work useful, please consider giving it a star ⭐ and/or citing our work, which would be greatly appreciated:

```bibtex
@misc{xia2025opendatasynthesisdeep,
      title={Open Data Synthesis For Deep Research},
      author={Ziyi Xia and Kun Luo and Hongjin Qian and Zheng Liu},
      year={2025},
      eprint={2509.00375},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.00375},
}
```
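
The Dataset Information section says each QA pair is the root question (`sample['root']['question']`) and its answer entity (`sample['root']['entity']`). The following is a minimal sketch of reading those two fields from JSONL lines. The helper name `iter_qa_pairs` and the inline demo record are illustrative assumptions, and real records may carry extra keys (such as `all_tree_list`) that this sketch ignores.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_qa_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (question, answer_entity) for each JSONL record.

    Assumes every record has a "root" object with "question" and
    "entity" keys, as described in the Dataset Information section.
    """
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        sample = json.loads(line)
        root = sample["root"]
        yield root["question"], root["entity"]


# Stand-in for one line of data/InfoSeek.jsonl, copied from Example 2.
# Assumption: real records contain at least these keys under "root".
demo_record = {
    "root": {
        "id": "A",
        "entity": "SV Werder Bremen (women)",
        "question": (
            "What is a women's football team whose first goals in the "
            "2. Bundesliga were scored by a player born in Korogocho, who "
            "was discovered and developed by the Mathare Youth Sports "
            "Association?"
        ),
        "claims": [],
        "children": [],
    }
}
demo_lines = [json.dumps(demo_record), ""]

qa_pairs = list(iter_qa_pairs(demo_lines))
for question, answer in qa_pairs:
    print(f"Q: {question}\nA: {answer}")
```

With a local copy of the dataset, the same generator can be fed an open file handle, for example `open("data/InfoSeek.jsonl", encoding="utf-8")`, since iterating over a text file yields one line at a time.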

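
The Research Tree in the examples nests `claims` and `children` under each node, and answering the question requires traversing the whole hierarchy. As a rough sketch of such a traversal, the code below walks a tree depth-first, lists every claim with its depth, and counts the number of levels. The function names `iter_claims` and `tree_depth` are hypothetical, and the node layout is assumed to match the Example 2 tree shown in the README.

```python
from typing import Iterator, Tuple


def iter_claims(node: dict, depth: int = 0) -> Iterator[Tuple[int, str, str, str]]:
    """Depth-first walk yielding (depth, source_id, target_id, claim_text)."""
    for claim in node["claims"]:
        yield depth, node["id"], claim["target_id"], claim["claim"]
    for child in node["children"]:
        yield from iter_claims(child, depth + 1)


def tree_depth(node: dict) -> int:
    """Number of node levels, counting the root as level 1."""
    return 1 + max((tree_depth(c) for c in node["children"]), default=0)


# The "root" object of Example 2 (question text omitted; it is not needed here).
example_root = {
    "id": "A",
    "entity": "SV Werder Bremen (women)",
    "claims": [
        {"target_id": "B",
         "claim": "A's first goals in the 2. Bundesliga were scored by B"}
    ],
    "children": [
        {
            "id": "B",
            "entity": "Doreen Nabwire",
            "claims": [
                {"target_id": "C", "claim": "B was discovered and developed by C"},
                {"target_id": "D", "claim": "B was born in D"},
            ],
            "children": [
                {"id": "C", "entity": "Mathare Youth Sports Association",
                 "claims": [], "children": []},
                {"id": "D", "entity": "Korogocho", "claims": [], "children": []},
            ],
        }
    ],
}

claims = list(iter_claims(example_root))
for depth, source, target, text in claims:
    print("  " * depth + f"{source} -> {target}: {text}")
print("levels:", tree_depth(example_root))
```

Keeping the walk as a generator means the same function can feed a printer, a counter, or a filter (for instance, selecting only claims whose target node has `"entity": "None"`, which in Example 1 appears on attribute-style claims with no named target).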
Provider:
maas
Created:
2025-09-08