Nanbeige/ToolMind-Web-QA

Name: Nanbeige/ToolMind-Web-QA
Creator: Nanbeige
Published: 2026-02-19 07:53:07
License: apache-2.0

Source: Hugging Face. Updated 2026-02-19; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/Nanbeige/ToolMind-Web-QA

Description:

---
license: apache-2.0
configs:
- config_name: test
task_categories:
- text-generation
language:
- en
tags:
- synthetic
- deep search
pretty_name: ToolMind-Web-QA
---

## Dataset Summary

* ToolMind-Web-QA is a validated public dataset designed for research on **search-augmented and long-horizon search agents**. The dataset contains 6k complex question-answer (QA) pairs synthesized from Wikipedia entity-relation knowledge graphs. It also includes trajectories, averaging more than 100 turns each, constructed with advanced search agents. The dataset emphasizes multi-hop reasoning, evidence-grounded answers, and search-oriented problem-solving.

## Data Construction

* **Temporal-Aware Head Entity Selection.** To ensure the timeliness and complexity of the synthesized QA data, we extract informative head entities from Wikipedia that have been updated within the past six months.
* **Question Synthesis with Random Walking.** Questions are generated by composing multi-hop paths over Wikipedia-derived entity–relation graphs and converting them into natural-language queries. All QA instances in this preview release are validated for factual consistency and answer correctness.
* **Trajectory Synthesis and Turn-level Judgment.** Trajectories are synthesized with the MiroThinker framework, with tools including Serper and Jina. The average number of iterations exceeds 100. After collecting successful trajectories, we judged and selected at the turn level, retaining only the most critical and valuable iterations for training.
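The random-walk question synthesis step described under Data Construction can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy graph, the relation names, and the helpers `random_walk` and `path_to_question` are all hypothetical, and a simple nested template stands in for whatever method the authors use to turn a path into a natural-language query.

```python
import random

# Hypothetical toy entity-relation graph: head -> [(relation, tail), ...].
# Real graphs would be derived from Wikipedia; these entities are made up.
TOY_GRAPH = {
    "Film A": [("director", "Director B")],
    "Director B": [("birthplace", "City C"), ("award", "Prize E")],
    "City C": [("country", "Country D")],
}


def random_walk(graph, head, hops, rng):
    """Walk up to `hops` edges from `head` without revisiting an entity.

    Returns a list of (head, relation, tail) triples forming a connected path.
    """
    path, current, seen = [], head, {head}
    for _ in range(hops):
        candidates = [(r, t) for r, t in graph.get(current, []) if t not in seen]
        if not candidates:
            break  # dead end: stop early with a shorter path
        relation, tail = rng.choice(candidates)
        path.append((current, relation, tail))
        seen.add(tail)
        current = tail
    return path


def path_to_question(path):
    """Compose a nested template question; the answer is the last tail entity."""
    if not path:
        raise ValueError("path must contain at least one triple")
    phrase = f'"{path[0][0]}"'
    for _, relation, _ in path:
        phrase = f"the {relation} of {phrase}"
    return f"What is {phrase}?", path[-1][2]


if __name__ == "__main__":
    walk = random_walk(TOY_GRAPH, "Film A", hops=3, rng=random.Random(0))
    question, answer = path_to_question(walk)
    print(question, "->", answer)
```

Each additional hop nests one more relation into the question, so the answer can only be reached by resolving every intermediate entity in order. That is the multi-hop property the dataset emphasizes.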
<div align="center">
  <img src="toolmind-web.png">
</div>

# Benchmark Results

| Model | GAIA | BrowseComp | BrowseComp-ZH | HLE | Seal-0 | xBench-Deepsearch-05 | xBench-Deepsearch-10 | DSQA |
|------|------|------------|---------------|-----|--------|----------------------|----------------------|------|
| DeepSeek-V3.2 | 0.635 | 0.676 | 0.65 | 0.408 | 0.385 | 0.71 | | / |
| MiniMax-M2 | 0.757 | 0.44 | 0.485 | 0.318 | / | 0.72 | | / |
| GLM-4.6 | 0.719 | 0.451 | 0.495 | 0.304 | / | 0.7 | | / |
| MiroThinker 8B | 0.664 | 0.311 | 0.402 | 0.215 | 0.404 | 0.606 | | / |
| AgentCPM-Explore 4B | 0.639 | 0.25 | 0.29 | 0.191 | 0.4 | 0.7 | / | / |
| **Ours** | | | | | | | | |
| **ToolMind-Web-3B (w/ Synthetic QA only)** | 0.583 | 0.144 | 0.301 | 0.224 | 0.36 | 0.76 | 0.3 | 0.308 |
| **ToolMind-Web-3B** | 0.670 | 0.174 | 0.308 | 0.248 | 0.477 | 0.751 | 0.37 | 0.458 |
| **Nanbeige4.1-3B** | 0.699 | 0.191 | 0.318 | 0.223 | 0.414 | 0.750 | 0.39 | 0.468 |

## Overall Data Distribution

* Some statistics about the data are as follows:

| Statistic | Value |
|------------------------------------------------|---------|
| **Number of Trajectories** | 5624 |
| **Average Number of Conversations per Trajectory** | 138.66 |
| **Average Number of Critical Turns per Trajectory** | 7.25 |
| **Average Count of 'Search and Scrape Webpage'** | 45.04 |
| **Average Count of 'Jina Scrape'** | 20.83 |
| **Average Count of 'Python MCP Server'** | 1.40 |

* Using a judging mechanism, we assessed the importance of each turn and analyzed how critical turns are distributed across the whole conversation. We found that most useful turns are concentrated in the earlier stages, which deviates significantly from the overall turn distribution.

<div align="center">
  <img src="position.png" width="500">
</div>

## Importance of Non-Critical Turns for Model Scaling

* We conducted two experiments using partial data:
  * **Retention with Loss Exclusion**: Keeps Non-Critical Turns in the context but excludes them from the loss calculation.
  * **Removal with Reasoning Augmentation**: Removes Non-Critical Turns from the context and refines the thinking process.

| Setting | xBench-Deepsearch-05 |
|------|------|
| Retention with Loss Exclusion | 0.60 |
| Removal with Reasoning Augmentation | 0.33 |

* Interestingly, we found that turns deemed unimportant play a crucial role in supporting the long context required for tool usage scaling. Removing these non-critical turns resulted in a marked decline in model performance.

# <span id="Limitations">Citation</span>

* If you find our model useful or want to use it in your projects, please cite as follows:

```
@misc{yang2026nanbeige413bsmallgeneralmodel,
  title={Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts},
  author={Chen Yang and Guangyue Peng and Jiaying Zhu and Ran Le and Ruixiang Feng and Tao Zhang and Xiyun Xu and Yang Song and Yiming Jia and Yuntao Wen and Yunzhi Xu and Zekai Wang and Zhenwei An and Zhicong Sun and Zongchao Chen},
  year={2026},
  eprint={2602.13367},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.13367},
}
```

<br>

# <span id="Limitations">Contact</span>

* If you have any questions, please raise an issue or contact us at nanbeige@126.com.

<br>
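As a supplementary note on the two experiments in "Importance of Non-Critical Turns for Model Scaling", the sketch below contrasts how training inputs might be built under each setting. It is an illustration under stated assumptions, not the authors' training code. The turn schema (`token_ids`, `critical`) and both helper functions are hypothetical. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, and it is assumed here that the loss is computed that way. The reasoning-augmentation step of the second setting is not specified in the card, so it is omitted.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss (assumed loss)


def retain_with_loss_exclusion(turns):
    """Keep every turn in the context, but mask non-critical turns in the labels.

    `turns` is a list of dicts with hypothetical keys:
      'token_ids': list[int], 'critical': bool
    Returns (input_ids, labels), two lists of equal length.
    """
    input_ids, labels = [], []
    for turn in turns:
        ids = turn["token_ids"]
        input_ids.extend(ids)
        if turn["critical"]:
            labels.extend(ids)  # contributes to the loss
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # visible as context only
    return input_ids, labels


def remove_non_critical(turns):
    """Drop non-critical turns from the context entirely.

    The card's second setting also refines the thinking process; that step
    is unspecified and therefore not reproduced here.
    """
    input_ids = [tid for turn in turns if turn["critical"] for tid in turn["token_ids"]]
    return input_ids, list(input_ids)


if __name__ == "__main__":
    example = [
        {"token_ids": [11, 12], "critical": False},
        {"token_ids": [21], "critical": True},
        {"token_ids": [31, 32], "critical": False},
    ]
    print(retain_with_loss_exclusion(example))
    print(remove_non_critical(example))
```

In the first setting the model still attends over the full trajectory, while the second shortens the context it is trained on. This may help explain the gap the card reports between the two settings (0.60 versus 0.33 on xBench-Deepsearch-05).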

Provider:

Nanbeige
