DeepWideSearch
Source: ModelScope community (updated 2026-05-15, indexed 2025-11-03)
Download link: https://modelscope.cn/datasets/AIDC-AI/DeepWideSearch
Description:
# 🌐 DeepWideSearch: Evaluating Deep and Wide Agentic Information Seeking
<div align="center">
⭐ _**MarcoPolo Team**_ ⭐
[_**Alibaba International Digital Commerce**_](https://aidc-ai.com)
<img src="https://octodex.github.com/images/original.png" alt="GitHub Octocat" width="22" height="22"> [**Github**](https://github.com/AIDC-AI/Marco-Search-Agent/tree/main/DeepWideSearch) 🤗 [**Hugging Face**](https://huggingface.co/datasets/AIDC-AI/DeepWideSearch) 📝 [**Paper**](https://huggingface.co/papers/2510.20168) 🗂️ [**Data**](https://github.com/AIDC-AI/Marco-Search-Agent/tree/main/DeepWideSearch/data)
</div>
---
## 🎯 Motivation
Along the two dimensions of search depth and search width, existing benchmarks fall into four categories:
- **Low width, high depth** (e.g., GAIA, BrowseComp): focus on intricate, deep reasoning over multi-hop retrieval to find a target answer
- **Low width, low depth** (e.g., TriviaQA, HotpotQA): address simple fact-finding tasks
- **High width, low depth** (e.g., WideSearch, PaSa): emphasize broad information collection for a specific question
- **High width, high depth** (our proposed DeepWideSearch): collect extensive information that requires deep reasoning, a critical capability for real-world applications

As shown in the table, DeepWideSearch is substantially more challenging in both search scope and difficulty.
<img src="assets/teaser_img_v2.png" alt="DeepWideSearch Teaser" width="600" style="max-width: 100%; height: auto; border-radius: 8px; box-shadow: 0 4px 12px rgba(0,0,0,0.15);">
*DeepWideSearch: Bridging depth and width in information seeking*
**DeepWideSearch** is the first benchmark designed to evaluate LLM-based agents on **simultaneous deep reasoning over multi-hop retrieval and wide-scale information retrieval**, a critical capability for real-world tasks such as market analysis and business development. The output of each task is a table: rows are the candidate answers to the question, and columns are the attributes that the question asks to collect for each candidate.

---
## 🔥 News
* [2025/10] 🔥 We released the [paper](https://huggingface.co/papers/2510.20168) and [dataset](https://huggingface.co/datasets/AIDC-AI/DeepWideSearch) for our challenging DeepWideSearch benchmark.
---
## 📊 Benchmark Construction and Overview
### Dataset Conversion
To address the challenge of constructing DeepWideSearch instances from scratch, we propose two conversion methods.

**Deep2Wide Conversion**: We convert existing deep search benchmarks (BrowseComp, BrowseComp-zh) by expanding their scope. The process involves: (1) filtering 100 questions to identify suitable core entities, (2) designing structured table schemas, and (3) comprehensive human annotation to populate tables with verified information.
**Wide2Deep Conversion**: We transform WideSearch queries by introducing complexity in entity identification. The pipeline includes: (1) extracting core entities from 160 WideSearch questions, (2) generating complex sub-questions requiring additional web searches, (3) fusing deep sub-questions with original queries, and (4) human validation to ensure quality and complexity.
Both methods require significant human annotation effort (30-40 minutes per instance) to maintain high quality and ensure the benchmark's rigor. The resulting DeepWideSearch benchmark exhibits a large search scope (table volume) and search depth (average number of search steps).
### Dataset Example and Statistics


---
## 🧪 Evaluation Metrics
We evaluate agent performance along three complementary axes: **Depth**, **Width**, and **Efficiency**.
### Depth Evaluation
Measures the capability of agents to correctly identify target entities through deep reasoning over multi-hop retrieval:
- **Column-F1**: F1 score over unique columns in the table, corresponding to core attributes that uniquely identify entities
- **Core Entity Accuracy (CE Acc.)**: Accuracy in identifying the core entity of questions
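
The two depth metrics can be illustrated with a minimal Python sketch. It is not the benchmark's evaluation code: it assumes that values are compared by exact match after simple normalization, that Column-F1 is computed over the set of values in one unique (key) column, and that CE Acc. is averaged over questions. The function names `column_f1` and `core_entity_accuracy` are hypothetical.

```python
from typing import Iterable, List


def _norm(value: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(str(value).lower().split())


def column_f1(pred_values: Iterable[str], gold_values: Iterable[str]) -> float:
    """F1 between the predicted and gold value sets of one unique column."""
    pred_set = {_norm(v) for v in pred_values}
    gold_set = {_norm(v) for v in gold_values}
    hits = len(pred_set & gold_set)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_set)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def core_entity_accuracy(pred_entities: List[str], gold_entities: List[str]) -> float:
    """Fraction of questions whose predicted core entity matches the gold one."""
    if not gold_entities:
        return 0.0
    correct = sum(_norm(p) == _norm(g) for p, g in zip(pred_entities, gold_entities))
    return correct / len(gold_entities)
```

Exact matching is a simplification chosen to keep the sketch short; the released evaluation code in `eval/` may align and compare values differently.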
### Width Evaluation
Measures how comprehensively and accurately agents retrieve all associated information units:
- **Success Rate**: Binary metric for exact match with human-annotated ground truth (all rows, columns, and values identical)
- **Row-level F1**: Precision, recall, and F1 scores at the entity level, capturing complete contextual information per entity
- **Item-level F1**: Finest-grained metric evaluating accuracy at individual cell level
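
As a rough illustration of how the three width metrics relate, the following Python sketch scores a predicted table against a gold table. It rests on several assumptions that the text above does not state: rows are aligned on a single key column (the hypothetical `key` argument), cells are compared by exact match after simple normalization, and extra predicted columns are ignored. The name `width_metrics` is hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]  # column name -> cell value


def _norm(value: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(str(value).lower().split())


def _f1(num_correct: int, num_pred: int, num_gold: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts."""
    precision = num_correct / num_pred if num_pred else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def width_metrics(pred: List[Row], gold: List[Row], key: str) -> Dict[str, float]:
    """Score a predicted table against a non-empty gold table, aligning rows on `key`."""
    columns = list(gold[0].keys())
    pred_by_key = {_norm(r.get(key, "")): r for r in pred}
    gold_by_key = {_norm(r[key]): r for r in gold}

    row_hits = 0   # rows whose every cell matches
    item_hits = 0  # individual cells that match
    for k, gold_row in gold_by_key.items():
        pred_row = pred_by_key.get(k)
        if pred_row is None:
            continue
        matches = [_norm(pred_row.get(c, "")) == _norm(gold_row[c]) for c in columns]
        item_hits += sum(matches)
        row_hits += all(matches)

    _, _, row_f1 = _f1(row_hits, len(pred_by_key), len(gold_by_key))
    _, _, item_f1 = _f1(
        item_hits, len(pred_by_key) * len(columns), len(gold_by_key) * len(columns)
    )
    # Success requires every gold row to be fully matched and no extra predicted rows.
    success = float(row_hits == len(gold_by_key) == len(pred_by_key))
    return {"success_rate": success, "row_f1": row_f1, "item_f1": item_f1}
```

The sketch makes the strictness ordering visible: a single wrong cell lowers Item-level F1 slightly, zeroes the credit for its row in Row-level F1, and sets Success Rate to 0 for the whole question.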
### Evaluation Protocol
We conduct **4 independent runs** per question and report three statistics:
- **Avg@4**: Mean performance across all four runs
- **Max@4**: Best performance observed across the four runs
- **Pass@4**: Proportion of questions solved successfully in at least one run (Success Rate only)
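
The aggregation over runs can be sketched as follows. This is an illustrative Python sketch, not the released evaluation script; it assumes that Avg@4 and Max@4 are first computed per question and then averaged over questions, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def aggregate_runs(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """`scores` maps instance_id -> one metric value per run (four runs here)."""
    per_question_avg = [mean(runs) for runs in scores.values()]
    per_question_max = [max(runs) for runs in scores.values()]
    return {
        "avg@k": mean(per_question_avg),  # mean performance across runs
        "max@k": mean(per_question_max),  # best run per question, averaged
    }


def pass_at_k(success: Dict[str, List[bool]]) -> float:
    """Share of questions solved in at least one run (Success Rate only)."""
    return mean([1.0 if any(runs) else 0.0 for runs in success.values()])
```

For example, a question solved in one of four runs contributes 0.25 to the averaged Success Rate but counts fully toward Pass@4.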
---
## 🚀 Sample Usage
### 📁 Repository Structure
```
DeepWideSearch/
├── data/ # Full benchmark (JSON format and CSV tables)
├── eval/ # Evaluation codebase
├── scripts/ # Evaluation scripts
├── assets/
├── LICENSE
├── requirements.txt
└── README.md
```
---
### Install
```bash
git clone https://github.com/AIDC-AI/Marco-Search-Agent
cd Marco-Search-Agent/DeepWideSearch
pip install -r requirements.txt
```
### Run Your Agent to Obtain Results
The folder structure should be like this:
```
YOUR_MODEL_GENERATION_PATH
├── claude-sonnet-4
├── deepseek-r1
├── deepseek-v3
├── gemini-2.5-pro
├── gpt-4o
├── gpt-5
├── kimi-k2
├── o3-mini
├── owl_claude-sonnet-4
├── owl_gemini-2.5-pro
├── owl_gpt-5
├── qwen-max
├── qwen3-235b-a22b
├── qwen3-235b-a22b-instruct
├── qwen3-32b
├── smolagents_claude-sonnet-4
├── smolagents_gpt-5
├── smolgents_gemini-2.5-pro
├── websailor_claude4
├── websailor_gemini-2.5-pro
└── websailor_gpt-5
```
Each model sub-folder contains the following four files in JSONL format, for example:
```
YOUR_MODEL_GENERATION_PATH/deepseek-r1
├── iter1.jsonl
├── iter2.jsonl
├── iter3.jsonl
└── iter4.jsonl
```
These four files will be used to evaluate Avg@4, Max@4 and Pass@4 metrics.
Each line of a JSONL file has the following format:
```jsonl
{"instance_id": "deep2wide_result_5_Lin Dan", "question": "There is a Chinese athlete who has achieved outstanding success in a racket sport. He was the first player in his discipline to successfully defend a major championship title and holds multiple world championship titles. His sport underwent significant rule changes in the early 21st century, and he became the first male singles Olympic champion in the post-rule-change era. Please help me compile and summarize this athlete’s competition records between 2010 and 2020 into a clear Markdown table, including the following columns: Date, Tournament Name, Level, Event, Result, and Match Details (including opponent, score, and win/loss outcome) ...", "rollout_id": 1, "prediction": "...", "messages": [{"role": "system", "content": "You are a Web Information Seeking Master. Your task is to thoroughly seek the internet for information and provide accurate answers to questions ..."}, {"role": "user", "content": "A conversation between User and Assistant ..."}, {"role": "assistant", "content": " ... "}]}
```
Each record in the JSONL file must contain the following keys:
* `instance_id`: the ID of the benchmark instance
* `question`: the question passed to the LLM or agent
* `prediction`: the content of the last assistant response
* `rollout_id`: the run index (1, 2, 3, or 4)
* `messages`: the conversation messages; the last turn of model generation will be used
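
Before running the evaluation, it can help to sanity-check the generated files against the key list above. The following Python sketch is a hypothetical helper, not part of the repository; the check that the last message has the `assistant` role is an assumption inferred from the example record.

```python
import json
from typing import Iterable, List

REQUIRED_KEYS = {"instance_id", "question", "prediction", "rollout_id", "messages"}


def validate_lines(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means the lines look valid."""
    problems = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {line_no}: record is not a JSON object")
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append(f"line {line_no}: missing keys {sorted(missing)}")
            continue
        if record["rollout_id"] not in (1, 2, 3, 4):
            problems.append(f"line {line_no}: rollout_id must be 1, 2, 3, or 4")
        messages = record["messages"]
        # Assumption: the final message is the assistant turn that is evaluated.
        if (
            not isinstance(messages, list)
            or not messages
            or not isinstance(messages[-1], dict)
            or messages[-1].get("role") != "assistant"
        ):
            problems.append(f"line {line_no}: last message is not an assistant turn")
    return problems
```

To check a file, open it with `open(path, encoding="utf-8")` and pass the file object to `validate_lines`; the function reads it line by line.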
### Evaluate Your Agent
```bash
#!/bin/bash
THREAD_NUM=4
# Models to evaluate (one sub-folder per model, as in the folder listing above)
models=("o3-mini" "claude-sonnet-4" "gemini-2.5-pro" "qwen-max" "deepseek-r1" "deepseek-v3" "kimi-k2" "qwen3-235b-a22b" "qwen3-235b-a22b-instruct" "qwen3-32b" "gpt-4o" "gpt-5" "owl_claude-sonnet-4" "owl_gemini-2.5-pro" "owl_gpt-5" "smolagents_claude-sonnet-4" "smolgents_gemini-2.5-pro" "smolagents_gpt-5" "websailor_gpt-5" "websailor_claude4" "websailor_gemini-2.5-pro")
for model in "${models[@]}"
do
    echo "============================"
    echo "evaluate $model begin"
    echo "============================"
    # Quote the arguments so names with special characters are passed intact
    ./scripts/eval.sh "$model" "$THREAD_NUM"
done
```
---
## 📈 Key Findings

Base LLMs (even GPT-5, Claude Sonnet 4) score **< 1% Success Rate**. Best agent (**WebSailor + Claude Sonnet 4**) achieves only **2.39% Avg@4 Success Rate**.
<img src="assets/efficiency_metric.jpg" alt="Closed Source Main Exp Result" width="75%" style="border-radius: 8px; box-shadow: 0 2px 8px rgba(0,0,0,0.1);">
Solving deep and wide search questions incurs a high inference cost. For example, OWL (GPT-5) costs over \$2.75 per question, even though most questions remain unsolved. In addition, because of unstable network conditions and tool-call errors, agents often need multiple retries to complete steps such as search, which significantly increases computational overhead: OWL (GPT-5), for instance, incurs an average cost exceeding \$6.8 under retry conditions.
---
## 🤝 Contributing
This project builds upon the great open-source implementation of [WideSearch](https://github.com/ByteDance-Seed/WideSearch) by ByteDance. We sincerely thank the ByteDance-Seed team for their pioneering work and for releasing WideSearch under the MIT open-source license. Our data construction pipeline, evaluation metrics, and codebase are heavily inspired by their framework. We acknowledge their contribution as a foundational component of DeepWideSearch.
We welcome:
- New test cases (especially from real industrial scenarios)
- Improved evaluation metrics
- Agent implementations & baselines
Please open an issue/PR or contact us ([Tian Lan](https://github.com/gmftbyGMFTBY) and [Longyue Wang](https://www.longyuewang.com/)).
---
## 🛡️ License
This project is licensed under the **Apache-2.0 License**
---
Provider: maas
Created: 2025-10-27



