
deepsearchqa

ModelScope community · updated 2026-05-07 · indexed 2026-05-10
Download link:
https://modelscope.cn/datasets/google/deepsearchqa
Resource description:
# DeepSearchQA

#### A 900-prompt factuality benchmark from Google DeepMind, designed to evaluate agents on difficult multi-step information-seeking tasks across 17 different fields.

▶ [Google DeepMind Release Blog Post](https://blog.google/technology/developers/deep-research-agent-gemini-api/)\
▶ [DeepSearchQA Leaderboard on Kaggle](https://www.kaggle.com/benchmarks/google/dsqa)\
▶ [Technical Report](https://storage.googleapis.com/deepmind-media/DeepSearchQA/DeepSearchQA_benchmark_paper.pdf)\
▶ [Evaluation Starter Code](https://www.kaggle.com/code/andrewmingwang/deepsearchqa-starter-code)

## Benchmark

DeepSearchQA is a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks across 17 different fields. Unlike traditional benchmarks that target single-answer retrieval or broad-spectrum factuality, DeepSearchQA features a dataset of challenging, hand-crafted tasks designed to evaluate an agent’s ability to execute complex search plans to generate exhaustive answer lists.

Each task is structured as a "causal chain", where discovering information for one step is dependent on the successful completion of the previous one, stressing long-horizon planning and context retention. All tasks are grounded in the open web with objectively verifiable answer sets.

DeepSearchQA is meant to be used to evaluate LLMs or LLM agents with access to the web.

## Dataset Description

This dataset is a collection of 900 examples. Each example is composed of:

* A problem (`problem`) which is the prompt testing parametric knowledge.
* A problem category (`problem_category`) specifying which of 17 different domains the problem belongs to.
* A gold answer (`answer`) which is used in conjunction with the evaluation prompt to judge the correctness of an LLM's response.
* An answer type classification (`answer_type`) specifying whether a single answer or set of answers is expected as a response. This information should NOT be given to the LLM during inference time. 65% of answers are of type `Set Answer`.

See the [Technical Report](https://storage.googleapis.com/deepmind-media/DeepSearchQA/DeepSearchQA_benchmark_paper.pdf) for methodology details.

## Limitations

While DeepSearchQA offers a robust framework for evaluating comprehensive retrieval, it relies on specific design choices that entail certain limitations. By employing an exclusively outcome-based evaluation, we effectively treat any agent that is evaluated as a black box. In the absence of trajectory data, it is difficult to distinguish between an agent that reasoned correctly and one that arrived at the correct list through inefficient or accidental means (e.g., lucky guessing).

Additionally, the static web assumption, while necessary for reproducibility, limits the evaluation of “breaking news” retrieval where ground truth is volatile. A task’s ground truth may become outdated if source websites are removed or their content is significantly altered. This is a prevalent challenge for all benchmarks operating on the live web, necessitating periodic manual reviews and updates to the dataset.

Questions, comments, or issues? Share your thoughts with us in the [discussion forum](https://www.kaggle.com/benchmarks/google/dsqa/discussion).

## Evaluation Prompt

The autorater which should be used for DeepSearchQA is `gemini-2.5-flash` with the grading prompt found in the [starter notebook](https://www.kaggle.com/code/andrewmingwang/deepsearchqa-starter-code) on Kaggle. Using a different autorater model or grading prompt will likely result in statistically significant deviation in results.

## Citation

Coming soon.
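As a minimal sketch of the record layout in the Dataset Description, the Python snippet below models one example with the four documented fields and shows one way to hand only `problem` to the system under test while holding the remaining fields back for grading. The `Example` class, the helper names, the sample records, and the `Single Answer` label are assumptions made for illustration. Only the four field names and the `Set Answer` label come from the description above. Withholding `answer` and `problem_category` along with `answer_type` is a conservative reading of the instruction, not something the description spells out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One benchmark record with the four fields named in the description."""
    problem: str
    problem_category: str
    answer: str
    answer_type: str


def inference_input(example: Example) -> str:
    """Return the only text the evaluated agent sees: the problem itself.

    answer_type (and, conservatively, answer and problem_category) are
    held back so they cannot leak into the agent's prompt.
    """
    return example.problem


def set_answer_fraction(examples: list[Example]) -> float:
    """Fraction of examples whose answer_type is 'Set Answer'."""
    if not examples:
        return 0.0
    matches = sum(e.answer_type == "Set Answer" for e in examples)
    return matches / len(examples)


# Hypothetical records, invented for illustration; they are not taken
# from the dataset. 'Single Answer' is an assumed label.
examples = [
    Example(
        problem="List every item that satisfies conditions A and B.",
        problem_category="Hypothetical category",
        answer="item one, item two",
        answer_type="Set Answer",
    ),
    Example(
        problem="Which single item satisfies condition C?",
        problem_category="Hypothetical category",
        answer="item three",
        answer_type="Single Answer",
    ),
]

prompts = [inference_input(e) for e in examples]
fraction = set_answer_fraction(examples)
```

Run over the full dataset, `set_answer_fraction` offers a quick sanity check against the stated 65% share of `Set Answer` examples. Grading itself still belongs to the autorater and grading prompt named in the Evaluation Prompt section.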

Provider:
maas
Created:
2025-12-18