google/deepsearchqa
Source: Hugging Face | Updated: 2025-12-17 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/google/deepsearchqa
Resource description:
DeepSearchQA is a 900-prompt factuality benchmark from Google DeepMind, designed to evaluate agents on difficult multi-step information-seeking tasks across 17 different fields. Unlike traditional benchmarks that target single-answer retrieval or broad-spectrum factuality, it consists of challenging, hand-crafted tasks that test an agent's ability to execute complex search plans and produce exhaustive answer lists. Each task is structured as a causal chain: finding the information for one step depends on successfully completing the previous one, which stresses long-horizon planning and context retention. All tasks are grounded in the open web and have objectively verifiable answer sets. DeepSearchQA is intended for evaluating LLMs or LLM agents that have web access. Each of the 900 examples consists of a problem (`problem`), a problem category (`problem_category`), a gold answer (`answer`), and an answer type classification (`answer_type`). 65% of answers are of type `Set Answer`.
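The record layout described above can be checked with a short script. The following is a minimal sketch, not part of the dataset's tooling: it assumes each example is available as a dictionary with the four fields named above, and the helper `summarize` is a hypothetical name introduced for illustration. The toy records are made-up placeholders, not real dataset rows, and the `Single Answer` label is an assumed name for the non-set answer type.

```python
from collections import Counter

# Field names taken from the dataset description above.
FIELDS = ("problem", "problem_category", "answer", "answer_type")


def summarize(examples):
    """Count examples per answer type and per problem category.

    Hypothetical helper: `examples` is any iterable of dict-like records
    carrying the four fields listed in FIELDS.
    """
    examples = list(examples)
    for i, ex in enumerate(examples):
        missing = [f for f in FIELDS if f not in ex]
        if missing:
            raise KeyError(f"example {i} is missing fields: {missing}")

    total = len(examples)
    type_counts = Counter(ex["answer_type"] for ex in examples)
    category_counts = Counter(ex["problem_category"] for ex in examples)
    # Share of each answer type; empty when there are no examples.
    type_share = {t: c / total for t, c in type_counts.items()} if total else {}
    return {
        "total": total,
        "answer_type_share": type_share,
        "category_counts": dict(category_counts),
    }


# Made-up placeholder records (NOT real dataset rows).
# "Single Answer" is an assumed label for the non-set answer type.
toy_examples = [
    {"problem": "placeholder question 1", "problem_category": "Category A",
     "answer": "x, y, z", "answer_type": "Set Answer"},
    {"problem": "placeholder question 2", "problem_category": "Category A",
     "answer": "p, q", "answer_type": "Set Answer"},
    {"problem": "placeholder question 3", "problem_category": "Category B",
     "answer": "m, n", "answer_type": "Set Answer"},
    {"problem": "placeholder question 4", "problem_category": "Category B",
     "answer": "k", "answer_type": "Single Answer"},
]

summary = summarize(toy_examples)
print(summary["total"])
print(summary["answer_type_share"])
```

On the real data, the same helper could be applied to rows loaded with the Hugging Face `datasets` library, for example via `load_dataset("google/deepsearchqa")`. The available split names are not stated here, so they would need to be checked on the dataset page. If the description above holds, the share reported for `Set Answer` should come out near 0.65.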
Provider:
google



