WideSearch
Source: ModelScope community. Updated: 2025-12-04. Indexed: 2025-08-09.
Download link:
https://modelscope.cn/datasets/ByteDance-Seed/WideSearch
Description:
# WideSearch: Benchmarking Agentic Broad Info-Seeking
## Dataset Summary
WideSearch is a benchmark designed to evaluate the capabilities of Large Language Model (LLM) driven agents in **broad information-seeking** tasks. Unlike existing benchmarks that focus on finding a single, hard-to-find fact, WideSearch assesses an agent's ability to handle tasks that require gathering a large amount of scattered, yet easy-to-find, information.
The challenge in these tasks lies not in cognitive difficulty, but in the operational scale, repetitiveness, and the need for **Completeness** and **Factual Fidelity** in the final result. For example, a financial analyst gathering key metrics for all companies in a sector, or a job seeker collecting every vacancy that meets their criteria.
The benchmark, originating from the research paper "WideSearch: Benchmarking Agentic Broad Info-Seeking," contains 200 meticulously designed tasks (100 in English, 100 in Chinese).
See our [paper](https://arxiv.org/abs/2508.07999) and [GitHub repo](https://github.com/ByteDance-Seed/WideSearch) for more details.
## Dataset Structure
The dataset consists of two components: a task file and a directory containing the ground-truth answers.
```
/
├── widesearch.jsonl
└── widesearch_gold/
    ├── ws_en_001.csv
    ├── ws_zh_001.csv
    └── ...
```
### Data Instances
`widesearch.jsonl` is a JSON Lines file, where each line represents a single task.
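A minimal sketch of reading the task file, assuming it has been downloaded locally. The helper name `load_tasks` is hypothetical and not part of the official tooling. Because `evaluation` is stored as a JSON string, the sketch decodes it a second time so it can be used as a dictionary.

```python
import json
from pathlib import Path


def load_tasks(path):
    """Read a JSON Lines file and decode the nested `evaluation` string."""
    tasks = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            task = json.loads(line)
            # `evaluation` is itself JSON text, so it needs a second decode.
            if isinstance(task.get("evaluation"), str):
                task["evaluation"] = json.loads(task["evaluation"])
            tasks.append(task)
    return tasks
```

For example, `load_tasks("widesearch.jsonl")` would return a list of dictionaries, one per task, assuming the file sits in the current directory.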
**Example:**
```json
{
"instance_id": "ws_en_001",
"query": "My son is about to start his university applications but he\u2019s still uncertain about both his major and which universities to apply to. Could you help me find the top five universities in each of the five broad subjects from the QS World University Rankings by Subject 2025, and also check their standings in the QS World University Rankings 2025 and the Times Higher Education World University Rankings 2025? And I need the home page of the university's official website, standard application deadline for regular decision as well as the application fee without the fee waiver.Please organize the results in one Markdown table with the following columns:\nSubject, University, QS World University Rankings by Subject 2025, QS World University Rankings 2025, Times Higher Education World University Rankings 2025, Home Page, Application Deadline, Application Fee\nPlease use the universities\u2019 full official names in English. \nUse only Arabic numerals in the ranking, for example: 1.\n\nThe output format is ```markdown\n{data_content}\n```.",
"evaluation": "{\"unique_columns\": [\"subject\", \"university\"], \"required\": [\"subject\", \"university\", \"qsworlduniversityrankingsbysubject2025\", \"qsworlduniversityrankings2025\", \"timeshighereducationworlduniversityrankings2025\", \"homepage\", \"applicationdeadline\", \"applicationfee\"], \"eval_pipeline\": {\"applicationdeadline\": {\"preprocess\": [\"norm_str\"], \"metric\": [\"llm_judge\"], \"criterion\": \"It is sufficient if the semantics are approximately the same as the reference answer or if they point to the same entity. There is no need for a word-for-word correspondence.\\nThe month and day must be correct\"}, \"applicationfee\": {\"preprocess\": [\"norm_str\"], \"metric\": [\"llm_judge\"], \"criterion\": \"It is sufficient if the semantics are approximately the same as the reference answer or if they point to the same entity. There is no need for a word-for-word correspondence.\\nIf there are multiple fees in the reference answer, all must be included.\"}, \"homepage\": {\"preprocess\": [\"norm_str\"], \"metric\": [\"url_match\"]}, \"subject\": {\"preprocess\": [\"norm_str\"], \"metric\": [\"exact_match\"]}, \"university\": {\"preprocess\": [\"norm_str\"], \"metric\": [\"exact_match\"]}, \"qsworlduniversityrankingsbysubject2025\": {\"preprocess\": [\"norm_str\"], \"metric\": [\"exact_match\"]}, \"qsworlduniversityrankings2025\": {\"preprocess\": [\"norm_str\"], \"metric\": [\"exact_match\"]}, \"timeshighereducationworlduniversityrankings2025\": {\"preprocess\": [\"norm_str\"], \"metric\": [\"exact_match\"]}}}",
"language": "en"
}
```
```json
{
"instance_id": "ws_zh_001",
"query": "我要做电影研究,需要你列出来2020年-2024年(包含2020年和2024年)每年中国、美国本国票房前五的电影,表头需要包括年份、国家(如中国、美国)、电影名、导演、本国整体累计票房收益(不局限于当年,以亿为单位,保留到小数点后一位,例如7.9亿元,需要带上各国货币单位,中国电影以亿元为单位,美国电影为亿美元为单位)、电影类型。请以Markdown表格的格式输出整理后的数据,全部输出采用中文。请注意,对于当年12月末上映的电影、大部分票房收益落在下一年的,视为下一年的电影。请以Markdown表格的格式输出整理后的数据。\n表格中的列名依次为:\n年份、国家、电影名、导演、本国累计票房收益、电影类型\n\n格式为```markdown\n{数据内容}\n```。",
"evaluation": "{\"unique_columns\": [\"国家\", \"电影名\"], \"required\": [\"年份\", \"国家\", \"电影名\", \"导演\", \"本国累计票房收益\", \"电影类型\"], \"eval_pipeline\": {\"国家\": {\"preprocess\": [\"norm_str\"], \"metric\": [\"exact_match\"]}, \"年份\": {\"preprocess\": [\"norm_str\"], \"metric\": [\"exact_match\"]}, \"本国累计票房收益\": {\"preprocess\": [\"extract_number\"], \"metric\": [\"number_near\"], \"criterion\": 0.1}, \"导演\": {\"preprocess\": [\"norm_str\"], \"metric\": [\"llm_judge\"], \"criterion\": \"和参考答案语义相同大致、或者指向的实体一致即可,不需要字字对应。\\n答出子集且未答出参考答案以外的内容时可算正确\"}, \"电影类型\": {\"preprocess\": [\"norm_str\"], \"metric\": [\"llm_judge\"], \"criterion\": \"和参考答案语义相同大致、或者指向的实体一致即可,不需要字字对应。\\n答出参考答案中的部分类型(即子集)即视为正确、基于权威来源及官方依据的类型标注同样正确、答出其中一个子集其他类型内容合理也视为正确。\"}, \"电影名\": {\"preprocess\": [\"norm_str\"], \"metric\": [\"llm_judge\"], \"criterion\": \"和参考答案语义相同大致、或者指向的实体一致即可,不需要字字对应。\"}}}",
"language": "zh"
}
```
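Both example queries ask the agent to return its table inside a `markdown` fenced block. The sketch below shows one way such a response might be turned into rows. The function name `parse_markdown_table` is hypothetical, and the parsing rules are assumptions (a simple pipe table with no escaped `|` characters inside cells); the official evaluation code may behave differently.

```python
import re

# Matches the body of the first fenced block tagged "markdown".
FENCE_RE = re.compile(r"```markdown\s*\n(.*?)```", re.DOTALL)
SEPARATOR_RE = re.compile(r":?-+:?")


def _split_row(line):
    """Split one pipe-table line into stripped cells."""
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_markdown_table(response):
    """Return the rows of the first fenced Markdown table as a list of dicts."""
    match = FENCE_RE.search(response)
    if match is None:
        return []
    lines = [ln.strip() for ln in match.group(1).splitlines() if ln.strip()]
    rows = [_split_row(ln) for ln in lines]
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[1:]
    # Drop the header separator row (cells such as --- or :---:).
    if body and all(SEPARATOR_RE.fullmatch(c) for c in body[0]):
        body = body[1:]
    # Rows whose cell count differs from the header are skipped.
    return [dict(zip(header, r)) for r in body if len(r) == len(header)]
```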
### Data Fields
* `instance_id` (string): A unique identifier for the task. This ID corresponds to the filename of the ground-truth CSV file in the `widesearch_gold` directory (e.g., `ws_en_001` corresponds to `ws_en_001.csv`).
* `query` (string): The natural language instruction given to the AI agent. It details the task requirements, the data columns to be collected, and the final Markdown table format.
* `evaluation` (string): A JSON-encoded string holding an object with all the information needed for automated evaluation. Its keys are:
    * `unique_columns` (list): The primary key column(s) used to uniquely identify a row in the table.
    * `required` (list): All column names that must be present in the agent's generated response.
    * `eval_pipeline` (dict): Defines the evaluation method for each column. Each column entry contains:
        * `preprocess` (list): Preprocessing steps applied to the cell data before evaluation (e.g., `norm_str` to normalize strings, `extract_number` to extract numbers).
        * `metric` (list): The metric(s) used to compare the predicted value with the ground truth (e.g., `exact_match`, `number_near` for numerical approximation, `llm_judge` for judgment by an LLM).
        * `criterion` (float or string): Specific criteria for the metric. For `number_near`, this is the allowed relative tolerance; for `llm_judge`, it is the scoring guide for the "judge" LLM.
* `language` (string): The language of the task (`en` or `zh`).
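To illustrate how one `eval_pipeline` entry could be applied to a single cell, here is a rough sketch. The bodies of `norm_str`, `extract_number`, `exact_match`, and `number_near` are assumptions written for illustration only; the reference implementations live in the GitHub repo and may differ. `url_match` and `llm_judge` are deliberately left out, and `score_cell` is a hypothetical helper.

```python
import re
import unicodedata


def norm_str(value):
    """Assumed normalization: Unicode NFKC, lower-case, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", str(value)).lower()
    return " ".join(text.split())


def extract_number(value):
    """Assumed behavior: return the first number in the cell, or None."""
    match = re.search(r"-?\d+(?:\.\d+)?", str(value).replace(",", ""))
    return float(match.group()) if match else None


def exact_match(pred, gold, criterion=None):
    return pred == gold


def number_near(pred, gold, criterion=None):
    """True when pred lies within the relative tolerance `criterion` of gold."""
    if pred is None or gold is None:
        return False
    tolerance = 0.0 if criterion is None else float(criterion)
    return abs(pred - gold) <= tolerance * abs(gold)


PREPROCESS = {"norm_str": norm_str, "extract_number": extract_number}
METRICS = {"exact_match": exact_match, "number_near": number_near}


def score_cell(pred, gold, spec):
    """Apply one column's `eval_pipeline` entry to a predicted/gold cell pair."""
    for step in spec.get("preprocess", []):
        pred, gold = PREPROCESS[step](pred), PREPROCESS[step](gold)
    results = []
    for name in spec["metric"]:
        if name not in METRICS:
            # url_match and llm_judge are outside the scope of this sketch.
            raise NotImplementedError(name)
        results.append(METRICS[name](pred, gold, spec.get("criterion")))
    return all(results)
```

With the `ws_zh_001` box-office column (`extract_number`, `number_near`, `criterion` of 0.1), a prediction of `7.9亿元` against a gold value of `8.0亿元` would pass under these assumptions, since the difference is within 10% of the gold value.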
### Ground Truth Data
The `widesearch_gold/` directory contains the ground-truth answers for each task, stored in CSV format. Filenames correspond to the `instance_id`. These files were created by human experts through exhaustive web searches and cross-validation, representing a high-quality "gold standard".
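As a sketch, a gold file could be loaded and indexed by its `unique_columns` as follows. The helper names are hypothetical. The header normalization (lower-casing and removing whitespace) is an assumption inferred from the `ws_en_001` example, where a query column such as "Home Page" appears in `required` as `homepage`; the actual CSV headers have not been verified here.

```python
import csv


def normalize_header(name):
    """Assumed rule: lower-case and strip all whitespace from a column name."""
    return "".join(name.lower().split())


def load_gold(csv_path, unique_columns):
    """Index gold rows by the tuple of their values in `unique_columns`."""
    index = {}
    # utf-8-sig tolerates a byte-order mark at the start of the file.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for raw in csv.DictReader(f):
            row = {normalize_header(k): v for k, v in raw.items() if k is not None}
            key = tuple(row[col] for col in unique_columns)
            index[key] = row
    return index
```

For a task with `instance_id` equal to `ws_en_001`, the path would be `widesearch_gold/ws_en_001.csv`, and `unique_columns` would come from the decoded `evaluation` field.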
## Citation
If you use this dataset in your research, please cite the following paper:
```bibtex
@misc{wong2025widesearchbenchmarkingagenticbroad,
title={WideSearch: Benchmarking Agentic Broad Info-Seeking},
author={Ryan Wong and Jiawei Wang and Junjie Zhao and Li Chen and Yan Gao and Long Zhang and Xuan Zhou and Zuo Wang and Kai Xiang and Ge Zhang and Wenhao Huang and Yang Wang and Ke Wang},
year={2025},
eprint={2508.07999},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.07999},
}
```
Provider:
maas
Created:
2025-08-06



