LiveDRBench

ModelScope community · updated 2025-12-05 · indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/microsoft/LiveDRBench
# Dataset Card for LiveDRBench: Deep Research as Claim Discovery

[Arxiv Paper](https://arxiv.org/abs/2508.04183) | [Hugging Face Dataset](https://huggingface.co/datasets/microsoft/LiveDRBench) | [Evaluation Code](https://github.com/microsoft/LiveDRBench)

We propose a formal characterization of the deep research (DR) problem and introduce a new benchmark, _LiveDRBench_, to evaluate the performance of DR systems. To enable objective evaluation, we define DR using an intermediate output representation that encodes the key claims uncovered during search, which separates the reasoning challenge from surface-level report generation.

## Dataset Details

The benchmark consists of 100 challenging DR tasks over scientific topics (e.g., dataset discovery, materials discovery, novelty search, prior art discovery) and public interest events (e.g., the Oscars). The data was collected between May and June 2025. We plan to keep the benchmark live and release periodic updates with new tasks.

Each task consists of (a) a prompt with a short description of the task and the expected output format; and (b) ground-truth JSON containing the claims and references that should be uncovered. We also include an evaluation script that computes the performance of DR systems using information-retrieval metrics, namely precision, recall, and F1 scores.

The benchmark contains eight categories: SciFacts-Geo, SciFacts-Materials, NovelDatasets identification, NovelDatasets identification and extraction, NovelDatasets peer retrieval, PriorArt search, Entities, and Flight incidents.

The evaluation code for the benchmark can be obtained at [GitHub](https://github.com/microsoft/livedrbench).

A detailed discussion of LiveDRBench, including how it was developed and tested, can be found in our [Arxiv paper](https://arxiv.org/abs/2508.04183).
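
To make the three metrics named above concrete, here is a minimal sketch of set-based precision, recall, and F1 over claims. It assumes claims can be compared by exact equality, which is a simplification: the evaluation script in the repository uses an LLM judge to match predictions against the ground truth, so this is not its implementation. The function name and the example claims are hypothetical.

```python
def precision_recall_f1(predicted, ground_truth):
    """Set-based precision, recall, and F1 over hashable claims.

    Assumption: two claims match only if they are exactly equal.
    """
    pred, gold = set(predicted), set(ground_truth)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical claims: two of three predictions are correct,
# and two of four ground-truth claims are recovered.
precision, recall, f1 = precision_recall_f1(
    predicted=["claim A", "claim B", "claim X"],
    ground_truth=["claim A", "claim B", "claim C", "claim D"],
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this example precision is 2/3, recall is 1/2, and F1 is their harmonic mean, 4/7.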

## Usage

To use LiveDRBench's questions, you can load the benchmark using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

livedrbench = load_dataset("microsoft/LiveDRBench", "v1-full")['test']
```

To evaluate predictions on LiveDRBench, provide a predictions file with the following JSON schema:

```
[
  {
    "key": str,                              // Unique identifier from livedrbench.csv
    "preds": List[List[dict | str] | dict]   // Predictions in the format specified by each question in livedrbench.csv
  },
  ...
]
```

Then, run the evaluation script in the GitHub repository. This script will compute **precision**, **recall**, and **F1** scores for each benchmark category.

```bash
python src/evaluation.py \
    --openai_api_key YOUR_API_KEY \
    --preds_file path/to/your/predictions.json \
    [--openai_model_name gpt-4o] \
    [--num_threads 8] \
    [--debug]
```

- `--openai_api_key` (required): Your OpenAI API key.
- `--preds_file` (required): Path to the predictions JSON file.
- `--openai_model_name` (optional): Model to use as judge (default: gpt-4o).
- `--num_threads` (optional): Number of parallel threads (default: 8).
- `--debug` (optional): Enable debug mode, without multithreading.

## Intended Uses

The LiveDRBench benchmark is intended to be used together with the GitHub repository. The code and the benchmark are being shared with the research community to facilitate reproduction of our results and foster further research in this area. LiveDRBench is intended to be used by domain experts who are independently capable of evaluating the quality of outputs before acting on them.

## Out-of-scope Uses

- LiveDRBench is not well suited for training new Deep Research models. It only provides a test set. To avoid accidental test set leakage, we encrypt the answers in the benchmark, following the procedure of [BrowseComp benchmark's release](https://github.com/openai/simple-evals/blob/main/browsecomp_eval.py).
- The LiveDRBench dataset is not representative of all kinds of Deep Research queries, especially those that require assessing the writing quality of long reports.
- We do not recommend using the LiveDRBench repo or the dataset in commercial or real-world applications without further testing and development. They are being released for research purposes.
- LiveDRBench should not be used in highly regulated domains where inaccurate outputs could suggest actions that lead to injury or negatively impact an individual's legal, financial, or life opportunities.

## Data Creation: Problem Inversion

Creating LiveDRBench involves a _problem inversion_ process that makes it easy to update the benchmark with new instances, given a set of existing reasoning problems. The first step is to find a long-context or document reasoning problem that includes a question based on the document and its ground-truth answer. In the second step, this problem is inverted to create a new question asking for an event or entity consistent with the properties mentioned in an answer. In the third step, the question is refined (e.g., more properties are added) such that it admits a unique answer. Finally, the ground-truth set of reference documents is updated in case there are additional documents that provide the same answer.

For example, existing data from the [Curie](https://github.com/google/curie) benchmark consists of scientific papers and questions that could be answered based on each paper. The data was transformed to create questions that need to be answered without access to the paper, and thus involve non-trivial search and reasoning. The final ground-truth answers for each question were verified by MSR researchers.

While we aim to cover a broad set of scientific fields and world events in the dataset, the dataset primarily covers the fields of materials science, geospatial analysis, and computer science; and world events including flight incidents, the Oscars, and Olympiads.
We acknowledge that many scientific fields and geographic areas may not be well covered.

**Note**: LiveDRBench does not contain links to external data sources. LiveDRBench includes data from an existing scientific dataset, [Curie](https://github.com/google/curie). All queries are answerable using publicly available information.

## Best Practices

Best performance can be achieved by connecting an API key directly to the codebase. LiveDRBench should not be the only measure of the performance of a DR model. Additional methods specific to the model's use case should also be used to determine its overall performance.

We strongly encourage users to use LLMs that support robust Responsible AI mitigations, such as Azure OpenAI (AOAI) services. Such services continually update their safety and RAI mitigations with the latest industry standards for responsible use. For more on AOAI’s best practices when employing foundation models for scripts and applications, see:

- [Blog post on responsible AI features in AOAI that were presented at Ignite 2023](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-new-ai-safety-amp-responsible-ai-features-in-azure/ba-p/3983686)
- [Overview of Responsible AI practices for Azure OpenAI models](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/overview)
- [Azure OpenAI Transparency Note](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/transparency-note)
- [OpenAI’s Usage policies](https://openai.com/policies/usage-policies)
- [Azure OpenAI’s Code of Conduct](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/code-of-conduct)

Users are reminded to be mindful of data privacy concerns and are encouraged to review the privacy policies associated with any models and data storage solutions interfacing with LiveDRBench.
It is the user’s responsibility to ensure that the use of the LiveDRBench repo and dataset complies with relevant data protection regulations and organizational guidelines.

## License

Code in this GitHub repository is licensed under the [MIT License](https://github.com/microsoft/livedrbench/blob/main/LICENSE). The LiveDRBench dataset is released under the CDLA v2 license.

## Contact

If you have suggestions or questions, please raise an issue on GitHub or contact us at amshar@microsoft.com.

## Citing LiveDRBench

    @inproceedings{livedrbench2025,
      title={Characterizing Deep Research: A Benchmark and Formal Definition},
      author={Java, Abhinav and Khandelwal, Ashmit and Midigeshi, Sukruta and Halfaker, Aaron and Deshpande, Amit and Goyal, Navin and Gupta, Ankur and Natarajan, Nagarajan and Sharma, Amit},
      booktitle={arXiv preprint arXiv:2508.04183},
      year={2025}
    }
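
As a supplement to the predictions schema given under Usage, the following is a minimal sketch of assembling and writing a predictions file that could then be passed to `--preds_file`. The helper names, the example keys, and the shape of the individual predictions are all hypothetical; real keys come from the benchmark data, and each question specifies its own prediction format. The type check only mirrors the outer shape `List[List[dict | str] | dict]` and is not the repository's validation logic.

```python
import json
import tempfile
from pathlib import Path


def make_entry(key, preds):
    """Build one prediction entry after a shallow shape check.

    Assumption: `preds` is a list whose items are either dicts or
    lists of dicts/strings, mirroring the schema in the card.
    """
    if not isinstance(key, str):
        raise TypeError("key must be a string")
    if not isinstance(preds, list):
        raise TypeError("preds must be a list")
    for item in preds:
        if isinstance(item, dict):
            continue
        if isinstance(item, list) and all(isinstance(x, (dict, str)) for x in item):
            continue
        raise TypeError("each prediction must be a dict or a list of dicts/strings")
    return {"key": key, "preds": preds}


def write_predictions(entries, path):
    """Serialize all entries as one JSON array."""
    Path(path).write_text(json.dumps(entries, indent=2), encoding="utf-8")


# Hypothetical keys and predictions, for illustration only.
entries = [
    make_entry("example-key-1", [{"name": "Example entity"}]),
    make_entry("example-key-2", [["first claim", "second claim"]]),
]

with tempfile.TemporaryDirectory() as tmp:
    preds_path = Path(tmp) / "predictions.json"
    write_predictions(entries, preds_path)
    reloaded = json.loads(preds_path.read_text(encoding="utf-8"))
    print(len(reloaded), "entries written")
```

Checking the outer shape before writing catches malformed entries early, before an evaluation run that calls a judge model is started.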

Provider:
maas
Created:
2025-08-09