
BLUR

ModelScope community · updated 2025-12-05 · indexed 2025-05-24
Download link:
https://modelscope.cn/datasets/PatronusAI/BLUR
Description:
# Browsing Lost Unformed Recollections

![alt text](BLUR_title.png)

The leaderboard can be found at [https://huggingface.co/spaces/PatronusAI/BLUR-leaderboard](https://huggingface.co/spaces/PatronusAI/BLUR-leaderboard).

If you use this dataset or find it helpful in your research, please cite our paper:

Paper Link: [arXiv](https://arxiv.org/abs/2503.19193)

```
@misc{chwang2025blur,
  title = {Browsing {Lost} {Unformed} {Recollections}: {A} {Benchmark} for {Tip}-of-the-{Tongue} {Search} and {Reasoning}},
  shorttitle = {Browsing {Lost} {Unformed} {Recollections}},
  url = {http://arxiv.org/abs/2503.19193},
  doi = {10.48550/arXiv.2503.19193},
  abstract = {We introduce Browsing Lost Unformed Recollections, a tip-of-the-tongue known-item search and reasoning benchmark for general AI assistants. BLUR introduces a set of 573 real-world validated questions that demand searching and reasoning across multi-modal and multilingual inputs, as well as proficient tool use, in order to excel on. Humans easily ace these questions (scoring on average 98\%), while the best-performing system scores around 56\%. To facilitate progress toward addressing this challenging and aspirational use case for general AI assistants, we release 350 questions through a public leaderboard, retain the answers to 250 of them, and have the rest as a private test set.},
  urldate = {2025-03-26},
  publisher = {arXiv},
  author = {CH-Wang, Sky and Deshpande, Darshan and Muresan, Smaranda and Kannappan, Anand and Qian, Rebecca},
  month = mar,
  year = {2025},
  note = {arXiv:2503.19193 [cs]},
  keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Information Retrieval, Computer Science - Multiagent Systems},
}
```

## Task Description

Have you ever been at a loss for words, where you know something exists and can describe it, but don't know the exact or key phrase to search for on Google?
The **BLUR (Browsing Lost Unformed Recollections)** benchmark is an aspirational, search-based benchmark for general AI assistants. It poses a series of challenges that recreate the experience of searching for information when you have only a vague or incomplete recollection of the target. It focuses on tasks where the user has to piece together fragmented memories, descriptions, or related concepts to find the correct answer, much like trying to recall a word, phrase, or idea when it's on the tip of your tongue.

The benchmark evaluates systems in a zero-shot manner based on their ability to:

- Handle Ambiguity: Recognize and interpret partial or unclear user input to generate relevant search results or answers.
- Contextual Matching: Infer the correct answer from disjointed descriptions and provide responses that align with the user's intended, though imprecisely described, goal.
- Reason with Tools: Make and reason over multiple calls to the system's suite of tools to gather and synthesize scattered information into coherent, contextually accurate conclusions. The benchmark measures how well a system can piece together incomplete data by reasoning across its toolset.
- Multimodal Reasoning: Interpret and integrate input from different modalities (e.g., text, images, audio) to form a more complete understanding of the user's needs. This challenge tests the system's ability to combine and reason over multiple types of content so that its searches retrieve the most relevant results.

By addressing these challenges, the BLUR benchmark seeks to assess how well AI systems can support users who are searching based on intuition and fragmented memories rather than precise keywords or phrases. The benchmark aims to foster the development of more intuitive, more realistic, and more memory-aided search technologies that accommodate the natural imperfections of human memory.

## Example Queries

I am trying to remember the title of a book I once read.
It was published in 2017 and its cover had the image of a snowman looking over some mountains. It had something to do with the search for knowledge. I cannot remember the author of the book but he was a co-author of another book, Conversaciones para Triunfar. What is the title of the book I am looking for?

I visited a bank in Ibadan with a friend, but I can’t recall its name or location. It was my first time in Ibadan, so my memory of the place is a bit vague. However, I remember taking a picture of an attractive building located opposite the bank. I’ll attach the picture, can you help me identify the bank and its address? (Image not shown here for display purposes)

## Dataset Structure

| Metadata | Description |
|-------------------------|-------------------------------------------------------------------------------|
| **query** | This field contains the primary query. |
| **file** | This field specifies the filepath if a file is attached to the query. |
| **scaffolded_query** | This field includes the query wrapped in a consistent prompt scaffold for evaluation. |
| **answer** | This field provides the final answer to the query; it is only populated in the validation set. |
| **domain** | This field indicates the domain category to which the query belongs. |
| **difficulty** | This field represents the level of complexity or difficulty assigned to the query. |

License: This dataset is distributed under the MIT License.

## The Benchmark

To construct this dataset, we invited annotators to reflect on recent or current instances where they struggled to recall the name of something. Annotators were asked to describe everything they could recall about the item in question, framing it as a prompt they might use to seek help online or from a friend. Annotators were also given the option to upload a file in addition to providing a text input if they wished. We then tasked the writer with locating the item whose name they struggled to recall.
Separately, a different annotator (the validator) was challenged to identify the item based solely on the original description prompt provided by the writer. Both annotators were given access to a web browser and documented their search process step by step. If the validator's answer matched the writer's, the prompt was included in the final dataset. Otherwise, we presented the validator with the correct answer along with the writer's search steps and evaluated *post hoc agreement*: whether the validator acknowledged their error and could clearly articulate their mistake. If post hoc agreement was achieved, the prompt was included in the final dataset; otherwise, it was discarded. Finally, prompts were minimally edited to standardize formatting and correct typos.

**Unambiguous Answers.** Most of the effort in this two-stage dataset creation process (prompt writing followed by validation) went into ensuring that the prompts were unambiguous, meaning they led to a single, correct answer. In doing so, we deliberately avoided adversarial dataset construction, as it not only obscures the specific abilities benchmarks aim to measure but also undermines ecological validity.

**Multimodal and Multilingual.** While annotators were instructed to write their prompts in English, no language restrictions were placed on the details of the items remembered. Approximately 30% of our dataset is multilingual. This includes cases where descriptions are written in other languages or where the descriptions are in one language but the item itself belongs to a different language. Similarly, 35% of our dataset is multimodal on input, featuring prompts accompanied by file attachments rather than being exclusively text-based. Files included sketches of the items recalled, similar images found online, video and audio files in which the item appeared, and more.
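
As a rough illustration of how the multimodal share could be checked against a local copy of the data, the sketch below types a record using the field names from the Dataset Structure table and counts the rows with an attachment. It assumes that `file` is `None` or an empty string when nothing is attached; the README does not say how missing attachments are encoded. The `BlurRecord` type, the `multimodal_share` helper, and the two toy rows are all hypothetical and not part of the dataset.

```python
from typing import Optional, TypedDict


class BlurRecord(TypedDict):
    """One row, using the field names from the Dataset Structure table."""

    query: str
    file: Optional[str]    # assumption: None or "" when nothing is attached
    scaffolded_query: str
    answer: Optional[str]  # only populated in the validation set
    domain: str
    difficulty: str


def multimodal_share(records: list[BlurRecord]) -> float:
    """Return the fraction of records whose `file` field names an attachment."""
    if not records:
        return 0.0
    with_file = sum(1 for record in records if record.get("file"))
    return with_file / len(records)


# Hypothetical toy rows for illustration only; not taken from the dataset.
rows: list[BlurRecord] = [
    {
        "query": "Which book had a snowman on its cover?",
        "file": None,
        "scaffolded_query": "<scaffold> Which book had a snowman on its cover?",
        "answer": None,
        "domain": "books",
        "difficulty": "easy",
    },
    {
        "query": "Which bank is opposite the building in this picture?",
        "file": "attachments/building.jpg",
        "scaffolded_query": "<scaffold> Which bank is opposite the building?",
        "answer": None,
        "domain": "places",
        "difficulty": "hard",
    },
]

print(f"multimodal share: {multimodal_share(rows):.0%}")
```
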
Note that, in addition to the explicit multimodal understanding required to process file inputs, a majority of queries in BLUR also require reasoning over multimodal sources of information encountered in web searches (images, videos, and more) despite being only text-based.

**Ease of Use.** Answers in the dataset are concise, consisting of a single string that can be evaluated for correctness using a weak string match. Prompts are designed for zero-shot answering and evaluation, structured within a question scaffold that constrains the date and time range as well as the answer format.

**Difficulty.** The time validators took to answer these queries naturally serves as a proxy for their difficulty level for humans. Based on these times, we divided the dataset into three difficulty levels: easy, for questions resolved in under 10 minutes; medium, for those requiring 10 to 20 minutes; and hard, for those taking over 20 minutes to answer.
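
The two conventions above can be sketched in a few lines of Python. The README does not define "weak string match", so the normalization used here (Unicode NFKC, case folding, punctuation stripped, substring containment) is an assumption rather than the official scorer, and the function names are hypothetical. The difficulty thresholds follow the text, with exactly 20 minutes read as medium.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Case-fold, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # \w is Unicode-aware, so non-Latin letters survive
    return " ".join(text.split())


def weak_match(prediction: str, answer: str) -> bool:
    """Assumed scoring rule: the normalized answer occurs inside the normalized prediction."""
    gold = normalize(answer)
    return bool(gold) and gold in normalize(prediction)


def difficulty_level(minutes: float) -> str:
    """Map a validator's solve time to the three levels described in the text."""
    if minutes < 10:
        return "easy"
    if minutes <= 20:
        return "medium"
    return "hard"


# Hypothetical prediction and answer strings, for illustration only.
print(weak_match("I think it is Snow Man.", "snow-man"), difficulty_level(12))
```

A containment check is lenient by design: it accepts a correct answer embedded in a longer sentence, at the cost of also accepting responses that list several candidates. A stricter scorer could compare the normalized strings for equality instead.
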

Provider: maas
Created: 2025-05-20