MemoryAgentBench
Source: ModelScope community. Updated 2026-01-06; indexed 2025-11-03.
Download link: https://modelscope.cn/datasets/AI-ModelScope/MemoryAgentBench
# 🚧 Update
- [x] (Sep 29th, 2025) We updated our paper and removed some inefficient, high-cost samples. We also added a sub-sample of DetectiveQA.
- [x] (July 7th, 2025) We released the initial version of our datasets.
- [x] (July 22nd, 2025) We modified the datasets slightly, adding keypoints to the LRU tasks and renaming `uuid` to `qa_pair_ids`. The `question_ids` field is used only in the Longmemeval task.
- [x] (July 26th, 2025) We fixed a bug in `qa_pair_ids`.
- [x] (Aug 5th, 2025) We removed `ruler_niah` and some other datasets not used in the main experiments. We will release a subset for ablation studies in the future.
# ⚙️ MemoryAgentBench: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
This repository contains the MemoryAgentBench dataset, designed for evaluating the memory capabilities of LLM agents.
📄 Paper: https://arxiv.org/pdf/2507.05257
💻 Code: https://github.com/HUST-AI-HYZ/MemoryAgentBench
MemoryAgentBench is a unified benchmark framework for comprehensively evaluating the memory capabilities of LLM agents. It tests four core competencies (Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Conflict Resolution) through incremental multi-turn interactions, exposing the limitations of current memory agents and comparing performance across them.
## Four Core Competencies for Evaluation
What capabilities does AI need to truly "remember"? We argue that merely storing and retrieving information is far from sufficient. The memory system needs to possess four key competencies:
### 1. Accurate Retrieval (AR)
This is the most fundamental capability: precisely **locating required information** in long dialogue histories. For instance, after hours of conversation with an AI, can it quickly and accurately find a detail mentioned three hours earlier? This requires not only single-hop retrieval but also multi-hop reasoning.
### 2. Test-Time Learning (TTL)
Truly intelligent systems should be able to continuously **learn new skills during interactions**. For example, if you teach an AI a new classification method through a few examples, can it flexibly apply this in subsequent conversations? This "learning-while-using" capability is crucial for building adaptive AI.
### 3. Long-Range Understanding (LRU)
Unlike fragmented information retrieval, long-range understanding requires AI to form **global cognition**. After reading a novel, you remember not only specific plot points but also the overall narrative and the relationships between characters. In the same way, AI needs to abstract a high-level understanding from long conversations.
### 4. Conflict Resolution (CR)
Information in the real world is dynamic. When users say "I changed jobs" or "this theory has been disproven," AI must **identify and update** outdated information rather than simply accumulating old and new knowledge.
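The update-rather-than-accumulate behavior described above can be illustrated with a minimal sketch. It assumes that facts arrive as (subject, relation, value) triples tagged with a serial number, and that a newer fact overrides an older one for the same subject and relation. The `FactMemory` class and its methods are hypothetical names chosen for illustration; they are not part of the benchmark's code.

```python
from typing import Dict, Optional, Tuple


class FactMemory:
    """Toy memory that keeps only the most recent value per (subject, relation)."""

    def __init__(self) -> None:
        # Maps (subject, relation) -> (serial number, value).
        self._facts: Dict[Tuple[str, str], Tuple[int, str]] = {}

    def update(self, serial: int, subject: str, relation: str, value: str) -> None:
        key = (subject, relation)
        current = self._facts.get(key)
        # Overwrite only when the incoming fact is newer than the stored one.
        if current is None or serial > current[0]:
            self._facts[key] = (serial, value)

    def lookup(self, subject: str, relation: str) -> Optional[str]:
        entry = self._facts.get((subject, relation))
        return entry[1] if entry is not None else None


memory = FactMemory()
memory.update(1, "user", "employer", "Acme")
memory.update(2, "user", "employer", "Globex")  # the user says "I changed jobs"
print(memory.lookup("user", "employer"))
```

A memory that simply appended both statements would hold two contradictory employers; keying on (subject, relation) forces the newer statement to replace the older one.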
## Careful Dataset Design
From "feeding data" to "simulating real interactions," MemoryAgentBench demonstrates ingenuity in dataset design: The research team both adapted existing datasets and created two new ones. All data is split into chunks to **simulate real multi-turn interaction scenarios**—just like your daily conversations with an AI assistant, where information accumulates gradually rather than being injected all at once.
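As a rough illustration of the chunked, turn-by-turn input described above, the sketch below splits a long context into fixed-size pieces and presents them one per turn. It assumes whitespace-separated words as a stand-in for real tokens; `chunk_text` is a hypothetical helper, not the benchmark's own chunking code.

```python
from typing import Iterator, List


def chunk_text(text: str, chunk_size: int) -> Iterator[str]:
    """Yield consecutive chunks of at most `chunk_size` whitespace-separated words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words: List[str] = text.split()
    for start in range(0, len(words), chunk_size):
        yield " ".join(words[start:start + chunk_size])


long_context = "one two three four five six seven"
# Each chunk plays the role of one conversational turn fed to the agent.
for turn, chunk in enumerate(chunk_text(long_context, chunk_size=3), start=1):
    print(f"turn {turn}: {chunk}")
```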
### 1. Newly Constructed Datasets:
**EventQA:** Requires AI to understand temporal event chains in novels and predict "what happens next".
**FactConsolidation:** Specifically designed to test conflict resolution capabilities, including single-hop and multi-hop difficulty levels.
Notably, the team adopted an **"inject once, query multiple times"** design: one long text corresponds to multiple questions, which significantly improves evaluation efficiency.
### 2. Unified Evaluation Protocol:
Memory Construction Phase → Incremental chunk input → Build/Update memory
Query Execution Phase → Pose questions → Answer based on memory → Evaluate accuracy
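The two phases above can be sketched as a small evaluation loop. This is a minimal sketch under stated assumptions: the `MemoryAgent` interface, the toy `KeywordAgent`, the `evaluate` function, and the substring-match scoring are all hypothetical and only illustrate the flow; the benchmark's actual agents and metrics are defined in its code repository.

```python
import re
from typing import List, Protocol, Sequence, Set, Tuple


class MemoryAgent(Protocol):
    def memorize(self, chunk: str) -> None: ...
    def answer(self, question: str) -> str: ...


def _words(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


class KeywordAgent:
    """Toy agent: stores chunks verbatim and answers with the best-overlapping chunk."""

    def __init__(self) -> None:
        self.chunks: List[str] = []

    def memorize(self, chunk: str) -> None:
        self.chunks.append(chunk)

    def answer(self, question: str) -> str:
        q = _words(question)
        return max(self.chunks, key=lambda c: len(q & _words(c)), default="")


def evaluate(agent: MemoryAgent, chunks: Sequence[str],
             qa_pairs: Sequence[Tuple[str, str]]) -> float:
    # Memory construction phase: feed chunks incrementally.
    for chunk in chunks:
        agent.memorize(chunk)
    # Query execution phase: many questions against one injected context.
    correct = sum(gold.lower() in agent.answer(q).lower() for q, gold in qa_pairs)
    return correct / len(qa_pairs) if qa_pairs else 0.0


chunks = ["Alice moved to Paris in 2020.", "Bob adopted a cat named Miso."]
qa_pairs = [("Where did Alice move?", "Paris"),
            ("What is the name of Bob's cat?", "Miso")]
accuracy = evaluate(KeywordAgent(), chunks, qa_pairs)
print(f"accuracy: {accuracy:.2f}")
```

Because the context is injected once and then queried several times, the cost of memory construction is shared across all questions for that context.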
## Key Findings 🔍
### 1. RAG is Not a Silver Bullet 🎯
RAG shows clear advantages in accurate retrieval tasks: even a simple BM25 retriever significantly outperforms the GPT-4o-mini baseline (100% vs 22.8% on the NIAH-MQ task). However, RAG has a fatal weakness: it performs poorly on tasks requiring global understanding, because it can only retrieve local information fragments.
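To make the "local fragments" point concrete, here is a minimal pure-Python sketch of BM25 top-k retrieval over stored chunks. It assumes the common Okapi BM25 formula with illustrative defaults `k1=1.5` and `b=0.75` and a simple regex tokenizer; `bm25_top_k` is a hypothetical function and is not the retriever used in the paper's experiments.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def bm25_top_k(query: str, chunks: List[str], k: int = 2,
               k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Return the k chunks with the highest BM25 score for the query."""
    docs = [tokenize(c) for c in chunks]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of chunks containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = tokenize(query)
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:k]]


chunks = [
    "The needle is hidden under the blue chair.",
    "The weather was sunny all week.",
    "A chair was delivered on Monday.",
]
top = bm25_top_k("Where is the needle hidden?", chunks, k=1)
print(top)
```

The retriever only ever returns `k` isolated chunks, which suits needle-style questions but gives the downstream model no view of the document as a whole.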
### 2. Long Context ≠ Universal Solution 🔑
Although GPT-4.1-mini supports a context window on the order of one million tokens, it does not achieve top performance across tasks. For instance, it reaches only 45.8% accuracy on ∞Bench-QA, and computational overhead increases linearly with context length.
### 3. Commercial Systems Fall Short of Expectations 😔
Three primary factors lead to poor performance of commercial memory systems. First, severe information loss—Mem0 compresses information by extracting "facts," resulting in substantial context loss. Second, limited retrieval mechanisms—while MemGPT supports multiple retrieval iterations, it lacks temporal and structural metadata. Third, absence of global perspective—these methods cannot reconstruct complete documents, performing particularly poorly on long-range understanding tasks.
### 4. Conflict Resolution Remains Challenging ⚠️
For single-hop conflict resolution, memory agents built with GPT-4o achieve only 60% accuracy. In multi-hop conflict resolution scenarios, all methods achieve single-digit accuracy rates (at most 7%), highlighting this as a critical bottleneck for current memory systems.
### 5. Ablation Studies Reveal Optimization Directions 🔬
**Balancing Chunk Size**: Smaller chunks (512 tokens) benefit accurate retrieval tasks (RULER-QA accuracy reaches 90%), while larger chunks (4096 tokens) better preserve semantic coherence for continuous text understanding. Dynamic chunk size adjustment based on task type is recommended.
**Marginal Effects of Top-K**: Increasing K from 2 to 10 yields significant performance gains for accurate retrieval tasks (BM25 improves from 49.5% to 61%), but shows limited impact on learning tasks, indicating that simply increasing retrieval volume is not a panacea.
**Computational Latency Gaps**: The computational overhead difference between simple and complex systems is staggering—Mem0's memory construction time is 20,000x that of BM25. When using 512 tokens for memory input, Cognee requires 3.3 hours to process a single long-context sample. From a practical deployment perspective, commercial systems must find a balance between performance and efficiency.
## Conclusion 📌
MemoryAgentBench marks significant progress in systematically evaluating LLM memory mechanisms. By comprehensively assessing four core competencies, it reveals for the first time the limitations of current state-of-the-art methods in dynamic memory updates and long-range consistency, and it provides a standardized evaluation framework for building AI agents with genuine memory capabilities. In the future, we will **collect more realistic real-world conversation data** to further enrich the benchmark's diversity and authenticity, and explore comprehensive memory architectures that can balance accurate retrieval, test-time learning, long-range understanding, and conflict resolution.
## Sample Usage
```python
from datasets import load_dataset
# Load the entire dataset
dataset = load_dataset("ai-hyz/MemoryAgentBench")
# Access a specific split, e.g., 'Accurate_Retrieval'
accurate_retrieval_split = dataset["Accurate_Retrieval"]
print(f"Number of examples in Accurate_Retrieval split: {len(accurate_retrieval_split)}")
print(f"First example from Accurate_Retrieval split: {accurate_retrieval_split[0]}")
# Access another split, e.g., 'Test_Time_Learning'
test_time_learning_split = dataset["Test_Time_Learning"]
print(f"Number of examples in Test_Time_Learning split: {len(test_time_learning_split)}")
print(f"First example from Test_Time_Learning split: {test_time_learning_split[0]}")
```
Provider: maas
Created: 2025-10-13