
MR-NIAH

ModelScope community, updated 2026-01-06, indexed 2025-03-15
Download link:
https://modelscope.cn/datasets/MiniMax/MR-NIAH
Resource description:
# Multi-Round Needles-In-A-Haystack (MR-NIAH) Evaluation

## Overview

Multi-Round Needles-In-A-Haystack (MR-NIAH) is an evaluation framework designed to assess long-context retrieval performance in large language models (LLMs). It serves as a crucial benchmark for retrieval tasks in long multi-turn dialogue contexts, revealing fundamental capabilities necessary for building lifelong companion AI assistants.

MR-NIAH extends the vanilla k-M NIAH (Kamradt, 2023) by creating a more challenging variation specifically tailored to evaluate a model's ability to recall information from earlier parts of a conversation across multiple dialogue rounds.

## Motivation

As LLMs are increasingly deployed in applications requiring long-term memory and contextual understanding across extended conversations, the ability to accurately retrieve specific information from earlier dialogue becomes critical. MR-NIAH addresses this need by providing a rigorous evaluation framework that:

1. Tests a model's ability to recall specific information from earlier in a conversation
2. Evaluates performance across varying context lengths (from 2K to 1M tokens)
3. Assesses recall accuracy at different positions within the conversation (25%, 50%, and 75%)
4. Provides a standardized benchmark for comparing different models and retrieval strategies

## Methodology

### Dataset Construction

MR-NIAH constructs "haystacks" as history dialogues, where:

1. User queries are synthetic but explicit requests for event descriptions and creative writing
2. Each query and its corresponding response are injected at specific positions (25%, 50%, and 75%) of the conversation
3. In the final round, the user requests the model to repeat a specific response from one of the earlier requests
4. The haystacks span from 2K to 1M tokens (up to approximately 2000 interactions)

### Evaluation Metrics

The evaluation focuses on the model's ability to accurately recall the requested information. Each ground truth response contains three core components, and the evaluation measures an adjusted recall score based on the model's ability to reproduce these components.

The scoring is implemented in `score.py`, which:

1. Processes model responses
2. Compares them against ground truth responses
3. Calculates an adjusted recall score based on the presence of key components

## Dataset Structure

The dataset is organized by language and token length:

```
data/
├── english/
│   ├── 2048_tokens.jsonl
│   ├── 10240_tokens.jsonl
│   ├── ...
│   └── 1024000_tokens.jsonl
└── chinese/
    ├── 2048_tokens.jsonl
    ├── 10240_tokens.jsonl
    ├── ...
    └── 1024000_tokens.jsonl
```

Each JSONL file contains evaluation examples with the following structure:

```json
{
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."},
    ...
    {"role": "user", "content": "Please repeat the [specific content] you mentioned earlier"}
  ],
  "label": "The expected response that should be recalled",
  "length_class": 2048
}
```

## Usage

### Running Evaluations

Please refer to our GitHub page: https://github.com/MiniMax-AI/MiniMax-01/tree/main/evaluation/MR-NIAH.

### Interpreting Results

The evaluation produces scores that indicate:

- Overall recall performance across different context lengths
- Performance at different injection points (25%, 50%, 75%)
- Comparative performance against other models

## License

This evaluation framework is released under the same license as the MiniMax-01 repository.
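As a minimal sketch of how the record layout described under Dataset Structure might be read, the snippet below parses one JSONL file and separates the haystack dialogue from the final recall request. Only the field names `messages`, `label`, and `length_class` come from the schema above; the helper names `load_examples` and `split_history` are hypothetical and are not part of the MR-NIAH repository.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Iterator


def load_examples(path: str) -> Iterator[dict]:
    """Yield one example per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)


def split_history(example: dict) -> tuple[list[dict], str]:
    """Separate the haystack dialogue from the final user request.

    Assumes, as in the schema above, that the last message is the
    user's request to repeat an earlier response.
    """
    messages = example["messages"]
    final = messages[-1]
    if final["role"] != "user":
        raise ValueError("expected the last message to be the user's recall request")
    return messages[:-1], final["content"]
```

A typical call would be `for ex in load_examples("data/english/2048_tokens.jsonl"): history, request = split_history(ex)`, with `ex["label"]` holding the response the model is expected to reproduce.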

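The component-based scoring described under Evaluation Metrics could be approximated along the following lines. This is an illustrative sketch only, not the logic of `score.py`: it assumes the core components of a ground truth response are already available as a list of strings, and it computes plain recall (the fraction of components present in the response) without the adjustment that `score.py` applies, which this page does not specify. The names `normalize` and `component_recall`, and the sample strings, are hypothetical.

```python
from __future__ import annotations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores formatting."""
    return " ".join(text.lower().split())


def component_recall(response: str, components: list[str]) -> float:
    """Return the fraction of components found verbatim in the response."""
    if not components:
        raise ValueError("at least one component is required")
    haystack = normalize(response)
    hits = sum(1 for c in components if normalize(c) in haystack)
    return hits / len(components)


# Made-up components and response, for illustration only.
components = ["red lantern", "silver river", "quiet harbor"]
response = "The poem mentioned a Red  Lantern and a quiet harbor."
print(component_recall(response, components))
```

Here two of the three made-up components appear in the response, so the sketch reports a recall of two thirds. Substring matching is a deliberately simple choice; for actual benchmark numbers, use the scoring script in the linked repository.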
Provider:
maas
Created:
2025-03-11