
mrcr

ModelScope community. Updated: 2026-05-23. Indexed: 2025-04-19.
Download link:
https://modelscope.cn/datasets/openai-mirror/mrcr
Resource description:
# OpenAI MRCR: Long context multiple needle in a haystack benchmark

OpenAI MRCR (Multi-round co-reference resolution) is a long context dataset for benchmarking an LLM's ability to distinguish between multiple needles hidden in context. This eval is inspired by the MRCR eval first introduced by Gemini (https://arxiv.org/pdf/2409.12640v2). OpenAI MRCR increases the task's difficulty and provides open-source data for reproducing results.

The task is as follows: the model is given a long, multi-turn, synthetically generated conversation between user and model, in which the user asks for a piece of writing about a topic, e.g. "write a poem about tapirs" or "write a blog post about rocks". Hidden in this conversation are 2, 4, or 8 identical asks, and the model is ultimately prompted to return the i-th instance of one of those asks. For example, "Return the 2nd poem about tapirs".

### Example conversation for 2 needle case:

```
User: Write a poem about tapirs
Assistant: (first poem about tapirs)
User: Write a blog post about rocks
Assistant: (first blog post about rocks)
User: Write a poem about tapirs
Assistant: (second poem about tapirs)
User: Write a social media post about tapirs
Assistant: (first social media post about tapirs)
User: Write a blog post about rocks
Assistant: (second blog post about rocks)

User: Prepend aYooSG8CQg to the 2nd (1 indexed) poem about tapirs. Do not include any other text in your response.
Assistant: aYooSG8CQg(2nd poem about tapirs)
```

This eval is challenging because:

- The needles are selected from the same distribution as the distractors. All assistant responses are generated by GPT-4o, so the needle blends in with the haystack.
- The model must distinguish order amongst the needles.
- The more needles, the harder the task.
- The longer the context, the harder the task.

### Implementation details

- The measured metric is the SequenceMatcher ratio as implemented in https://docs.python.org/3/library/difflib.html.
- The model must prepend an alphanumeric hash to the beginning of its answer. If this hash is not included, the match ratio is set to 0. If it is correctly included, the stripped sampled answer is compared to the stripped ground truth answer.
- There are 438 distinct entities and 10 distinct writing formats.
- There are 100 samples per bin.
- Bins are determined by the number of tokens used by prompt + answer in the sample.
- Bin boundaries are: [4096, 8192], (8192, 16384], (16384, 32768], (32768, 65536], (65536, 131072], (131072, 262144], (262144, 524288], (524288, 1048576]

# Results

See OpenAI's blog post (https://openai.com/index/gpt-4-1/) for full results on this benchmark.

# How to run

Below is a code snippet for running and grading this task:

```python
from huggingface_hub import hf_hub_download
import pandas as pd
from openai import OpenAI
import json
from difflib import SequenceMatcher
import tiktoken

# Set accordingly
MAX_CONTEXT_WINDOW = 1000000
MODEL = "gpt-4.1"

# Load both parquet shards of the 2-needle split into one DataFrame.
dataset = pd.concat([pd.read_parquet(
    hf_hub_download(repo_id="openai/mrcr", filename="2needle/2needle_0.parquet", repo_type="dataset")
), pd.read_parquet(
    hf_hub_download(repo_id="openai/mrcr", filename="2needle/2needle_1.parquet", repo_type="dataset")
)])
client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")

def grade(response, answer, random_string_to_prepend) -> float:
    """
    Compare response and answer.
    """
    # A response that does not start with the required hash scores 0.
    if not response.startswith(random_string_to_prepend):
        return 0
    response = response.removeprefix(random_string_to_prepend)
    answer = answer.removeprefix(random_string_to_prepend)
    return float(SequenceMatcher(None, response, answer).ratio())

def n_tokens(messages : list[dict]) -> int:
    """
    Count tokens in messages.
    """
    return sum([len(enc.encode(m["content"])) for m in messages])

for index, row in dataset.iterrows():
    messages = json.loads(row["prompt"])
    # Skip samples whose prompt does not fit in the model's context window.
    if n_tokens(messages) > MAX_CONTEXT_WINDOW:
        continue
    completion = client.chat.completions.create(
        model=MODEL,
        messages=messages,
    )
    response = completion.choices[0].message.content
    print(grade(response, row["answer"], row["random_string_to_prepend"]))
```

# Changelog

- 4/12/2025: Initial dataset published
- 12/5/2025: Bugfix: A bug during generation caused ~10% of datapoints to contain too many target needles, and ~5% of datapoints to contain incorrect ground truth. New versions of the incorrect datapoints were uploaded; a "date_added" field is included to indicate which datapoints were modified. Credit to @[dillonu](https://github.com/Dillonu) for discovery of consistently failing samples.
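
As a supplement to the bin boundaries listed under "Implementation details", the following is a minimal sketch of how a sample could be assigned to a bin and how scores could be summarized per bin. It is not part of the benchmark's published code: the names `bin_index` and `mean_score_per_bin` are hypothetical, the token count for prompt + answer is assumed to be computed beforehand (for instance with a tokenizer such as the one used in `n_tokens`), and averaging scores within a bin is an assumption about reporting that the source does not prescribe.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Upper edges of the eight bins from "Implementation details".
# The first bin is closed on both sides, [4096, 8192]; the rest are (lo, hi].
BIN_UPPER_EDGES = [8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576]
MIN_TOKENS = 4096


def bin_index(total_tokens: int) -> Optional[int]:
    """Return the 0-based bin for a prompt + answer token count, or None if out of range."""
    if total_tokens < MIN_TOKENS or total_tokens > BIN_UPPER_EDGES[-1]:
        return None
    # bisect_left returns the position of the first edge >= total_tokens, so a
    # count exactly equal to an upper edge stays in the lower bin, as (lo, hi] requires.
    return bisect_left(BIN_UPPER_EDGES, total_tokens)


def mean_score_per_bin(samples: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average scores per bin (hypothetical reporting choice, not from the source).

    `samples` yields (total_tokens, score) pairs; out-of-range samples are skipped.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for total_tokens, score in samples:
        idx = bin_index(total_tokens)
        if idx is None:
            continue
        totals[idx] += score
        counts[idx] += 1
    return {idx: totals[idx] / counts[idx] for idx in sorted(counts)}
```

For example, `mean_score_per_bin([(5000, 1.0), (6000, 0.5), (20000, 0.25)])` groups the first two samples into bin 0 and the third into bin 2.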

Provider:
maas
Created:
2025-04-22