giantfish-fly/coreference-challenge
Hugging Face, updated 2025-10-21, indexed 2025-10-25
Download link:
https://hf-mirror.com/datasets/giantfish-fly/coreference-challenge
Resource description:
The PI-LLM dataset is designed to evaluate how well Large Language Models (LLMs) handle context interference during retrieval. Each sample presents a series of key-value updates and asks the model to retrieve the last updated value for a key; the number of updates is increased to test how well the model copes with interference from earlier values. The README describes two configurations: core, with randomized updates, and sequential_additional, with non-randomized, sequential updates. The dataset is used in the PI-LLM Bench, which has been integrated into Moonshot AI's internal benchmarking framework. The README mentions that the dataset and its evaluation methods are inspired by cognitive science, particularly the study of human working memory. It also notes that the dataset has been used to test various SOTA LLMs, including GPT-5, Grok-4, and Gemini, revealing consistent failures in retrieving the last value as the number of updates increases.
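The retrieval task described above can be illustrated with a small Python sketch. It is not the dataset's actual schema or the benchmark's scoring code: the function names (`make_updates`, `last_values`, `format_prompt`, `accuracy`), the `key: value` prompt layout, and the way the two orderings are generated are all assumptions made for illustration, loosely mirroring the randomized and sequential configurations mentioned in the README.

```python
import random


def make_updates(keys, updates_per_key, randomized=True, seed=0):
    """Build a stream of (key, value) updates (hypothetical format).

    randomized=True shuffles the whole stream so keys interleave;
    randomized=False keeps each key's updates in one ordered block.
    Both orderings are assumptions, not the dataset's real layout.
    """
    updates = [(k, f"{k}_v{i}") for k in keys for i in range(updates_per_key)]
    if randomized:
        random.Random(seed).shuffle(updates)
    return updates


def last_values(updates):
    """Ground truth: the value from the final update seen for each key."""
    truth = {}
    for key, value in updates:
        truth[key] = value  # later updates overwrite earlier ones
    return truth


def format_prompt(updates):
    """Render the stream as text followed by a retrieval question."""
    lines = [f"{key}: {value}" for key, value in updates]
    lines.append("What is the current value of each key?")
    return "\n".join(lines)


def accuracy(predicted, truth):
    """Fraction of keys whose predicted value equals the last update."""
    correct = sum(predicted.get(key) == value for key, value in truth.items())
    return correct / len(truth)


if __name__ == "__main__":
    stream = make_updates(["apple", "pear"], updates_per_key=4)
    print(format_prompt(stream))
    print(last_values(stream))
```

Raising `updates_per_key` lengthens the stream of stale values that precede each final value, which is the variable the description says drives retrieval failures.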
Provided by:
giantfish-fly



