NarrativeXL
Source: arXiv | Updated: 2023-12-08 | Indexed: 2024-06-21
Download link:
https://github.com/r-seny/NarrativeXL
Overview:
NarrativeXL is a large-scale reading comprehension dataset of nearly one million questions over documents that average more than 50,000 words. It is intended for training and evaluating long-term memory models. The authors used GPT-3.5 to summarize every scene in 1,500 hand-curated novels from Project Gutenberg, which yields roughly 150 scene-level summaries per book. From these summaries they built several kinds of reading comprehension questions: three types of multiple-choice scene recognition questions and free-form narrative reconstruction questions. A key feature is that most questions have a known "retention demand", which indicates how long a memory must be retained to answer them and so helps in evaluating long-term memory performance. The dataset also ships with code for extending it further with minimal human labor. It is suited to developing and evaluating language models that must handle extremely long contexts, and it targets the performance degradation that existing models show on long texts.
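One way to use the retention demand annotation is to break evaluation accuracy down by how long a memory has to be held. The sketch below is illustrative only: the record keys `retention_demand` and `correct`, the unit of the demand value, and the bucket size are all assumptions made for this example, not the dataset's actual schema.

```python
from collections import defaultdict


def accuracy_by_retention_demand(records, bucket_size=10_000):
    """Return accuracy per retention-demand bucket.

    Each record is a dict with two hypothetical keys (not the real schema):
      "retention_demand": non-negative int, assumed here to be a word count
      "correct": bool, whether the model answered the question correctly
    Buckets are labeled by their lower bound, e.g. 20000 covers 20000..29999.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        bucket = (rec["retention_demand"] // bucket_size) * bucket_size
        totals[bucket] += 1
        hits[bucket] += int(rec["correct"])
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}


# Made-up graded answers, for illustration only.
records = [
    {"retention_demand": 1_200, "correct": True},
    {"retention_demand": 8_500, "correct": True},
    {"retention_demand": 23_000, "correct": False},
    {"retention_demand": 27_400, "correct": True},
]
print(accuracy_by_retention_demand(records))  # {0: 1.0, 20000: 0.5}
```

If accuracy falls as the bucket's lower bound rises, that would suggest the model struggles more with questions that require longer retention.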
Provided by:
Santa Fe Institute
Created:
2023-05-23
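Summary
Background and Challenges
Background Overview
NarrativeXL is a large-scale, very-long-context reading comprehension dataset containing 990,595 questions over documents that average more than 50,000 words. It is used to evaluate long-term memory models. Each question is annotated with a "retention demand" measure that helps diagnose a model's memory capacity, and the questions are challenging for modern language models. The data is produced by an automated pipeline, which supports extending the dataset and reproducing the reported results.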
The summary above was collected and generated by 遇见数据集.



