florin-hf/wiki_dump2018_nq_open
收藏Wikipedia Dump with Gold Documents from Natural Questions
数据集概述
该数据集结合了2018年12月20日的英文维基百科转储和来自Natural Questions数据集的黄金段落,特别针对开放域问答任务进行了定制。通过整合与NQ-open数据集中每个查询相对应的黄金文档,该资源解决了维基百科转储与NQ-open中的问答对之间可能存在的不匹配问题。这种不匹配可能导致转储中不包含所需的答案。通过应用彻底的重复过滤过程,确保了每个查询的黄金文档的精确识别,从而提高了数据集在自然语言处理任务中的可靠性。
因此,该数据集可以作为RAG系统的知识库。在数据集准备的一个关键方面,涉及解决大型语言模型(LLMs)关于输入大小的限制。LLMs在处理单个提示中的多个文档时,面临输入长度的限制。为了适应这一点,超过512个令牌(使用Llama2进行标记化)的黄金文档被排除在数据集之外。这一决定旨在最大化可以包含在LLM提示中的文档数量,而不影响每个文档提供的细节或上下文。最终,该数据集包含21,035,236个文档(13.9 GB)。
数据集来源
-
原始维基百科转储:语料库源自英文维基百科转储,文章被分割成100个单词的非重叠段落。下载链接。
-
黄金段落:源自Natural Questions数据集,这些段落被整合以提供全面的问答资源。黄金段落可通过以下URL访问:
上述数据来自Dense Passage Retrieval (DPR)的GitHub仓库。
数据集结构
维基百科段落的示例如下: json { "text": "Home computers were a class of microcomputers entering the market in 1977, and becoming common during the 1980s. They were marketed to consumers as affordable and accessible computers that, for the first time, were intended for the use of a single nontechnical user. These computers were a distinct market segment that typically cost much less than business, scientific or engineering-oriented computers of the time such as the IBM PC, and were generally less powerful in terms of memory and expandability. However, a home computer often had better graphics and sound than contemporary business computers. Their most common uses were playing", "title": "Home computer" }




