florin-hf/wiki_dump2018_no_duplicates
Source: Hugging Face. Updated 2024-07-03; indexed 2024-07-06.
Download link:
https://hf-mirror.com/datasets/florin-hf/wiki_dump2018_no_duplicates
Description:
This is a cleaned and de-duplicated version of the English Wikipedia dump dated December 20, 2018. The passages were originally sourced from the DPR repository and then processed to remove duplicates, leaving 20,970,784 passages of 100 words each. The dataset supports experiments comparing base and instruct Large Language Models in Retrieval-Augmented Generation (RAG) systems.
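
To make the description concrete, here is a minimal Python sketch of what "100-word passages with duplicates removed" can look like. The listing does not say how the splitting or de-duplication was actually done, so the function names, the whitespace-based word split, and the exact-match duplicate check are all assumptions made for illustration, not the dataset author's procedure.

```python
from typing import Iterable, Iterator, List


def split_into_passages(text: str, words_per_passage: int = 100) -> List[str]:
    """Split text into consecutive passages of at most `words_per_passage` words.

    Assumption: words are separated by whitespace. The last passage may be
    shorter than `words_per_passage` when the text length is not a multiple of it.
    """
    words = text.split()
    return [
        " ".join(words[i : i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


def drop_exact_duplicates(passages: Iterable[str]) -> Iterator[str]:
    """Yield each passage once, keeping the first occurrence.

    Assumption: two passages count as duplicates only when their text is
    identical. The real dataset may have used a different criterion.
    """
    seen = set()
    for passage in passages:
        if passage not in seen:
            seen.add(passage)
            yield passage
```

Calling `split_into_passages` on an article's text and passing the combined results through `drop_exact_duplicates` yields a passage list in the same spirit as the one described above. At the scale of roughly 21 million passages, storing a hash of each passage in `seen` rather than the full text would likely be a more memory-friendly choice.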
Provider:
florin-hf



