KGText
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/wenhuchen/KGPT
Description:
KGText is a denoised, knowledge-grounded corpus built from Wikipedia. It consists of sentence-subgraph pairs selected by their lexical overlap with knowledge graph entities, and it is intended for pretraining a model that can then be fine-tuned for a range of data-to-text generation tasks. The corpus has been filtered to reduce noise and improve relevance: 7 million "high-quality" sentences were retained from the original 12-million Wikipedia corpus. Its intended task is knowledge-grounded language model pretraining.
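As a rough illustration of what selecting sentence-subgraph pairs by lexical overlap can look like, here is a minimal sketch. It is not the authors' pipeline: the function names (`tokenize`, `overlap_ratio`, `select_pairs`), the triple layout `(subject, predicate, object)`, the scoring rule, and the `0.5` threshold are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_ratio(sentence, subgraph):
    """Fraction of subject/object tokens in the subgraph that occur in the sentence.

    `subgraph` is a list of (subject, predicate, object) string triples.
    This scoring rule is a hypothetical stand-in, not the published one.
    """
    sentence_tokens = tokenize(sentence)
    graph_tokens = set()
    for subj, _pred, obj in subgraph:
        graph_tokens |= tokenize(subj) | tokenize(obj)
    if not graph_tokens:
        return 0.0
    return len(graph_tokens & sentence_tokens) / len(graph_tokens)


def select_pairs(pairs, threshold=0.5):
    """Keep the (sentence, subgraph) pairs whose overlap reaches the threshold."""
    return [(s, g) for s, g in pairs if overlap_ratio(s, g) >= threshold]


# Toy data: both sentences are paired with the same one-triple subgraph.
triples = [("Alan Turing", "place of birth", "London")]
pairs = [
    ("Alan Turing was born in London.", triples),
    ("He enjoyed long-distance running.", triples),
]
kept = select_pairs(pairs)
```

With this toy input, only the first sentence mentions the entities in its subgraph, so it is the only pair kept. A real filter would also need entity linking and alias handling, which this sketch leaves out.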
Provider:
Constructed by the authors from Wikipedia and WikiData



