
KGText

Indexed from arXiv on 2025-09-30
Download link:
https://github.com/wenhuchen/KGPT
Description:
KGText is a denoised knowledge-grounded corpus built from Wikipedia. It consists of sentence-subgraph pairs, selected by lexical overlap between each sentence and the entities of the paired knowledge graph subgraph. The corpus is intended for knowledge-grounded language model pretraining: a model pretrained on it can then be fine-tuned for a range of data-to-text generation tasks. To reduce noise and improve relevance, the authors kept 7 million "high-quality" sentences out of an original 12-million-sentence Wikipedia corpus.
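The selection-by-lexical-overlap idea can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`tokenize`, `overlap_ratio`, `select_pairs`), the triple format `(head, relation, tail)`, the matching rule, and the `0.5` threshold are all assumptions made for illustration only.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_ratio(sentence, subgraph):
    """Fraction of subgraph entities whose tokens all occur in the sentence.

    `subgraph` is assumed to be a list of (head, relation, tail) string
    triples; this layout is hypothetical, not the KGText file format.
    """
    sentence_tokens = tokenize(sentence)
    entities = {name for head, _, tail in subgraph for name in (head, tail)}
    if not entities:
        return 0.0
    matched = 0
    for name in entities:
        name_tokens = tokenize(name)
        # Count an entity only if it has tokens and all of them appear.
        if name_tokens and name_tokens <= sentence_tokens:
            matched += 1
    return matched / len(entities)


def select_pairs(pairs, threshold=0.5):
    """Keep sentence-subgraph pairs whose overlap meets the threshold.

    The threshold value is an illustrative assumption.
    """
    return [(s, g) for s, g in pairs if overlap_ratio(s, g) >= threshold]


# Toy input: one grounded sentence and one unrelated sentence.
triples = [("Alan Turing", "place of birth", "London")]
candidates = [
    ("Alan Turing was born in London.", triples),
    ("The weather was nice.", triples),
]
kept = select_pairs(candidates)
```

In this toy input, only the first sentence mentions both entities of its subgraph, so it is the only pair retained. A real filter would also need to handle aliases, partial mentions, and tokenization for entity names that do not match on whole tokens.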
Provided by:
Constructed by the authors from Wikipedia and Wikidata