KGText

Name: KGText
Creator: Constructed by the authors from Wikipedia and WikiData
License: No description available

Source: arXiv, indexed 2025-09-30

Download link:

https://github.com/wenhuchen/KGPT


Description:

KGText is a denoised knowledge-grounded corpus constructed from Wikipedia. It consists of sentence-subgraph pairs, selected according to each sentence's lexical overlap with the entities in the paired knowledge-graph subgraph. The corpus is intended for knowledge-grounded language model pretraining: a model pretrained on it can then be fine-tuned for various data-to-text generation tasks. The data has been filtered to reduce noise and improve relevance. In terms of scale, the filtering retained 7 million "high-quality" sentences out of the original 12 million in the Wikipedia corpus.
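The selection by lexical overlap described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the triple layout `(head, relation, tail)`, the tokenizer, and the 0.5 threshold are all assumptions made for illustration only.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_ratio(sentence, subgraph):
    """Fraction of subgraph entities whose name tokens all occur in the sentence.

    `subgraph` is assumed to be a list of (head, relation, tail) triples;
    only the head and tail entity names are compared against the sentence.
    """
    sentence_tokens = tokenize(sentence)
    entities = {name for head, _, tail in subgraph for name in (head, tail)}
    if not entities:
        return 0.0
    matched = sum(1 for name in entities if tokenize(name) <= sentence_tokens)
    return matched / len(entities)


def filter_pairs(pairs, threshold=0.5):
    """Keep sentence-subgraph pairs whose overlap ratio meets the threshold.

    The threshold value is hypothetical, chosen only for this example.
    """
    return [(s, g) for s, g in pairs if overlap_ratio(s, g) >= threshold]


# Toy data, invented for illustration.
pairs = [
    ("Marie Curie was born in Warsaw.",
     [("Marie Curie", "place of birth", "Warsaw")]),
    ("She later moved abroad.",
     [("Marie Curie", "residence", "Paris")]),
]
kept = filter_pairs(pairs)
```

In this toy run the first pair is kept because both entity names appear in the sentence, while the second is dropped because neither does. A real filter would likely also need entity aliases and pronoun handling, which this sketch ignores.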

Provided by:

Constructed by the authors from Wikipedia and WikiData
