Wikipedia Ontology-Free Graph-Text dataset (WikiOFGraph)

Name: Wikipedia Ontology-Free Graph-Text dataset (WikiOFGraph)
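Creator: Graduate School of Artificial Intelligence and Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), South Korea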
Published: 2024-09-11 16:16:20
License: not specified

Source: arXiv. Updated: 2024-09-11. Indexed: 2024-09-13.

Download link:

https://github.com/daehuikim/WikiOFGraph

Overview:

WikiOFGraph is a large-scale knowledge graph-to-text generation dataset created by the Graduate School of Artificial Intelligence and the Department of Computer Science and Engineering at Pohang University of Science and Technology. It contains 5.85 million general-domain graph-text pairs, generated with Large Language Models (LLMs) and the Data-QuestEval method without relying on external ontologies. The creation process consists of collecting sentences from Wikipedia, extracting graph representations with an LLM, and filtering the results with Data-QuestEval to ensure high consistency between each graph and its text. The dataset targets the shortcomings of existing datasets for general-domain knowledge graph-to-text generation, in particular poor graph-text alignment.

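Provider:

Graduate School of Artificial Intelligence and Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), South Korea

Created: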

2024-09-11

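Summary of original information

WikiOFGraph dataset overview

Data source

The dataset is distributed through Hugging Face datasets.
The data can also be obtained manually through the download link in the repository.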

Data loading

Load the dataset with the datasets library:

    from datasets import load_dataset

    dataset = load_dataset("andreaKIM/WikiOFGraph")
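This page does not list the dataset's splits or column names, so a cautious first step is to stream a few rows and inspect them before downloading all 5.85 million pairs. The sketch below assumes only the dataset identifier from the loading snippet; `open_streaming`, `peek`, and `describe` are hypothetical helper names introduced here for illustration.

```python
from itertools import islice

DATASET_ID = "andreaKIM/WikiOFGraph"  # identifier taken from the loading snippet


def open_streaming(dataset_id=DATASET_ID):
    """Open all available splits lazily instead of downloading them in full."""
    # Imported inside the function so the helpers below work without the library.
    from datasets import load_dataset

    return load_dataset(dataset_id, streaming=True)


def peek(records, n=3):
    """Return the first n rows of any iterable of dict-like records."""
    return [dict(row) for row in islice(records, n)]


def describe(splits, n=3):
    """Map each split name to the column names and first n rows found in it."""
    summary = {}
    for name, split in splits.items():
        rows = peek(split, n)
        columns = sorted({key for row in rows for key in row})
        summary[name] = {"columns": columns, "rows": rows}
    return summary
```

Calling `describe(open_streaming())` requires network access and the `datasets` package; the returned dictionary shows which splits and fields actually exist, which avoids guessing at column names.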

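Data processing

The dataset is generated in three steps: data preprocessing, graph extraction, and Data-QuestEval filtering.
The detailed implementation is in the process directory.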
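The three stages can be pictured as a chain of generators. The following is a minimal sketch, not the code from the process directory: every function name is hypothetical, the extractor is passed in as a callable, and `token_coverage` is a deliberately crude stand-in for Data-QuestEval (it only measures word overlap, whereas Data-QuestEval is a question-answering-based metric).

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]
Pair = Tuple[List[Triple], str]  # (graph, sentence)


def preprocess(sentences: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Collapse whitespace and drop very short sentences (threshold is illustrative)."""
    for sentence in sentences:
        sentence = " ".join(sentence.split())
        if len(sentence.split()) >= min_words:
            yield sentence


def build_pairs(sentences: Iterable[str],
                extract: Callable[[str], List[Triple]]) -> Iterator[Pair]:
    """Run the graph extractor on each sentence and skip empty graphs."""
    for sentence in sentences:
        graph = extract(sentence)
        if graph:
            yield graph, sentence


def filter_pairs(pairs: Iterable[Pair],
                 score: Callable[[List[Triple], str], float],
                 threshold: float) -> Iterator[Pair]:
    """Keep only the pairs whose consistency score reaches the threshold."""
    for graph, sentence in pairs:
        if score(graph, sentence) >= threshold:
            yield graph, sentence


def token_coverage(graph: List[Triple], sentence: str) -> float:
    """Stand-in score: share of graph tokens found in the sentence. NOT Data-QuestEval."""
    words = set(sentence.lower().replace(".", " ").replace(",", " ").split())
    tokens = [t for triple in graph for part in triple for t in part.lower().split()]
    if not tokens:
        return 0.0
    return sum(t in words for t in tokens) / len(tokens)
```

In use, `filter_pairs(build_pairs(preprocess(sentences), extract), token_coverage, threshold)` yields the surviving pairs lazily, so each stage can be swapped for the real implementation without changing the others.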

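Experiments and analysis

Code for the experiments is in the experiments directory.
Details of the qualitative analysis, with example outputs, are in the qualitativeAnalysis directory.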

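Collected summary

Dataset introduction

Construction method

WikiOFGraph is built by collecting source sentences from Wikipedia, using a large language model (LLM) with in-context learning to extract a graph representation from each sentence, and filtering the results with Data-QuestEval to ensure high graph-text consistency. Because the method does not rely on an external ontology, it avoids the graph-text misalignment common in ontology-based datasets.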
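In-context learning here means showing the LLM a few sentence-graph demonstrations before the target sentence. The sketch below illustrates that idea under stated assumptions: the instruction wording, the `subject | predicate | object` line format, and the function names are all hypothetical and are not taken from the dataset or its repository. No model is called; the code only builds the prompt and parses a completion.

```python
def format_graph(triples):
    """Render triples one per line as 'subject | predicate | object' (assumed format)."""
    return "\n".join(f"{s} | {p} | {o}" for s, p, o in triples)


def build_prompt(demos, sentence):
    """Assemble a few-shot prompt from (sentence, triples) demonstrations."""
    parts = ["Extract the facts stated in each sentence as "
             "'subject | predicate | object' lines."]
    for demo_sentence, demo_triples in demos:
        parts.append(f"Sentence: {demo_sentence}\nGraph:\n{format_graph(demo_triples)}")
    # The prompt ends with an open 'Graph:' slot for the model to complete.
    parts.append(f"Sentence: {sentence}\nGraph:")
    return "\n\n".join(parts)


def parse_graph(completion):
    """Parse a completion back into triples, ignoring malformed lines."""
    triples = []
    for line in completion.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) == 3 and all(fields):
            triples.append(tuple(fields))
    return triples
```

Discarding malformed lines in `parse_graph` is a design choice for this sketch: an LLM may emit stray commentary, and a later consistency filter is better placed to reject graphs that lost too much in parsing.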

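Usage

WikiOFGraph is used by fine-tuning a pretrained language model (PLM) on it for knowledge graph-to-text generation. According to the authors' experiments, models fine-tuned this way outperform models trained on other datasets across a range of evaluation metrics.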
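Fine-tuning a sequence-to-sequence PLM requires the graph to be flattened into a string. A minimal sketch of that preparation step follows, assuming triples arrive as `(subject, predicate, object)` tuples; the `<S>`, `<P>`, `<O>` markers, the task prefix, and the function names are illustrative choices, not the format used by the dataset or the authors.

```python
def linearize(triples):
    """Flatten triples into one string with role markers (assumed marker scheme)."""
    return " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in triples)


def to_seq2seq_example(triples, text, prefix="graph to text: "):
    """Build a source/target pair for seq2seq fine-tuning; the prefix is hypothetical."""
    return {"source": prefix + linearize(triples), "target": text}


example = to_seq2seq_example(
    [("Marie Curie", "born in", "Warsaw")],
    "Marie Curie was born in Warsaw.",
)
```

The resulting `source` and `target` strings would then be tokenized and passed to the training loop of whichever model is being fine-tuned.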

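Background and challenges

Background overview

Knowledge graph-to-text (G2T) generation is the task of converting structured knowledge graphs into natural language text. Recent advances in pretrained language models (PLMs) have markedly improved G2T performance, but their effectiveness depends on datasets with precise graph-text alignment. The scarcity of high-quality, general-domain G2T datasets has limited progress in this area. To address this, we introduce the Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset built with a novel method that leverages a large language model (LLM) and Data-QuestEval. The dataset contains 5.85M general-domain graph-text pairs and offers high graph-text consistency without relying on external ontologies. Experimental results show that PLMs fine-tuned on WikiOFGraph outperform those trained on other datasets across various evaluation metrics. Our method proves to be a scalable and effective solution for generating high-quality G2T data, significantly advancing the field of G2T generation.

Current challenges

The main challenges in knowledge graph-to-text generation are building high-quality, general-domain G2T datasets and resolving graph-text alignment. Traditional template-based methods require substantial manual effort, while methods based on neural encoder-decoder architectures depend on large amounts of high-quality G2T data. In addition, ontology-based datasets often suffer from graph-text misalignment, which makes it difficult to generate accurate text. A new method is therefore needed that produces high-quality G2T datasets while also resolving graph-text alignment.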

Common scenarios

Typical use cases

Wikipedia Ontology-Free Graph-Text dataset (WikiOFGraph) is primarily used in the task of Knowledge Graph-to-Text (G2T) generation, where structured knowledge graphs are transformed into natural language text. This dataset is particularly valuable for fine-tuning Pretrained Language Models (PLMs) such as T5 and BART, significantly enhancing their performance in G2T tasks.

Academic problems addressed

WikiOFGraph addresses the scarcity of high-quality, general-domain G2T datasets, which is a significant challenge in the field of G2T generation. By offering a large-scale dataset with high graph-text consistency and domain diversity, WikiOFGraph enables the development of G2T systems that perform effectively across various domains, thus advancing the field of G2T generation.

Practical applications

The practical applications of WikiOFGraph are diverse, ranging from generating descriptive summaries of knowledge graphs for educational purposes to creating detailed reports for data analytics. Its large-scale nature and high consistency make it suitable for training robust G2T models that can be deployed in various real-world scenarios, including content generation for knowledge bases and semantic web applications.

Recent research on the dataset