Wikidata Reference

Name: Wikidata Reference
Creator: figshare
Published: 2025-05-01 11:06:04
License: not specified

DataCite Commons (updated 2025-05-01, indexed 2025-09-08)

Download link:

https://figshare.com/articles/dataset/Wikidata_Reference/28602170/2


Resource description:

Dataset Summary

The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific literature. Large language models (LLMs) were used to extract and validate text spans corresponding to KG triples, ensuring high-quality alignments. The dataset can be used for training and evaluating relation extraction (RE) and knowledge graph construction systems.

Data Fields

Each row in the dataset consists of the following fields:

subject (str): The subject entity of the knowledge graph triple.
rel (str): The relation that connects the subject and object.
object (str): The object entity of the knowledge graph triple.
text (str): A natural language sentence that entails the given triple.
validation (str): LLM-based validation results, including:
    Fluent Sentence(s): TRUE/FALSE
    Subject mentioned in Text: TRUE/FALSE
    Relation mentioned in Text: TRUE/FALSE
    Object mentioned in Text: TRUE/FALSE
    Fact Entailed By Text: TRUE/FALSE
    Final Answer: TRUE/FALSE
reference_url (str): URL of the web source from which the text was extracted.
subj_qid (str): Wikidata QID for the subject entity.
rel_id (str): Wikidata Property ID for the relation.
obj_qid (str): Wikidata QID for the object entity.

Dataset Creation

The dataset was created through the following process:

1. Triple-Reference Sampling and Extraction
All relations from Wikidata were extracted using SPARQL queries. A sample of KG triples with associated reference URLs was collected for each relation.

2. Domain Analysis and Web Scraping
URLs were grouped by domain, and sampled pages were analyzed to determine their primary language. English-language web pages were scraped and processed to extract plaintext content.

3. LLM-Based Text Span Selection and Validation
LLMs were used to identify text spans from web content that correspond to KG triples. A Chain-of-Thought (CoT) prompting method was applied to validate whether the extracted text entailed the triple. The validation process included checking for fluency, subject mention, relation mention, object mention, and final entailment.

4. Final Dataset Statistics
12.5K Wikidata relations were analyzed, leading to 3.3M triple-reference pairs. After filtering for English content, 458K triple-web content pairs were processed with LLMs. 80.5K validated triple-text alignments were included in the final dataset.
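Because the validation field is stored as a single string, a consumer of the dataset will probably need to parse it before filtering rows. The following is a minimal sketch of one way to do that. It assumes the string contains each check as a "label: TRUE" or "label: FALSE" pair using the labels listed under Data Fields; the exact layout of the string is not documented here, and the sample row, its reference URL, and the helper names parse_validation and is_fully_validated are hypothetical.

```python
import re

# Check labels as listed in the field description. The exact layout of the
# validation string is an assumption, not something the listing confirms.
CHECKS = [
    "Fluent Sentence(s)",
    "Subject mentioned in Text",
    "Relation mentioned in Text",
    "Object mentioned in Text",
    "Fact Entailed By Text",
    "Final Answer",
]


def parse_validation(validation: str) -> dict:
    """Return {label: bool} for every check label found in the string."""
    results = {}
    for label in CHECKS:
        pattern = re.escape(label) + r"\s*:\s*(TRUE|FALSE)"
        match = re.search(pattern, validation, re.IGNORECASE)
        if match:
            results[label] = match.group(1).upper() == "TRUE"
    return results


def is_fully_validated(validation: str) -> bool:
    """True only if all six checks are present and every one is TRUE."""
    parsed = parse_validation(validation)
    return len(parsed) == len(CHECKS) and all(parsed.values())


# Hypothetical row shaped like the fields described above.
row = {
    "subject": "Douglas Adams",
    "rel": "place of birth",
    "object": "Cambridge",
    "text": "Douglas Adams was born in Cambridge.",
    "validation": "\n".join(f"{label}: TRUE" for label in CHECKS),
    "reference_url": "https://example.org/douglas-adams",  # placeholder URL
    "subj_qid": "Q42",
    "rel_id": "P19",
    "obj_qid": "Q350",
}

if is_fully_validated(row["validation"]):
    print(row["subject"], row["rel"], row["object"])
```

Requiring all six checks, rather than only Final Answer, is a stricter filter than the dataset itself may apply; relax it if the published rows already passed final validation.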

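The domain-grouping part of step 2 can be illustrated with the Python standard library. This is only a sketch of the general idea, not the authors' pipeline: the function names, the choice to strip a leading "www.", the sample size, and the example URLs are all assumptions, and language detection on the sampled pages is left out.

```python
import random
from collections import defaultdict
from urllib.parse import urlparse


def group_by_domain(urls):
    """Group reference URLs by host name, ignoring a leading 'www.'."""
    groups = defaultdict(list)
    for url in urls:
        host = urlparse(url).hostname or ""
        if host.startswith("www."):
            host = host[4:]
        groups[host].append(url)
    return dict(groups)


def sample_per_domain(groups, k=2, seed=0):
    """Pick up to k URLs per domain for later language analysis."""
    rng = random.Random(seed)
    return {
        domain: rng.sample(domain_urls, min(k, len(domain_urls)))
        for domain, domain_urls in groups.items()
    }


# Placeholder URLs for illustration only.
urls = [
    "https://www.example.org/a",
    "http://example.org/b",
    "https://news.example.com/story",
]

groups = group_by_domain(urls)
sampled = sample_per_domain(groups, k=1)
for domain, picks in sampled.items():
    print(domain, len(groups[domain]), picks)
```

Sampling a few pages per domain keeps the language check cheap when a single domain contributes many reference URLs.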

Provider:

figshare

Created:

2025-03-17
