Text2KGBench

Source: arXiv. Updated: 2023-08-04. Indexed: 2024-06-21.
Download link:
https://github.com/cenguix/Text2KGBench
Overview:
Text2KGBench is a benchmark dataset for evaluating how well language models generate knowledge graphs from text. It was created by researchers at IBM Research Europe, Ireland. The dataset has two subsets, Wikidata-TekGen and DBpedia-WebNLG, built on the ontology structures of Wikidata and DBpedia respectively and covering domains such as film, music, and sports. Wikidata-TekGen contains 10 ontologies and 13,474 sentences; DBpedia-WebNLG contains 19 ontologies and 4,860 sentences. The dataset was built by manually constructing ontologies, generating triples with SPARQL queries, and aligning those triples with Wikipedia sentences through distant supervision. Text2KGBench evaluates a language model's ability to extract facts from text while adhering to a given ontology. It is intended for research in natural language processing and the Semantic Web, particularly the automatic construction and completion of knowledge graphs.
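The task described above, extracting facts while staying within a given ontology, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Triple` type, the toy relation set, the example sentences' triples, and the exact-match scoring are hypothetical and are not taken from the Text2KGBench repository, whose own file formats and metrics may differ.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Triple(NamedTuple):
    """A (subject, relation, object) fact. Hypothetical type for illustration."""
    subject: str
    relation: str
    object: str


# Toy ontology: the only relations a model is allowed to emit (assumed values).
ALLOWED_RELATIONS: Set[str] = {"director", "cast_member", "genre"}


def split_by_conformance(
    triples: Iterable[Triple], allowed_relations: Set[str]
) -> Tuple[List[Triple], List[Triple]]:
    """Separate triples whose relation is in the ontology from those that are not."""
    conforming, violating = [], []
    for triple in triples:
        if triple.relation in allowed_relations:
            conforming.append(triple)
        else:
            violating.append(triple)
    return conforming, violating


def precision_recall_f1(
    predicted: Iterable[Triple], gold: Iterable[Triple]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of triples."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up model output and reference triples for a single sentence.
    predicted = [
        Triple("Inception", "director", "Christopher Nolan"),
        Triple("Inception", "budget", "160 million"),  # relation not in ontology
    ]
    gold = [
        Triple("Inception", "director", "Christopher Nolan"),
        Triple("Inception", "genre", "science fiction"),
    ]
    conforming, violating = split_by_conformance(predicted, ALLOWED_RELATIONS)
    print("out-of-ontology triples:", violating)
    print("precision, recall, F1:", precision_recall_f1(conforming, gold))
```

Separating the ontology check from the scoring keeps two questions distinct: whether the model stayed within the permitted relations, and whether the facts it produced match the reference triples.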
Provider:
IBM Research Europe, Ireland
Created:
2023-08-04