
knowledge_consistency_of_LLMs

ModelScope community, updated 2025-12-18, indexed 2025-11-08
Download link:
https://modelscope.cn/datasets/ibm-research/knowledge_consistency_of_LLMs
Resource description:
## What it is:

Each dataset in this delivery is made up of query clusters that test an aspect of the consistency of an LLM's knowledge about a particular domain. All the questions in a cluster are meant to be answered either 'yes' or 'no'. When the answers vary within a cluster, the knowledge is said to be inconsistent. When all the questions in a cluster are answered 'no' when the expected answer is 'yes' (or vice versa), the knowledge is said to be 'incomplete' (i.e., the LLM may not have been trained on that particular domain).

In our experience, incomplete clusters are very few (less than 3%), which suggests that the LLMs we have tested know about the domains included here (see below for a list of the individual datasets). Inconsistent clusters, by contrast, can account for 6%-20% of the total clusters.

The image below indicates the types of edges the query clusters are supposed to test. These correspond to common-sense axioms about conceptualization, such as the fact that subConceptOf is transitive (4) or that subconcepts inherit the properties of their parent concepts (5). These axioms are listed in the accompanying paper (see below).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c80841d418013c77d9f1cd/Kdx6_qaipaZvbJKQZ_M9Y.png)

## How it is made:

The questions and clusters are automatically generated from a knowledge graph, starting from seed concepts and properties. In our case, we have used Wikidata, a well-known knowledge graph. The result is an RDF/OWL subgraph that can be queried and reasoned over using Semantic Web technology. The figure below summarizes the steps used. The last two steps refer to a possible use case for this dataset, including using in-context learning to improve an LLM's performance on the dataset.
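The definitions of inconsistent and incomplete clusters given under "What it is" can be made concrete with a small sketch. It is only an illustration: the names `ClusterStatus` and `classify_cluster` are hypothetical and not part of the dataset, and the sketch assumes that every question in a cluster shares a single expected answer and that the model's replies have already been reduced to 'yes' or 'no'.

```python
from enum import Enum


class ClusterStatus(Enum):
    """Possible outcomes for one query cluster (hypothetical names)."""
    CONSISTENT = "consistent"
    INCONSISTENT = "inconsistent"
    INCOMPLETE = "incomplete"


def classify_cluster(answers, expected):
    """Classify one query cluster from the model's yes/no answers.

    answers:  list of 'yes'/'no' strings, one per question in the cluster.
    expected: the expected answer shared by the whole cluster (assumption).
    """
    normalized = {a.strip().lower() for a in answers}
    if not normalized or not normalized <= {"yes", "no"}:
        raise ValueError("answers must be a non-empty list of 'yes'/'no'")

    # Answers vary within the cluster: the knowledge is inconsistent.
    if len(normalized) > 1:
        return ClusterStatus.INCONSISTENT

    # All answers agree; compare the single answer with the expected one.
    (only_answer,) = normalized
    if only_answer == expected.strip().lower():
        return ClusterStatus.CONSISTENT
    # Uniformly the opposite of the expected answer: incomplete knowledge.
    return ClusterStatus.INCOMPLETE
```

For instance, `classify_cluster(["yes", "no", "yes", "yes"], "yes")` falls in the inconsistent category, while four 'no' answers to a cluster whose expected answer is 'yes' fall in the incomplete category.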
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c80841d418013c77d9f1cd/McMdDv_0IzBzrlrVMPfWs.png)

## Types of query clusters

There are different types of query clusters depending on what aspect of the knowledge graph and its deductive closure they capture:

Edge clusters test a single edge using different questions. For example, to test the edge ('orthopedic pediatric surgeon', IsA, 'orthopedic surgeon'), the positive or 'edge_yes' cluster (expected answer is 'yes') is:

> "is 'orthopedic pediatric surgeon' a subconcept of 'orthopedic surgeon' ?",
> "is 'orthopedic pediatric surgeon' a type of 'orthopedic surgeon' ?",
> "is every kind of 'orthopedic pediatric surgeon' also a kind of 'orthopedic surgeon' ?",
> "is 'orthopedic pediatric surgeon' a subcategory of 'orthopedic surgeon' ?"

There are also inverse edge clusters (with questions like "is 'orthopedic surgeon' a subconcept of 'orthopedic pediatric surgeon' ?") and negative or 'edge_no' clusters (with questions like "is 'orthopedic pediatric surgeon' a subconcept of 'dermatologist' ?").

Hierarchy clusters measure the consistency of a given path, including n-hop virtual edges (in the graph's deductive closure).
For example, the path ('orthopedic surgeon', 'surgeon', 'medical specialist', 'medical occupation') is tested by the cluster below:

> "is 'orthopedic surgeon' a subconcept of 'surgeon' ?",
> "is 'orthopedic surgeon' a type of 'surgeon' ?",
> "is every kind of 'orthopedic surgeon' also a kind of 'surgeon' ?",
> "is 'orthopedic surgeon' a subcategory of 'surgeon' ?",
> "is 'orthopedic surgeon' a subconcept of 'medical specialist' ?",
> "is 'orthopedic surgeon' a type of 'medical specialist' ?",
> "is every kind of 'orthopedic surgeon' also a kind of 'medical specialist' ?",
> "is 'orthopedic surgeon' a subcategory of 'medical specialist' ?",
> "is 'orthopedic surgeon' a subconcept of 'medical_occupation' ?",
> "is 'orthopedic surgeon' a type of 'medical_occupation' ?",
> "is every kind of 'orthopedic surgeon' also a kind of 'medical_occupation' ?",
> "is 'orthopedic surgeon' a subcategory of 'medical_occupation' ?"

Property inheritance clusters test the most basic property of conceptualization. If an orthopedic surgeon is a type of surgeon, we expect that all the properties of surgeons, e.g., having to be board certified, having attended medical school or working in the field of surgery, are inherited by orthopedic surgeons. The example below tests the last of these:

> "is 'orthopedic surgeon' a subconcept of 'surgeon' ?",
> "is 'orthopedic surgeon' a type of 'surgeon' ?",
> "is every kind of 'orthopedic surgeon' also a kind of 'surgeon' ?",
> "is 'orthopedic surgeon' a subcategory of 'surgeon' ?",
> "is the following statement true? 'orthopedic surgeon works on the field of surgery' ",
> "is the following statement true? 'surgeon works on the field of surgery' ",
> "is it accurate to say that 'orthopedic surgeon works on the field of surgery'? ",
> "is it accurate to say that 'surgeon works on the field of surgery'? "

## List of datasets

To show the versatility of our approach, we have constructed similar datasets in the domains below. We test one property inheritance per dataset.
The Wikidata main QNode (the node corresponding to the entities) and PNode (the node corresponding to the property) are given in the "WD concept" and "WD property" columns.

| domain | top concept | WD concept | main property | WD property |
|----- | ----- | -----| ----- | ----- |
| Academic Disciplines | "Academic Discipline" | https://www.wikidata.org/wiki/Q11862829 | "has use" | https://www.wikidata.org/wiki/Property:P366 |
| Dishes | "Dish" | https://www.wikidata.org/wiki/Q746549 | "has parts" | https://www.wikidata.org/wiki/Property:P527 |
| Financial products | "Financial product" | https://www.wikidata.org/wiki/Q15809678 | "used by" | https://www.wikidata.org/wiki/Property:P1535 |
| Home appliances | "Home appliance" | https://www.wikidata.org/wiki/Q212920 | "has use" | https://www.wikidata.org/wiki/Property:P366 |
| Medical specialties | "Medical specialty" | https://www.wikidata.org/wiki/Q930752 | "field of occupation" | https://www.wikidata.org/wiki/Property:P425 |
| Music genres | "Music genre" | https://www.wikidata.org/wiki/Q188451 | "practiced by" | https://www.wikidata.org/wiki/Property:P3095 |
| Natural disasters | "Natural disaster" | https://www.wikidata.org/wiki/Q8065 | "has cause" | https://www.wikidata.org/wiki/Property:P828 |
| Software | "Software" | https://www.wikidata.org/wiki/Q7397 | "studied in" | https://www.wikidata.org/wiki/Property:P7397 |

The size and configuration of the datasets are listed below:

| domain | edges_yes | edges_no | edges_in | hierarchies | property hierarchies |
| ------------------- | :----: | :-----: | :-----: | :-----: | :-----: |
| Academic Disciplines | 52 | 308 | 52 | 30 | 1 |
| Dishes | 197 | 519 | 197 | 62 | 121 |
| Financial products | 112 | 433 | 108 | 40 | 32 |
| Home appliances | 58 | 261 | 58 | 31 | 13 |
| Medical specialties | 122 | 386 | 114 | 55 | 63 |
| Music genres | 490 | 807 | 488 | 212 | 139 |
| Natural disasters | 45 | 225 | 44 | 21 | 22 |
| Software | 80 | 572 | 79 | 114 | 4 |

## Want to know more?
For background and motivation on this dataset, please check https://arxiv.org/abs/2405.20163. The paper is also to be published in COLM 2024:

    @inproceedings{Uceda_2024_1,
      title={Reasoning about concepts with LLMs: Inconsistencies abound},
      author={Rosario Uceda Sosa and Karthikeyan Natesan Ramamurthy and Maria Chang and Moninder Singh},
      booktitle={Proc.\ 1st Conference on Language Modeling (COLM 24)},
      year={2024}
    }

## Questions? Comments?

Please contact rosariou@us.ibm.com, knatesa@us.ibm.com, Maria.Chang@ibm.com or moninder@us.ibm.com
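As an illustration of the edge and hierarchy clusters described under "Types of query clusters", the sketch below rebuilds their questions from a concept pair or a path. It reuses the four question phrasings quoted in the examples; the function names `edge_cluster` and `hierarchy_cluster` are hypothetical, and this is not the authors' generation pipeline, which works from a Wikidata-derived RDF/OWL subgraph.

```python
# The four phrasings used in the edge and hierarchy examples of this card.
EDGE_TEMPLATES = (
    "is '{child}' a subconcept of '{parent}' ?",
    "is '{child}' a type of '{parent}' ?",
    "is every kind of '{child}' also a kind of '{parent}' ?",
    "is '{child}' a subcategory of '{parent}' ?",
)


def edge_cluster(child, parent):
    """Return the questions that test the single edge (child, IsA, parent)."""
    return [t.format(child=child, parent=parent) for t in EDGE_TEMPLATES]


def hierarchy_cluster(path):
    """Return the questions for a path (concept, parent, grandparent, ...).

    The first concept is tested against every ancestor on the path, which
    covers the direct edge and the n-hop virtual edges.
    """
    if len(path) < 2:
        raise ValueError("a path needs at least two concepts")
    start, ancestors = path[0], path[1:]
    questions = []
    for ancestor in ancestors:
        questions.extend(edge_cluster(start, ancestor))
    return questions


positive = edge_cluster("orthopedic pediatric surgeon", "orthopedic surgeon")
# Swapping the arguments gives the questions of the inverse edge cluster.
inverse = edge_cluster("orthopedic surgeon", "orthopedic pediatric surgeon")
path_questions = hierarchy_cluster(
    ["orthopedic surgeon", "surgeon", "medical specialist", "medical occupation"]
)
```

Negative ('edge_no') clusters can be produced the same way by pairing a concept with one that is not its ancestor, such as 'dermatologist' in the example above.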

Provider:
maas
Created:
2025-10-12