Automatically Generated Chemical KG
Source: DataCite Commons. Updated 2025-06-01; indexed 2025-05-07.
Download link:
https://figshare.com/articles/dataset/Automatically_Generated_Chemical_KG/27234222/2
Resource description:
Abstract - In 2020, nearly 3 million scientific and engineering papers were published worldwide (White, K. Publications Output: U.S. Trends And International Comparisons). The vastness of the literature that already exists, the increasing rate of appearance of new publications, and the timely translation of artificial intelligence methods into scientific and engineering communities have ushered in the development of automated methods for mining and extracting information from technical documents. However, domain-specific approaches for extracting knowledge graph representations from semantic information remain limited. In this paper, we develop a natural language processing (NLP) approach to extract knowledge graphs resulting in a semantically structured network (SSN) that can be queried. After a detailed exposition of the modeling method, the approach is demonstrated specifically for the synthetic chemistry of organic molecules from the text of approximately 100,000 full-length patents. In this paper, we focus specifically on characterizing the knowledge graph to develop insights into the linguistic patterns and trends within the data and to establish objective graph characteristics that may enable comparisons among other text-based knowledge graphs across domains. Graph characterization is performed for network motif structures, assortativity, and eigenvector centrality. The structural information provided by the measures reveals language tendencies commonly employed by authors in the text discourse for chemical reactions. These include observations of the prevalence of descriptions of specific compound names, that common solvents and drying agents cut across large numbers of chemical synthesis approaches, and that power-law trends clearly emerge in the limit of larger corpora. 
The findings provide important quantitative characterizations of knowledge graphs for use in validation in large data settings.

DOI of publication: 10.1021/acs.jcim.4c01904

Description of data and formatting: This data represents the chemical network information extracted as a result of the efforts described within the corresponding paper. There are 2 files present in this repo. The small_SSKG_sample.json file is just a small snapshot of the larger file for use in unit testing. The Full_SSKG.json file is the complete graph extracted from the patent literature. The files are in JSON format and are intended to be loaded within Python as dictionaries. The Full_SSKG.json file is approximately 11GB in size when extracted.

    import json
    with open('path/to/KGfile.json', 'r') as fp:
        data = json.load(fp)
    ents = data["entities"]  # this is a list of dictionary type items
    rels = data["relations"]  # this is a list of dictionary type items
    print(ents[0])
    print(rels[0])

Recommended visualization software: Cytoscape
The open source Cytoscape program was found to produce the most easily read graphics and was least likely to crash.

02/05/2025 Note: The full KG file should be uploaded within a week.
02/11/2025 Note: Full_SSKG.json is uploaded.
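The description states that the files load as dictionaries with "entities" and "relations" lists, but it does not document the fields inside each record. The sketch below is one possible way to find out which keys occur and how often before writing any analysis code. The helper names `summarize_kg` and `load_and_summarize` are illustrative, and the field names in the toy data ("id", "name", "source", "target", "type") are hypothetical placeholders, not the dataset's actual schema.

```python
import json
from collections import Counter


def summarize_kg(data):
    """Count records and tally top-level keys for the two documented sections."""
    summary = {}
    for section in ("entities", "relations"):
        items = data.get(section, [])
        key_counts = Counter()
        for item in items:
            # The description says each item is a dictionary; skip anything else.
            if isinstance(item, dict):
                key_counts.update(item.keys())
        summary[section] = {"count": len(items), "keys": dict(key_counts)}
    return summary


def load_and_summarize(path):
    """Load a KG JSON file from disk and summarize it."""
    with open(path, "r", encoding="utf-8") as fp:
        return summarize_kg(json.load(fp))


# Toy stand-in for the real file. All field names here are hypothetical.
toy = {
    "entities": [
        {"id": 0, "name": "toluene"},
        {"id": 1, "name": "sodium sulfate"},
    ],
    "relations": [
        {"source": 0, "target": 1, "type": "example"},
    ],
}

summary = summarize_kg(toy)
print(summary["entities"])
print(summary["relations"])
```

Because `json.load` reads the whole file into memory, it may be sensible to call `load_and_summarize` on small_SSKG_sample.json first and move to the roughly 11GB Full_SSKG.json only on a machine with enough RAM.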
Provider:
figshare
Created:
2025-02-11



