Two complementary AI approaches for predicting UMLS semantic group assignment: heuristic reasoning and deep learning

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.dfn2z356z

Description:

Objective: Use heuristic, deep learning (DL), and hybrid AI methods to predict semantic group (SG) assignments for new UMLS Metathesaurus atoms, with target accuracy ≥ 95%.

Materials and Methods: We used train-test datasets from successive 2020AA-2022AB UMLS Metathesaurus releases. Our heuristic “waterfall” approach employed a sequence of seven different SG prediction methods. Atoms not qualifying for a method were passed on to the next method. The DL approach generated BioWordVec and SapBERT embeddings for atom names, BioWordVec embeddings for source vocabulary names, and BioWordVec embeddings for atom names of the second-to-top nodes of an atom’s source hierarchy. We fed a concatenation of the four embeddings into a fully connected multi-layer neural network with an output layer of 15 nodes (one for each SG). Both methods were capable of estimating the probability that their predicted SG for an atom would be correct. We developed two hybrid SG prediction methods combining the strengths of heuristic and DL methods.

Results: The heuristic waterfall approach accurately predicted 94.3% of SGs for 1,563,692 new unseen atoms. The DL accuracy on the same dataset was also 94.3%. The hybrid approaches achieved an average accuracy of 96.5%.

Conclusion: Our study demonstrated that AI methods can predict SG assignments for new UMLS atoms with sufficient accuracy to be potentially useful as an intermediate step in the time-consuming task of assigning new atoms to UMLS concepts (CUIs). We showed that for SG prediction, combining heuristic methods and DL methods can produce better results than either alone.
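The “waterfall” control flow described in the abstract (try each method in order, and pass an atom on when it does not qualify) can be illustrated with a minimal sketch. The two methods below, the atom fields `name` and `source`, and the vocabulary name `ExampleDrugVocab` are hypothetical stand-ins chosen for illustration; the seven methods actually used by the authors are not described in this record and are not reproduced here.

```python
from typing import Callable, Optional, Sequence, Tuple

# A method returns an SG label when the atom qualifies for it, otherwise None.
Method = Callable[[dict], Optional[str]]


def waterfall_predict(
    atom: dict, methods: Sequence[Method]
) -> Tuple[Optional[str], Optional[int]]:
    """Return (predicted SG, index of the method that fired).

    Returns (None, None) when no method in the sequence applies.
    """
    for index, method in enumerate(methods):
        sg = method(atom)
        if sg is not None:
            return sg, index  # first qualifying method wins
    return None, None


# Hypothetical method 1: look up the SG from the atom's source vocabulary.
def by_source_vocabulary(atom: dict) -> Optional[str]:
    lookup = {"ExampleDrugVocab": "CHEM"}  # assumed mapping, illustration only
    return lookup.get(atom.get("source"))


# Hypothetical method 2: a crude name-suffix rule.
def by_name_suffix(atom: dict) -> Optional[str]:
    name = atom.get("name", "").lower()
    return "DISO" if name.endswith("itis") else None


METHODS = [by_source_vocabulary, by_name_suffix]
```

Returning the index of the method that fired makes it possible to track per-method accuracy on held-out data, which is one plausible way to attach a correctness estimate to each prediction.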

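Because both the heuristic and the DL method estimate the probability that their predicted SG is correct, one simple way to combine them is to keep whichever prediction has the higher estimated probability. The sketch below shows that rule only as an assumption for illustration: this record does not specify how the authors' two hybrid methods work, and the names `Prediction` and `hybrid_predict` are hypothetical.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    sg: str      # predicted semantic group label, e.g. "DISO"
    prob: float  # estimated probability that the prediction is correct


def hybrid_predict(heuristic: Prediction, dl: Prediction) -> Prediction:
    """Combine two predictions by estimated confidence (assumed rule).

    If both methods agree, keep the shared label with the larger estimate.
    If they disagree, keep the more confident one; ties go to the heuristic.
    """
    if heuristic.sg == dl.sg:
        return Prediction(heuristic.sg, max(heuristic.prob, dl.prob))
    return heuristic if heuristic.prob >= dl.prob else dl
```

A rule of this kind only helps when the two probability estimates are comparably calibrated, so in practice they would need to be checked against held-out data before being compared directly.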

Created:

2023-07-24
