HESML V2R1 Java software library of semantic similarity measures for the biomedical domain

Name: HESML V2R1 Java software library of semantic similarity measures for the biomedical domain
Creator: e-cienciaDatos
Published: 2025-11-12 09:19:10
License: No description available

Source: DataCite Commons. Updated: 2025-11-12. Indexed: 2025-04-10.

Download link:

https://edatos.consorciomadrono.es/citation?persistentId=doi:10.21950/AQLSMV

Description:

This dataset introduces HESML V2R1, the sixth release of the Half-Edge Semantic Measures Library (HESML) detailed in [24]. HESML V2R1 is a linearly scalable and efficient Java software library of ontology-based semantic similarity measures and Information Content (IC) models for ontologies such as WordNet, SNOMED-CT, MeSH, GO, and any other ontology based on the OBO file format. It also implements most of the sentence similarity methods in the biomedical domain, together with a set of sentence pre-processing configurations and the integration of the three main biomedical NER tools: Metamap [3], MetamapLite [7], and cTAKES [31].

HESML V2R1 implements most of the ontology-based semantic similarity measures and IC models reported in the literature, and it supports the evaluation of three pre-trained word embedding models for the general domain and 33 pre-trained embeddings and language models. It also provides an XML-based input file format for specifying reproducible word/concept similarity experiments based on WordNet, SNOMED-CT, MeSH, or GO without software coding, as well as the software clients needed to run the sentence-based experiments in the biomedical domain.

HESML V2R1 introduces the following novelties:
(1) the software implementation of a new package for the evaluation of sentence similarity methods;
(2) the software implementation of most of the sentence similarity methods in the biomedical domain;
(3) the implementation of a new package for sentence pre-processing, together with a set of sentence pre-processing configurations;
(4) the integration of the three main biomedical NER tools, Metamap [3], MetamapLite [7], and cTAKES [31];
(5) the software implementation of a parser based on the averaging Simple Word EMbeddings (SWEM) models introduced by Shen et al. [32] for efficiently loading and evaluating FastText-based [4] and other word embedding models;
(6) the integration of Python wrappers for the evaluation of BERT [8], Universal Sentence Encoder (USE) [5], and Flair [1] models; and
(7) the software implementation of a new string-based sentence similarity method called LiBlock, based on the aggregation of the Li et al. [29] similarity and Block distance [9] measures, as well as eight new variants of the ontology-based methods proposed by Sogancioglu et al. [33], and a new pre-trained word embedding model based on FastText [4] and trained on the full text of the articles in the PMC-BioC corpus [6].

The HESML library is freely distributed for any non-commercial purpose under a CC BY-NC-SA 4.0 license, subject to citing the two main HESML papers [24] as an attribution requirement. However, the HESML distribution also includes other datasets, databases, and data files whose use requires an attribution acknowledgement by any user of HESML. We therefore urge HESML users to comply with the licensing terms of the other resources distributed with the library, as detailed in its companion release notes.
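The averaging SWEM idea mentioned in novelty (5) can be illustrated with a minimal Java sketch: a sentence vector is the mean of its word vectors, and two sentences are compared by cosine similarity. This is not HESML code and does not use the HESML API. The class name `SwemSketch`, its methods, the toy two-dimensional vectors, and the choice to skip out-of-vocabulary tokens are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; hypothetical names, not part of the HESML API. */
final class SwemSketch {

    /**
     * Returns the mean of the vectors of all tokens found in the table.
     * Out-of-vocabulary tokens are skipped (an assumption of this sketch).
     * If no token is found, the zero vector is returned.
     */
    static double[] average(List<String> tokens, Map<String, double[]> table, int dim) {
        double[] sum = new double[dim];
        int found = 0;
        for (String token : tokens) {
            double[] vector = table.get(token);
            if (vector == null) {
                continue;
            }
            for (int i = 0; i < dim; i++) {
                sum[i] += vector[i];
            }
            found++;
        }
        if (found > 0) {
            for (int i = 0; i < dim; i++) {
                sum[i] /= found;
            }
        }
        return sum;
    }

    /** Cosine similarity; defined here as 0.0 when either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy embedding table; real models such as FastText use hundreds of dimensions.
        Map<String, double[]> table = new HashMap<>();
        table.put("heart", new double[] {1.0, 0.0});
        table.put("attack", new double[] {0.0, 1.0});
        table.put("cardiac", new double[] {0.9, 0.1});
        table.put("arrest", new double[] {0.2, 0.8});

        double[] s1 = average(Arrays.asList("heart", "attack"), table, 2);
        double[] s2 = average(Arrays.asList("cardiac", "arrest"), table, 2);
        System.out.println("similarity = " + cosine(s1, s2));
    }
}
```

Averaging ignores word order, which is what makes this family of models cheap to load and evaluate; how HESML itself handles unknown tokens or normalisation is not stated in this description.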

Provider:

e-cienciaDatos

Created:

2022-02-16