Experimental data for computing semantic similarity between concepts using multiple Inheritances in Wikipedia category graph
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://data.mendeley.com/datasets/hnmb43sj5s
Description:
In this data article, we provide experimental data for computing the semantic similarity between concepts (words) taken from the gold-standard word similarity benchmarks MC30 (English), RG65 (Spanish), and RG65 (French). The data relate to the multiple-inheritance-based semantic similarity methods proposed in M. J. Hussain et al.
The dataset contains four folders: "Benchmarks_results_graphs", "French_RG65", "MC30", and "Spanish_RG65". The folder "Benchmarks_results_graphs" contains the Pearson correlation values of the experimental results for the English (MC30), French (RG65), and Spanish (RG65) benchmarks. The folders "French_RG65", "MC30", and "Spanish_RG65" contain all the pre-processed data files needed to run the Python-based program that computes the semantic similarity between French, English, and Spanish Wikipedia concepts according to our methods. For example, the folder "French_RG65" contains:
(1) the experiments on the RG65 (French) benchmark, in the sub-folder "French_RG65_results";
(2) the data required to compute Information Content (IC) with respect to category hyponyms and category pages, in the sub-folder "predate_fr";
(3) the disambiguated French Wikipedia concepts, in the file "disambiguated_benchmark.csv";
(4) the page ids of the French Wikipedia concepts, in the file "fr_RG65_pageid.csv";
(5) the categories associated with the French Wikipedia pages, in the file "fr_RG65_page_categories.txt";
(6) the source code that computes the semantic similarity between French Wikipedia concepts using IC with respect to category hyponyms, in the file "RG_French_Sim_IC_hypos.txt";
(7) the source code that computes the semantic similarity between French Wikipedia concepts using IC with respect to category pages, in the file "RG_French_Sim_IC_pages.txt";
(8) the source code that reproduces the data associated with Table 3, in the file "Table3_French.txt".
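To illustrate what IC with respect to category hyponyms (item (2) above) can look like in a category graph with multiple inheritance, here is a minimal sketch. It is not taken from the dataset's source files: it assumes a commonly used intrinsic formulation, IC(c) = 1 - log(hypo(c) + 1) / log(N), which may differ from the definition used in the authors' methods. The toy graph, the function names, and the value of N are all hypothetical. The point it shows is that a category reachable through several parents is counted only once.

```python
import math


def descendants(children, node):
    """Collect every category below `node`, counting each one once.

    `children` maps a category name to the list of its direct
    sub-categories. A category with several parents (multiple
    inheritance) is reached through more than one path, so a set
    is used to avoid double counting.
    """
    seen = set()
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(children.get(current, ()))
    # Guard against cycles that lead back to the starting category.
    seen.discard(node)
    return seen


def intrinsic_ic(children, node, total_categories):
    """Assumed formulation: IC(c) = 1 - log(hypo(c) + 1) / log(N)."""
    hypo = len(descendants(children, node))
    return 1.0 - math.log(hypo + 1) / math.log(total_categories)


# Hypothetical toy graph: "Linguistics" has two parent categories.
toy_graph = {
    "Root": ["Science", "Culture"],
    "Science": ["Linguistics"],
    "Culture": ["Linguistics"],
    "Linguistics": [],
}

for name in toy_graph:
    print(name, round(intrinsic_ic(toy_graph, name, len(toy_graph)), 3))
```

Under this assumed formulation, a leaf category gets the highest IC and the root category the lowest, which is the usual behaviour of hyponym-based IC measures.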
Together, these folders provide all the pre-processed data files needed to run the Python-based program, reproduce the experimental results of our semantic similarity methods, and further analyse the graph structure of the different Wikipedia category graphs.
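The results are reported as Pearson correlation values, so reproducing them comes down to correlating computed similarity scores with the human judgments of a benchmark. The sketch below shows that calculation in plain Python. The scores are made-up placeholders, not values from the dataset, and the function name is hypothetical. The layout of the dataset's result files is not described here, so loading them is left out.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT from the dataset):
# human similarity ratings and scores produced by a similarity method
# for the same word pairs, in the same order.
human_ratings = [3.9, 3.1, 1.2, 0.4]
computed_scores = [0.95, 0.70, 0.30, 0.05]

print(pearson(human_ratings, computed_scores))
```

A value close to 1 means the computed scores rank and scale the word pairs much as the human annotators did.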
Created:
2020-02-25



