Datasets from "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents"

Name: Datasets from "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents"
Creator: Illinois Data Bank
License: No description available

Indexed from doi.org on 2025-01-16

Download link:

https://doi.org/10.13012/B2IDB-8020612_V1

Resource description:

# Overview

These datasets were created in conjunction with the dissertation "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents," by Adam Kehoe. The datasets consist of the following:

* twin_not_abstract_matched_complete.tsv: a tab-delimited file consisting of pairs of MEDLINE articles with identical titles, authors and years of publication. This file contains the PMIDs of the duplicate publications, as well as their medical subject headings (MeSH) and three measures of their indexing consistency.
* twin_abstract_matched_complete.tsv: the same as above, except that the MEDLINE articles also have matching abstracts.
* mesh_training_data.csv: a comma-separated file containing the training data for the model discussed in the dissertation.
* mesh_scores.tsv: a tab-delimited file containing a pairwise similarity score based on word embeddings, and the MeSH hierarchy relationship of each pair.

## Duplicate MEDLINE Publications

The twin_not_abstract_matched_complete.tsv and twin_abstract_matched_complete.tsv files have the same structure. They have the following columns:

1. pmid_one: the PubMed unique identifier of the first paper
2. pmid_two: the PubMed unique identifier of the second paper
3. mesh_one: a list of medical subject headings (MeSH) from the first paper, delimited by the "|" character
4. mesh_two: a list of medical subject headings from the second paper, delimited by the "|" character
5. hoopers_consistency: Hooper's consistency between the MeSH of the first and second paper
6. nonhierarchicalfree: a word-embedding-based consistency score described in the dissertation
7. hierarchicalfree: a word-embedding-based consistency score additionally limited by the MeSH hierarchy, described in the dissertation

## MeSH Training Data

The mesh_training_data.csv file contains the training data for the model discussed in the dissertation. It has the following columns:

1. pmid: the PubMed unique identifier of the paper
2. term: a candidate MeSH term
3. cit_count: the log of the frequency of the term in the citation candidate set
4. total_cit: the log of the total number of the paper's citations
5. citr_count: the log of the frequency of the term in the citations of the paper's citations
6. total_citofcit: the log of the total number of the citations of the paper's citations
7. absim_count: the log of the frequency of the term in the AbSim candidate set
8. total_absim_count: the log of the total number of AbSim records for the paper
9. absimr_count: the log of the frequency of the term in the citations of the AbSim records
10. total_absimr_count: the log of the total number of citations of the AbSim record
11. log_medline_frequency: the log of the frequency of the candidate term in MEDLINE
12. relevance: a binary indicator (True/False) of whether the candidate term was assigned to the target paper

## Cosine Similarity

The mesh_scores.tsv file contains a pairwise list of all MeSH terms, including their cosine similarity based on the word embedding described in the dissertation. Because the MeSH hierarchy is also used in many of the evaluation measures, the relationship of the term pair is also included. It has the following columns:

1. mesh_one: a string of the first MeSH heading
2. mesh_two: a string of the second MeSH heading
3. cosine_similarity: the cosine similarity between the terms
4. relationship_type: a string identifying the relationship type, one of none, parent/child, sibling, ancestor and direct (terms are identical, i.e. a direct hierarchy match)

The mesh_model.bin file is a binary file in word2vec C format containing the MeSH term embeddings. It was generated using version 3.7.2 of the Python gensim library (https://radimrehurek.com/gensim/). For an example of how to load the model file, see https://radimrehurek.com/gensim/models/word2vec.html#usage-examples, specifically the directions for loading the "word2vec C format."
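As an illustration of the duplicate-publication files described above, the following is a minimal sketch of reading one of them and recomputing Hooper's consistency from the "|"-delimited MeSH columns. It uses the common definition of Hooper's measure (shared terms divided by shared terms plus the terms unique to each indexer). The helper names, the presence of a header row, and the handling of empty lists are assumptions; the dissertation's exact computation (for example, its treatment of subheadings) may differ, so recomputed values need not match the hoopers_consistency column exactly.

```python
import csv


def parse_mesh(field):
    """Split a '|'-delimited MeSH field into a set of headings."""
    return {term.strip() for term in field.split("|") if term.strip()}


def hoopers_consistency(mesh_one, mesh_two):
    """Hooper's consistency: shared / (shared + only_one + only_two)."""
    one, two = parse_mesh(mesh_one), parse_mesh(mesh_two)
    union = one | two
    if not union:
        return 0.0  # assumption: two empty term lists score 0.0
    return len(one & two) / len(union)


def iter_pair_scores(path):
    """Yield (pmid_one, pmid_two, recomputed score) for each article pair.

    Assumes the TSV has a header row using the column names listed above.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            score = hoopers_consistency(row["mesh_one"], row["mesh_two"])
            yield row["pmid_one"], row["pmid_two"], score
```

For example, two papers indexed with "A|B|C" and "B|C|D" share two of four distinct headings, giving a consistency of 0.5.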
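The training data columns described above could be read into typed records along the following lines. This is a sketch only: it assumes mesh_training_data.csv has a header row with the listed column names and that relevance is serialized as the literal strings "True" and "False". The function and constant names are hypothetical.

```python
import csv

# The nine log-scaled numeric feature columns listed above.
FEATURE_COLUMNS = [
    "cit_count", "total_cit", "citr_count", "total_citofcit",
    "absim_count", "total_absim_count", "absimr_count",
    "total_absimr_count", "log_medline_frequency",
]


def read_training_rows(lines):
    """Yield one dict per (paper, candidate term) row.

    `lines` is any iterable of CSV lines, such as an open file handle.
    Numeric features become floats; relevance becomes a bool.
    """
    for row in csv.DictReader(lines):
        record = {"pmid": row["pmid"], "term": row["term"]}
        for name in FEATURE_COLUMNS:
            record[name] = float(row[name])
        # Assumption: the label is stored as the text "True" or "False".
        record["relevance"] = row["relevance"].strip() == "True"
        yield record
```

With a real file, this might be used as `with open("mesh_training_data.csv", newline="", encoding="utf-8") as f: rows = list(read_training_rows(f))`.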
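A minimal sketch of loading the embedding file with gensim and computing a cosine similarity by hand is shown below. `KeyedVectors.load_word2vec_format` with `binary=True` is gensim's loader for the word2vec C binary format. The default file path and the helper names are assumptions, and how MeSH headings are spelled as keys in the model (spacing, case) is not stated in the description, so the vocabulary should be inspected before looking up terms.

```python
import math


def load_mesh_vectors(path="mesh_model.bin"):
    """Load the MeSH term embeddings from a word2vec C binary file."""
    # Imported lazily so the cosine helper below works without gensim.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: similarity with a zero vector is 0.0
    return dot / (norm_u * norm_v)
```

Once loaded, `vectors[term]` returns the embedding for a key and `vectors.similarity(term_a, term_b)` returns gensim's own cosine similarity, which could be compared against the cosine_similarity column of mesh_scores.tsv.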


Provider:

Illinois Data Bank