Hype - PubMed dataset
Source: doi.org. Indexed: 2025-01-16
Download link:
https://doi.org/10.13012/B2IDB-0651259_V1
Description:
Prepared by Apratim Mishra

This dataset captures ‘hype’ in biomedical abstracts sourced from PubMed. The selection comprises ‘journal articles’ written in English and published between 1975 and 2019, about 5.2 million in total. Classification relies on the presence of specific candidate ‘hype words’ and their location in the abstract. An article can therefore appear several times in the dataset when multiple hype words occur in different sentences of its abstract.

There are 36 candidate hype words: 'major', 'novel', 'central', 'critical', 'essential', 'strongly', 'unique', 'promising', 'markedly', 'excellent', 'crucial', 'robust', 'importantly', 'prominent', 'dramatically', 'favorable', 'vital', 'surprisingly', 'remarkably', 'remarkable', 'definitive', 'pivotal', 'innovative', 'supportive', 'encouraging', 'unprecedented', 'bright', 'enormous', 'exceptional', 'outstanding', 'noteworthy', 'creative', 'assuring', 'reassuring', 'spectacular', and 'hopeful'.

File 1: hype_dataset.csv
Primary dataset. It has the following columns:
1. PMID: Unique article ID in PubMed.
2. Hype_word: Candidate hype word, such as ‘novel’.
3. Sentence: Sentence in the abstract that contains the hype word.
4. Abstract_length: Length of the article abstract.
5. Hype_percentile: Relative position of the hype word within the abstract.
6. Hype_value: Propensity of hype based on the hype word, the sentence, and the abstract location.
7. Introduction: The ‘I’ component of the hype word based on IMRaD (Introduction, Methods, Results, and Discussion).
8. Methods: The ‘M’ component of the hype word based on IMRaD.
9. Results: The ‘R’ component of the hype word based on IMRaD.
10. Discussion: The ‘D’ component of the hype word based on IMRaD.

File 2: hype_removed_phrases.csv
Secondary dataset with the same columns as File 1. The primary dataset excludes certain phrases that are rarely hype. Those removed phrases are collected in File 2 and modeled separately.

Removed phrases:
1. Major: histocompatibility, component, protein, metabolite, complex, surgery
2. Novel: assay, mutation, antagonist, inhibitor, algorithm, technique, series, method, hybrid
3. Central: catheters, system, design, composite, catheter, pressure, thickness, compartment
4. Critical: compartment, micelle, temperature, incident, solution, ischemia, concentration
5. Essential: medium, features, properties, opportunities
6. Unique: model, amino
7. Robust: regression
8. Vital: capacity, signs, organs, status, structures, staining, rates, cells, information
9. Outstanding: questions, issues, question, challenge, problems, problem, remains
10. Remarkable: properties
11. Definitive: radiotherapy, surgery
12. Bright: field
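The split between File 1 and File 2 can be pictured as a check on the term that accompanies each hype word. The sketch below is an illustration only and is not the author's pipeline: `REMOVED_PHRASES` holds just a subset of the removed-phrase list, and the function name `is_removed_phrase`, the simple tokenization, and the rule that the excluded term must immediately follow the hype word are all assumptions.

```python
import re

# Subset of the removed-phrase list from the description
# (hype word -> terms that make the phrase "rarely hype").
REMOVED_PHRASES = {
    "major": {"histocompatibility", "component", "protein",
              "metabolite", "complex", "surgery"},
    "robust": {"regression"},
    "vital": {"capacity", "signs", "organs", "status", "structures",
              "staining", "rates", "cells", "information"},
    "bright": {"field"},
}


def is_removed_phrase(sentence: str, hype_word: str) -> bool:
    """Return True if hype_word is directly followed by an excluded term.

    Assumption: matching is case-insensitive and inspects only the next token.
    """
    word = hype_word.lower()
    excluded = REMOVED_PHRASES.get(word, set())
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # Pair each token with its successor and look for "word + excluded term".
    return any(tok == word and nxt in excluded
               for tok, nxt in zip(tokens, tokens[1:]))
```

Under these assumptions, a sentence mentioning "vital signs" would be routed to the secondary file, while "a vital step" would stay in the primary one.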
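Because one article can contribute several rows, a first step when working with the primary file is often to group rows by PMID. The following is a minimal sketch using only the Python standard library. It assumes the CSV header uses the column names exactly as listed in the description (`PMID`, `Hype_word`), which has not been verified against the file. The function name `hype_words_by_article` is hypothetical, and the sample rows are made up for illustration, not taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def hype_words_by_article(csv_file):
    """Group hype words by PMID; an article may appear in several rows."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["PMID"]].append(row["Hype_word"])
    return dict(grouped)


# Made-up rows for illustration only (not real dataset content).
SAMPLE = io.StringIO(
    "PMID,Hype_word,Sentence\n"
    "111,novel,We describe a novel approach.\n"
    "111,promising,The results are promising.\n"
    "222,robust,The effect was robust.\n"
)

result = hype_words_by_article(SAMPLE)
```

To apply the same function to the downloaded file, an open file handle can be passed in place of `SAMPLE`, for example one obtained from `open("hype_dataset.csv", newline="")`. The file's encoding is not stated in the description and may need to be specified.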
Provider:
Illinois Data Bank



