Hype - PubMed dataset
Source: DataCite Commons. Updated: 2025-03-14. Indexed: 2025-04-16.
Download link:
https://databank.illinois.edu/datasets/IDB-5892739
Description:
Prepared by Apratim Mishra
This dataset captures ‘hype’ in biomedical abstracts sourced from PubMed. The selection covers records of type ‘journal article’ written in English and published between 1975 and 2019, about 5.2 million in total. Classification relies on the presence of specific candidate ‘hype words’ and their location within the abstract. Consequently, an article (PMID) can appear multiple times in the dataset when several hype words occur in different sentences of its abstract.
There are 35 candidate hype words: 'major', 'novel', 'central', 'critical', 'essential', 'strongly', 'unique', 'promising', 'markedly', 'excellent', 'crucial', 'robust', 'importantly', 'prominent', 'dramatically', 'favorable', 'vital', 'surprisingly', 'remarkably', 'remarkable', 'definitive', 'pivotal', 'innovative', 'supportive', 'encouraging', 'unprecedented', 'enormous', 'exceptional', 'outstanding', 'noteworthy', 'creative', 'assuring', 'reassuring', 'spectacular', and 'hopeful'.
This is version 3 of the dataset, which adds a new file: WSD_hype.tsv.
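As a rough illustration of why one PMID can yield several rows, the Python sketch below scans the sentences of an abstract for the candidate words. The function name `find_hype_words`, the naive sentence splitter, the whole-word case-insensitive matching rule, and the placeholder PMID are all assumptions made for illustration; the description does not document the author's actual extraction pipeline.

```python
import re

# The 35 candidate hype words listed in the dataset description.
HYPE_WORDS = [
    "major", "novel", "central", "critical", "essential", "strongly",
    "unique", "promising", "markedly", "excellent", "crucial", "robust",
    "importantly", "prominent", "dramatically", "favorable", "vital",
    "surprisingly", "remarkably", "remarkable", "definitive", "pivotal",
    "innovative", "supportive", "encouraging", "unprecedented", "enormous",
    "exceptional", "outstanding", "noteworthy", "creative", "assuring",
    "reassuring", "spectacular", "hopeful",
]

# Whole-word, case-insensitive matching (an assumption, not a documented rule).
HYPE_PATTERN = re.compile(r"\b(" + "|".join(HYPE_WORDS) + r")\b", re.IGNORECASE)


def find_hype_words(pmid, abstract):
    """Return one (pmid, hype_word, sentence) tuple per hype-word occurrence."""
    # Naive split after '.', '!' or '?' followed by whitespace (illustrative only).
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    rows = []
    for sentence in sentences:
        for match in HYPE_PATTERN.finditer(sentence):
            rows.append((pmid, match.group(1).lower(), sentence))
    return rows


# Made-up abstract and placeholder PMID, not taken from the dataset.
example = ("We report a novel assay for tumor detection. "
           "The results are promising and remarkably robust.")
rows = find_hype_words("00000000", example)
for row in rows:
    print(row)
```

In this made-up example a single abstract produces four rows, one per matched word, which mirrors the one-row-per-instance layout described above.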
File 1: hype_dataset_final.tsv
Primary dataset. It has the following columns:
1. PMID: represents unique article ID in PubMed
2. Year: Year of publication
3. Hype_word: Candidate hype word, such as ‘novel.’
4. Sentence: Sentence in abstract containing the hype word.
5. Hype_percentile: Relative position of the hype word within the abstract.
6. Hype_value: Propensity of hype based on the hype word, the sentence, and the abstract location.
7. Introduction: The ‘I’ component of the hype word based on IMRaD
8. Methods: The ‘M’ component of the hype word based on IMRaD
9. Results: The ‘R’ component of the hype word based on IMRaD
10. Discussion: The ‘D’ component of the hype word based on IMRaD
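A minimal sketch of reading a file with this layout and grouping hype words by article, using only the Python standard library. It assumes the TSV has a header row whose names match the column list above; the sample rows, their values, and the helper names `load_rows` and `group_by_pmid` are made up for illustration and are not records from the dataset.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample in the documented column order (placeholder values only).
SAMPLE_TSV = (
    "PMID\tYear\tHype_word\tSentence\tHype_percentile\tHype_value\t"
    "Introduction\tMethods\tResults\tDiscussion\n"
    "111\t2001\tnovel\tWe present a novel approach.\t0.10\t0.80\t0.7\t0.1\t0.1\t0.1\n"
    "111\t2001\tpromising\tThe results are promising.\t0.90\t0.75\t0.1\t0.1\t0.2\t0.6\n"
    "222\t2010\trobust\tThe effect was robust.\t0.60\t0.40\t0.1\t0.1\t0.7\t0.1\n"
)


def load_rows(handle):
    """Read tab-separated rows into dictionaries keyed by the header names."""
    return list(csv.DictReader(handle, delimiter="\t"))


def group_by_pmid(rows):
    """Collect the hype words recorded for each PMID, in file order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["PMID"]].append(row["Hype_word"])
    return dict(grouped)


rows = load_rows(io.StringIO(SAMPLE_TSV))
words_per_article = group_by_pmid(rows)
print(words_per_article)
```

For the real file, the `io.StringIO` handle would be replaced by something like `open("hype_dataset_final.tsv", newline="", encoding="utf-8")`. Whether the file needs special quoting options (sentences may contain quote characters) is not stated in the description and should be checked against the data.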
File 2: hype_removed_phrases_final.tsv
Secondary dataset with the same columns as File 1.
The primary dataset excludes certain phrases that rarely constitute hype. These removed phrases are collected in File 2 and modeled separately. Removed phrases:
1. Major: histocompatibility, component, protein, metabolite, complex, surgery
2. Novel: assay, mutation, antagonist, inhibitor, algorithm, technique, series, method, hybrid
3. Central: catheters, system, design, composite, catheter, pressure, thickness, compartment
4. Critical: compartment, micelle, temperature, incident, solution, ischemia, concentration, thinking, nurses, skills, analysis, review, appraisal, evaluation, values
5. Essential: medium, features, properties, opportunities, oil
6. Unique: model, amino
7. Robust: regression
8. Vital: capacity, signs, organs, status, structures, staining, rates, cells, information
9. Outstanding: questions, issues, question, challenge, problems, problem, remains
10. Remarkable: properties
11. Definitive: radiotherapy, surgery
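The split between File 1 and File 2 can be pictured with the following Python sketch, which flags a sentence when a hype word is immediately followed by one of its listed terms (for example "major histocompatibility"). The adjacency rule, the function name `is_removed_phrase`, and the example sentences are assumptions; the description does not state how phrases were matched, and only a subset of the list is included here.

```python
import re

# Subset of the removed phrases listed above, keyed by hype word.
REMOVED_PHRASES = {
    "major": {"histocompatibility", "component", "protein",
              "metabolite", "complex", "surgery"},
    "unique": {"model", "amino"},
    "robust": {"regression"},
    "remarkable": {"properties"},
}


def is_removed_phrase(hype_word, sentence):
    """True if hype_word is directly followed by one of its excluded terms.

    Assumes a phrase means 'hype word + next word'; this is illustrative,
    not the documented matching rule.
    """
    followers = REMOVED_PHRASES.get(hype_word.lower(), set())
    # Capture the word right after each whole-word occurrence of the hype word.
    pattern = re.compile(r"\b" + re.escape(hype_word) + r"\s+(\w+)", re.IGNORECASE)
    return any(m.group(1).lower() in followers for m in pattern.finditer(sentence))


# Made-up sentences for illustration.
print(is_removed_phrase("major", "The major histocompatibility complex was studied."))
print(is_removed_phrase("major", "This is a major advance in therapy."))
```

Under this assumed rule, rows for which the check is true would belong in hype_removed_phrases_final.tsv and the rest in hype_dataset_final.tsv.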
File 3: WSD_hype.tsv
Includes hype-based disambiguation for the candidate words targeted for word sense disambiguation (WSD).
Provider:
University of Illinois Urbana-Champaign
Created:
2025-03-14



