McGill-NLP/medal
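Dataset Card for the MeDAL dataset
Dataset Description
Dataset Summary
MeDAL is a large medical text dataset (14 GB), curated down to 4 GB, for abbreviation disambiguation. It is designed for natural language understanding pre-training in the medical domain. For example, DHF can be disambiguated to dihydrofolate, diastolic heart failure, dengue hemorrhagic fever, or dihydroxyfumarate.
Supported Tasks and Leaderboards
Medical abbreviation disambiguation
Languages
English (en)
Dataset Structure
Data Instances
An example from the training set: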
json { "abstract_id": 14145090, "text": "velvet antlers vas are commonly used in traditional chinese medicine and invigorant and contain many PET components for health promotion the velvet antler peptide svap is one of active components in vas based on structural study the svap interacts with tgfβ receptors and disrupts the tgfβ pathway we hypothesized that svap prevents cardiac fibrosis from pressure overload by blocking tgfβ signaling SDRs underwent TAC tac or a sham operation T3 one month rats received either svap mgkgday or vehicle for an additional one month tac surgery induced significant cardiac dysfunction FB activation and fibrosis these effects were improved by treatment with svap in the heart tissue tac remarkably increased the expression of tgfβ and connective tissue growth factor ctgf ROS species C2 and the phosphorylation C2 of smad and ERK kinases erk svap inhibited the increases in reactive oxygen species C2 ctgf expression and the phosphorylation of smad and erk but not tgfβ expression in cultured cardiac fibroblasts angiotensin ii ang ii had similar effects compared to tac surgery such as increases in αsmapositive CFs and collagen synthesis svap eliminated these effects by disrupting tgfβ IB to its receptors and blocking ang iitgfβ downstream signaling these results demonstrated that svap has antifibrotic effects by blocking the tgfβ pathway in CFs", "location": [63], "label": ["transverse aortic constriction"] }
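Data Fields
- text: the content of the abstract (string)
- location: the index of each substituted position (integer)
- label: the term that was substituted at that position (string)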
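The relationship between the three fields can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes that `location` holds 0-based indices into the whitespace-separated tokens of `text`, which is consistent with the training example above (index 63 falls on the token `TAC`). The helper names `abbreviations` and `expand_abbreviations` are hypothetical, and the `text` value is truncated to the opening of that example.

```python
def abbreviations(record):
    """Pair each abbreviation token with its expansion.

    Assumption: `location` indexes whitespace-separated tokens, 0-based.
    """
    tokens = record["text"].split()
    return [(tokens[i], label)
            for i, label in zip(record["location"], record["label"])]


def expand_abbreviations(record):
    """Return the text with each abbreviation replaced by its label."""
    tokens = record["text"].split()
    for i, label in zip(record["location"], record["label"]):
        tokens[i] = label
    return " ".join(tokens)


# Opening of the training example shown above (text truncated for brevity).
record = {
    "abstract_id": 14145090,
    "text": (
        "velvet antlers vas are commonly used in traditional chinese medicine "
        "and invigorant and contain many PET components for health promotion "
        "the velvet antler peptide svap is one of active components in vas "
        "based on structural study the svap interacts with tgfβ receptors and "
        "disrupts the tgfβ pathway we hypothesized that svap prevents cardiac "
        "fibrosis from pressure overload by blocking tgfβ signaling SDRs "
        "underwent TAC tac or a sham operation"
    ),
    "location": [63],
    "label": ["transverse aortic constriction"],
}

print(abbreviations(record))
# → [('TAC', 'transverse aortic constriction')]
```

Other uppercase tokens in the text, such as PET and SDRs, are also abbreviations, but only the positions listed in `location` carry a label in this record.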
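Data Splits
The dataset consists of the following files:
- full_data.csv: the full dataset, containing all 14 million abstracts
- train.csv: the subset used to train the baseline and proposed models
- valid.csv: the subset used to validate the models during training for hyperparameter selection
- test.csv: the subset used to evaluate the models and report the results in the tables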
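Because the files are large, it can help to stream them row by row instead of loading them whole. The sketch below uses only the Python standard library. The column names `TEXT`, `LOCATION`, and `LABEL`, and the one-substitution-per-row layout, are assumptions made for illustration; check the header of the actual CSV before relying on them. The helper `iter_records` is hypothetical, and the sample row is made up.

```python
import csv
import io


def iter_records(fileobj, text_col="TEXT", location_col="LOCATION",
                 label_col="LABEL"):
    """Yield one record per CSV row without loading the whole file.

    Assumption: the column names and the single integer location per row
    are illustrative; adjust them to match the real file header.
    """
    for row in csv.DictReader(fileobj):
        yield {
            "text": row[text_col],
            "location": int(row[location_col]),
            "label": row[label_col],
        }


# Made-up, in-memory stand-in for a split file such as train.csv.
sample = io.StringIO(
    "TEXT,LOCATION,LABEL\n"
    "patients with DHF were admitted,2,diastolic heart failure\n"
)

first = next(iter_records(sample))
print(first["text"].split()[first["location"]], "->", first["label"])
```

For a real file, the same generator could be fed with `open("train.csv", newline="", encoding="utf-8")`, which keeps memory use roughly constant regardless of file size.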
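Dataset Creation
Source Data
The original dataset was retrieved from the NLM website and modified.
Dataset Info
- Features:
  - abstract_id: abstract ID (integer)
  - text: text content (string)
  - location: sequence of locations (integers)
  - label: sequence of labels (strings)
- Splits:
  - train: 3573399948 bytes, 3000000 examples
  - test: 1190766821 bytes, 1000000 examples
  - validation: 1191410723 bytes, 1000000 examples
  - full: 15536883723 bytes, 14393619 examples
- Download size: 21060929078 bytes
- Dataset size: 21492461215 bytes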
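As a quick sanity check on the figures above, the per-split byte counts add up to the reported dataset size. The short Python sketch below reproduces that sum and derives the average size per example; the variable names are illustrative only.

```python
# Byte and example counts copied from the split list above.
split_bytes = {
    "train": 3573399948,
    "test": 1190766821,
    "validation": 1191410723,
    "full": 15536883723,
}
split_examples = {
    "train": 3000000,
    "test": 1000000,
    "validation": 1000000,
    "full": 14393619,
}

total_bytes = sum(split_bytes.values())
print(total_bytes)  # → 21492461215, matching the reported dataset size

# Average stored size of one example in each split, in bytes.
for name, size in split_bytes.items():
    print(f"{name}: {size / split_examples[name]:.0f} bytes per example")
```

The train, validation, and test subsets together contain 5,000,000 examples, compared with 14,393,619 in the full split.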
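Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
Because the abstracts are written in English, the data is biased toward Anglo-centric medical research. If you plan to use a model pre-trained on this dataset in a predominantly non-English-speaking community, it is important to verify whether negative biases are present in the model and to ensure that they are properly mitigated. For example, you could fine-tune your model on a multilingual medical disambiguation dataset, or collect a dataset specific to your use case.
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
[More Information Needed]
Licensing Information
The license information for the dataset is unknown.
Citation Information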
bibtex @inproceedings{wen-etal-2020-medal, title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining", author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva", booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15", pages = "130--135", abstract = "One of the biggest challenges that prohibit the use of many current NLP methods in clinical settings is the availability of public datasets. In this work, we present MeDAL, a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. We pre-trained several models of common architectures on this dataset and empirically showed that such pre-training leads to improved performance and convergence speed when fine-tuning on downstream medical tasks.", }
Contributions




