McGill-NLP/medal

Name: McGill-NLP/medal
Creator: McGill-NLP
Published: 2023-06-13 12:39:11
License: No description available

Source: Hugging Face | Updated: 2023-06-13 | Indexed: 2024-06-15

Download link:

https://hf-mirror.com/datasets/McGill-NLP/medal


Overview:

MeDAL is a large medical text dataset curated for abbreviation disambiguation and intended for natural language understanding pre-training in the medical domain. For example, the abbreviation DHF can be disambiguated to dihydrofolate, diastolic heart failure, dengue hemorrhagic fever, or dihydroxyfumarate. The raw data is 14 GB and is reduced to 4 GB after processing; it is divided into training, test, validation, and full splits. Each instance includes the fields text, location, and label: text is the normalized abstract, location is the index of the substituted abbreviation, and label is the expansion that the abbreviation replaced. The dataset is built from data on the NLM website, and its annotations are described as expert-generated.

Provider:

McGill-NLP

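Summary of Original Information

Dataset Card for MeDAL

Dataset Description

Dataset Summary

MeDAL is a large medical text dataset (14 GB), curated to 4 GB, for natural language understanding pre-training in the medical domain, with a focus on abbreviation disambiguation. For example, DHF can be disambiguated to dihydrofolate, diastolic heart failure, dengue hemorrhagic fever, or dihydroxyfumarate.

Supported Tasks and Leaderboards

Medical abbreviation disambiguation

Languages

English (en)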

Dataset Structure

Data Instances

An example from the training set:

{
  "abstract_id": 14145090,
  "text": "velvet antlers vas are commonly used in traditional chinese medicine and invigorant and contain many PET components for health promotion the velvet antler peptide svap is one of active components in vas based on structural study the svap interacts with tgfβ receptors and disrupts the tgfβ pathway we hypothesized that svap prevents cardiac fibrosis from pressure overload by blocking tgfβ signaling SDRs underwent TAC tac or a sham operation T3 one month rats received either svap mgkgday or vehicle for an additional one month tac surgery induced significant cardiac dysfunction FB activation and fibrosis these effects were improved by treatment with svap in the heart tissue tac remarkably increased the expression of tgfβ and connective tissue growth factor ctgf ROS species C2 and the phosphorylation C2 of smad and ERK kinases erk svap inhibited the increases in reactive oxygen species C2 ctgf expression and the phosphorylation of smad and erk but not tgfβ expression in cultured cardiac fibroblasts angiotensin ii ang ii had similar effects compared to tac surgery such as increases in αsmapositive CFs and collagen synthesis svap eliminated these effects by disrupting tgfβ IB to its receptors and blocking ang iitgfβ downstream signaling these results demonstrated that svap has antifibrotic effects by blocking the tgfβ pathway in CFs",
  "location": [63],
  "label": ["transverse aortic constriction"]
}

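Data Fields

text: the content of the abstract, a string
location: the index of the substituted abbreviation, an integer (given as a list of integers in the example above)
label: the expansion that was substituted, a string (given as a list of strings in the example above)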
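As a minimal sketch of how these fields fit together, the snippet below takes the first 66 whitespace-separated tokens of the training example's text (with the beta characters written as β) and looks up the token at the given location. It assumes that location is a 0-based index into the whitespace-split text, which is consistent with the example (index 63 is the token TAC) but is not stated explicitly in the card. The function name expand_abbreviations is hypothetical.

```python
# First 66 tokens of the "text" field from the training example (a prefix, not the full abstract).
text = (
    "velvet antlers vas are commonly used in traditional chinese medicine "
    "and invigorant and contain many PET components for health promotion "
    "the velvet antler peptide svap is one of active components "
    "in vas based on structural study the svap interacts with "
    "tgfβ receptors and disrupts the tgfβ pathway we hypothesized that "
    "svap prevents cardiac fibrosis from pressure overload by blocking tgfβ "
    "signaling SDRs underwent TAC tac or"
)
example = {
    "text": text,
    "location": [63],
    "label": ["transverse aortic constriction"],
}


def expand_abbreviations(example):
    """Return (abbreviation, expansion) pairs and the text with expansions restored.

    Assumption: each entry of `location` is a 0-based index into text.split().
    """
    tokens = example["text"].split()
    pairs = []
    for index, expansion in zip(example["location"], example["label"]):
        pairs.append((tokens[index], expansion))
        tokens[index] = expansion  # put the full term back in place of the abbreviation
    return pairs, " ".join(tokens)


pairs, restored = expand_abbreviations(example)
print(pairs)
```

Pairing location with label through zip keeps the sketch valid if an instance ever carries more than one substituted abbreviation.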

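Data Splits

The dataset contains the following files:

full_data.csv: the complete dataset, containing all 14 million abstracts
train.csv: the subset used to train the baseline and proposed models
valid.csv: the subset used to validate the models during training for hyperparameter selection
test.csv: the subset used to evaluate the models and report the results in the tables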
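Because the CSV files are large, it can help to read them row by row rather than loading them whole. The sketch below uses only the Python standard library and runs on a tiny in-memory stand-in with one made-up row. The column names (abstract_id, text, location, label) and the single-integer location column are assumptions taken from the field list in this card; the real files may use different headers, so the reader lowercases header names as a hedge. The function name iter_rows is hypothetical.

```python
import csv
import io


def iter_rows(handle):
    """Yield one record per CSV row without loading the whole file into memory.

    Assumption: the header contains abstract_id, text, location and label
    (in any letter case), with one integer location per row.
    """
    reader = csv.DictReader(handle)
    for row in reader:
        record = {key.strip().lower(): value for key, value in row.items()}
        record["abstract_id"] = int(record["abstract_id"])
        record["location"] = int(record["location"])
        yield record


# Tiny in-memory stand-in for train.csv (a made-up row, not real MeDAL data).
sample = io.StringIO(
    "abstract_id,text,location,label\n"
    "1,patients with DHF were admitted,2,diastolic heart failure\n"
)
rows = list(iter_rows(sample))
print(rows[0]["label"])

# For a real file, the same generator could be used as:
# with open("train.csv", newline="", encoding="utf-8") as handle:
#     for record in iter_rows(handle):
#         ...
```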

Dataset Creation

Source Data

The original dataset was retrieved from the NLM website and modified.

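Dataset Info

Features:
- abstract_id: abstract ID, integer
- text: text content, string
- location: sequence of locations, integer
- label: sequence of labels, string
Splits:
- train: 3573399948 bytes, 3000000 examples
- test: 1190766821 bytes, 1000000 examples
- validation: 1191410723 bytes, 1000000 examples
- full: 15536883723 bytes, 14393619 examples
Download size: 21060929078 bytes
Dataset size: 21492461215 bytes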
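The figures above can be cross-checked with a few lines of arithmetic. This sketch sums the per-split byte counts, compares the total with the reported dataset size, and prints each split in decimal gigabytes together with its average bytes per example. All numbers are copied from the list above; only the variable names are introduced here.

```python
# Byte counts and example counts as listed in the dataset info.
split_bytes = {
    "train": 3_573_399_948,
    "test": 1_190_766_821,
    "validation": 1_191_410_723,
    "full": 15_536_883_723,
}
split_examples = {
    "train": 3_000_000,
    "test": 1_000_000,
    "validation": 1_000_000,
    "full": 14_393_619,
}
reported_dataset_size = 21_492_461_215

total_bytes = sum(split_bytes.values())
print("splits sum to reported dataset size:", total_bytes == reported_dataset_size)

for name, size in split_bytes.items():
    gigabytes = size / 1e9  # decimal gigabytes
    per_example = size / split_examples[name]
    print(f"{name}: {gigabytes:.2f} GB, about {per_example:.0f} bytes per example")
```

The four splits add up exactly to the reported dataset size, so that figure counts the train, test, and validation subsets in addition to the full set.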

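Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

Because the abstracts are written in English, the data is biased towards anglo-centric medical research. If you plan to use a model pre-trained on this dataset in a predominantly non-English community, it is important to verify whether negative biases are present in the model and to ensure that they are properly mitigated. For example, you could fine-tune your model on a multilingual medical disambiguation dataset, or collect a dataset specific to your use case.

Other Known Limitations

[More Information Needed]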

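Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The license information for the dataset is unknown.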

Citation Information

@inproceedings{wen-etal-2020-medal,
    title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
    author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
    booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
    pages = "130--135",
    abstract = "One of the biggest challenges that prohibit the use of many current NLP methods in clinical settings is the availability of public datasets. In this work, we present MeDAL, a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. We pre-trained several models of common architectures on this dataset and empirically showed that such pre-training leads to improved performance and convergence speed when fine-tuning on downstream medical tasks.",
}

Contributions

Thanks to @Narsil and @xhlulu for adding this dataset.

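Curated Summary

Dataset Introduction

Construction

MeDAL is built from raw data obtained from the U.S. National Library of Medicine (NLM), which was then processed and annotated; the annotations are described as expert-generated. The dataset targets abbreviation disambiguation in the medical domain: abbreviations in the medical literature are labeled with their intended meaning, yielding a large-scale dataset of 14 million abstracts.

Features

MeDAL is notable for its scale and for its focus on medical abbreviation disambiguation. It contains more than 14 million medical abstracts covering a wide range of medical research areas. Its annotations are described as expert-generated. Each record carries the text, the abbreviation location, and the label, which is what a model needs for training and evaluation.

Usage

MeDAL is suited to natural language understanding pre-training in the medical domain, and to abbreviation disambiguation in particular. Users can load the training, validation, and test splits to train and evaluate models. The text, location, and label fields give a model what it needs to disambiguate abbreviations in medical text. The dataset also provides links to pre-trained models, which can be used directly or fine-tuned further to improve performance on medical tasks.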
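A minimal sketch of this workflow follows. It assumes the Hugging Face datasets library is installed and that the repository id McGill-NLP/medal with the split names train, validation, and test (as listed on this page) can be loaded with load_dataset in streaming mode; this has not been verified against the current Hub state. The helper context_windows is hypothetical and runs here on a made-up toy record, not on real MeDAL data.

```python
def stream_split(split="train"):
    """Stream one MeDAL split from the Hugging Face Hub.

    Assumption: `pip install datasets` has been run and the repository id
    and split names match this page. Streaming avoids downloading every split.
    """
    from datasets import load_dataset  # imported lazily; not needed for the helper below

    return load_dataset("McGill-NLP/medal", split=split, streaming=True)


def context_windows(example, width=5):
    """Return the tokens surrounding each labeled abbreviation.

    Assumption: `location` holds 0-based indices into example["text"].split().
    """
    tokens = example["text"].split()
    windows = []
    for index, expansion in zip(example["location"], example["label"]):
        windows.append(
            {
                "abbreviation": tokens[index],
                "expansion": expansion,
                "left": tokens[max(0, index - width):index],
                "right": tokens[index + 1:index + 1 + width],
            }
        )
    return windows


# Made-up toy record in the same shape as a MeDAL instance.
toy = {
    "text": "patients with DHF were admitted to the ward",
    "location": [2],
    "label": ["diastolic heart failure"],
}
windows = context_windows(toy, width=2)
print(windows[0])

# With network access, real records could be inspected like this:
# for record in stream_split("validation"):
#     print(context_windows(record))
#     break
```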

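Background and Challenges

Background Overview

MeDAL was created by the McGill-NLP team to address abbreviation ambiguity in the medical domain. It was released in 2020; the main researchers are Zhi Wen, Xing Han Lu, and Siva Reddy. Its core research question is how natural language understanding pre-training can improve automatic disambiguation of abbreviations in medical text. The data was extracted from the PubMed database of the U.S. National Library of Medicine (NLM) and modified; it contains more than 14 million medical abstracts, processed for training and testing models. MeDAL provides the medical natural language processing community with a large resource for automatic understanding and processing of medical text.

Current Challenges

The main challenge MeDAL addresses is the ambiguity of medical abbreviations. Abbreviations are used widely in medical text and often have several possible expansions, which makes automatic disambiguation harder. Selecting and labeling valid abbreviation instances from a very large body of raw data is a further technical difficulty in building the dataset. Bias is another concern: because the data comes mainly from English-language medical literature, models may be biased against non-English communities, which calls for additional validation and adjustment when they are applied.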

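Common Scenarios

Classic Use Cases

MeDAL is used for abbreviation disambiguation in the medical domain, with the aim of improving model performance on medical text through natural language understanding pre-training. Its classic use case is training and validating a model's ability to identify abbreviations in medical literature and resolve them correctly, for example resolving "DHF" to "dihydrofolate", "diastolic heart failure", "dengue hemorrhagic fever", or "dihydroxyfumarate" depending on context.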
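To make the task concrete, here is a sketch of a most-frequent-sense baseline, a common reference point in word sense disambiguation. It is an illustration only and is not the method of the MeDAL authors. It builds a sense inventory from (abbreviation, expansion) pairs and predicts the expansion seen most often, ignoring context entirely. The three records are made up for the example, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_sense_inventory(examples):
    """Count how often each abbreviation is labeled with each expansion."""
    inventory = defaultdict(Counter)
    for example in examples:
        tokens = example["text"].split()
        for index, expansion in zip(example["location"], example["label"]):
            inventory[tokens[index]][expansion] += 1
    return inventory


def most_frequent_sense(inventory, abbreviation):
    """Predict the most common expansion, or None for an unseen abbreviation."""
    counts = inventory.get(abbreviation)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up toy records in the shape of MeDAL instances (not real data).
toy_examples = [
    {"text": "outbreak of DHF in children", "location": [2],
     "label": ["dengue hemorrhagic fever"]},
    {"text": "severe DHF cases were reported", "location": [1],
     "label": ["dengue hemorrhagic fever"]},
    {"text": "reduction of DHF by the enzyme", "location": [2],
     "label": ["dihydrofolate"]},
]

inventory = build_sense_inventory(toy_examples)
prediction = most_frequent_sense(inventory, "DHF")
print(prediction)
```

A context-aware model should outperform this baseline, since the baseline always returns the same expansion for a given abbreviation regardless of the surrounding sentence.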

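Academic Problems Addressed

MeDAL addresses abbreviation disambiguation, a problem that is especially prominent in medical literature because the same abbreviation can stand for several different medical terms. By providing large-scale medical text with labeled abbreviation expansions, it gives researchers a standardized benchmark for evaluating and improving natural language processing models on medical text.

Derived Related Work

Researchers have built pre-trained models on MeDAL, such as ELECTRA-medal, for medical text processing tasks. MeDAL has also prompted further research on medical text pre-training and abbreviation disambiguation, including more efficient model architectures, multilingual medical text processing, and cross-domain knowledge transfer.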

The content above was collected and summarized by 遇见数据集.
