Data from: Natural language processing and recurrent network models for identifying genomic mutation-associated cancer treatment change from patient progress notes

Mendeley Data. Updated 2024-06-25; indexed 2024-06-27.

Download link:

https://zenodo.org/records/4958966

Description:

Objectives: Natural language processing (NLP) and machine learning approaches were used to build classifiers that identify genomic-related treatment changes in the free-text visit progress notes of cancer patients. Methods: We obtained 5,889 de-identified progress reports (2,439 words on average) for 755 cancer patients who had undergone clinical next-generation sequencing (NGS) testing at Wake Forest Baptist Comprehensive Cancer Center. An NLP system was implemented to process the free-text data and extract NGS-related information. Three types of recurrent neural network (RNN), namely gated recurrent unit (GRU), long short-term memory (LSTM), and bidirectional LSTM (LSTM_Bi), were applied to classify documents into treatment-change and no-treatment-change groups. We further compared the performance of the RNNs with that of five machine learning algorithms: Naive Bayes (NB), K-Nearest Neighbor (KNN), Support Vector Machine for classification (SVC), Random Forest (RF), and Logistic Regression (LR). Results: Overall, the RNNs outperformed the traditional machine learning algorithms, and LSTM_Bi showed the best performance among the RNNs in terms of accuracy, precision, recall, and F1 score. In addition, pre-trained word embeddings improved the accuracy of the LSTM by 3.4% and reduced training time by more than 60%. Discussion and Conclusion: NLP and RNN-based text mining solutions have demonstrated advantages in information retrieval and document classification tasks for unstructured clinical progress notes.
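The abstract compares classifiers on accuracy, precision, recall, and F1 score but does not say how these were computed. As a reminder of the standard definitions, here is a minimal sketch in plain Python. It assumes the treatment-change group is the positive class (label 1); the function name `binary_metrics`, the `BinaryMetrics` container, and the example labels are all hypothetical and do not come from the dataset.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class.

    y_true and y_pred are equal-length lists of labels. Undefined ratios
    (zero denominator) are reported as 0.0, which is one common convention.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one label is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return BinaryMetrics(accuracy, precision, recall, f1)


# Made-up labels for illustration only: 1 = treatment change, 0 = no change.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

Reporting all four numbers matters when the two document groups are unbalanced, because accuracy alone can look high for a classifier that rarely predicts a treatment change.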

Created:

2023-06-28