Data Sheet 1_Identification and validation of respiratory virus immunization using natural language processing.docx
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/Data_Sheet_1_Identification_and_validation_of_respiratory_virus_immunization_using_natural_language_processing_docx/31230355
Description:
Introduction: Electronic health record (EHR)-based research often relies on structured data elements, such as ICD-10-CM and CPT codes, to identify clinical diagnoses and procedures. However, some information, such as the administration of immunizations, may be captured more reliably in the text-based narrative sections of the patient's record. We developed a rule-based natural language processing (NLP) algorithm to identify the administration of immunizations for COVID-19, influenza, and RSV using a combination of synthetic and publicly available data.
Methods: After applying standard NLP preprocessing techniques to clean and standardize the text, we implemented a multi-stage, rule-based algorithm. We applied a dictionary of general keywords to identify potential immunizations, and a set of specific keywords, which leveraged grammatical dependencies in the text, to increase accuracy. We implemented additional rules to account for negation and immunization recommendations. The algorithm was applied to a sample of 20,000 patients from the study population. We measured performance in two ways, by manually reviewing 400 individual notes and by assessing agreement with structured data, using precision and recall as evaluation metrics.
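The multi-stage approach described above can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword lists, the negation and recommendation cues, and the function names are all hypothetical, and a simple same-sentence check stands in for the grammatical-dependency rules mentioned in the abstract.

```python
import re

# All patterns below are illustrative assumptions, not the study's dictionaries.
# Stage 1: general keywords that flag a possible immunization mention.
GENERAL = re.compile(r"\b(vaccin\w*|immuniz\w*|shot|booster)\b")
# Stage 2: specific keywords that tie the mention to one virus.
SPECIFIC = {
    "covid19": re.compile(r"\b(covid(-19)?|sars-cov-2|moderna|pfizer)\b"),
    "influenza": re.compile(r"\b(influenza|flu)\b"),
    "rsv": re.compile(r"\b(rsv|respiratory syncytial virus)\b"),
}
# Stage 3: cues that the vaccine was not given or was only recommended.
NEGATION = re.compile(r"\b(no|not|never|denies|declined|refused|without)\b")
RECOMMEND = re.compile(
    r"\b(recommend\w*|advis\w*|due for|should (get|receive)|offered|discussed)\b"
)


def split_sentences(text):
    """Lowercase the note and split it on sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?;])\s+", text.lower())
    return [p.strip() for p in parts if p.strip()]


def find_immunizations(note):
    """Return the set of virus labels with an affirmed administration mention."""
    found = set()
    for sentence in split_sentences(note):
        if not GENERAL.search(sentence):
            continue  # no immunization keyword in this sentence
        if NEGATION.search(sentence) or RECOMMEND.search(sentence):
            continue  # negated or merely recommended
        for label, pattern in SPECIFIC.items():
            if pattern.search(sentence):
                found.add(label)
    return found


note = (
    "Patient received flu shot today. Declined COVID-19 vaccine. "
    "RSV vaccine recommended at next visit."
)
print(find_immunizations(note))
```

Treating any negation or recommendation cue in the sentence as disqualifying is a deliberately coarse choice; a dependency parse would allow the cue to be attached to the specific vaccine it modifies.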
Results: In the first evaluation, which compared the algorithm's output against manual clinical review of an independent test dataset, precision was 71% and recall was 97% for COVID-19 immunization; 91% and 92% for influenza; and 57% and 96% for RSV. In a second evaluation using structured data (i.e., ICD-10-CM, CPT, and CVX codes) as the gold standard, precision was 72% and recall was 9% for COVID-19 immunization; 71% and 12% for influenza; and 78% and 10% for RSV.
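For reference, precision and recall against a gold standard can be computed as in the sketch below. The function name and the patient identifiers are hypothetical toy values chosen for illustration; they are not data from the study.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for two sets of positive identifiers.

    predicted: items the algorithm flagged as immunized
    gold: items the reference standard marks as immunized
    """
    true_pos = len(predicted & gold)
    false_pos = len(predicted - gold)
    false_neg = len(gold - predicted)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy example (assumed values): the text-based algorithm flags 4 patients,
# while the structured data records 10 immunized patients.
nlp_flagged = {"p1", "p2", "p3", "p4"}
structured = {"p1", "p2", "p3", "p5", "p6", "p7", "p8", "p9", "p10", "p11"}

precision, recall = precision_recall(nlp_flagged, structured)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The toy numbers mirror the pattern in the second evaluation: most flagged patients are confirmed by the reference, so precision is fairly high, but the reference contains many immunizations that never appear in the text, so recall is low.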
Discussion: We demonstrated the effectiveness of NLP methods in identifying immunizations from EHR text. High precision and recall for COVID-19 and influenza immunizations suggest that the algorithm can effectively identify immunization references when they are present in the text; however, low recall when compared to the structured data suggests that many immunizations recorded in the structured data are not mentioned in the text. The algorithm therefore has specialized utility for augmenting immunization records using text data from individual notes, and its extensibility and generalizability allow it to serve as a framework for future EHR-based research.
Created:
2026-02-02



