Machine Learning Models for Predicting Infant Undernutrition in Rural Rajasthan, India Code and Dataset
Source: DataCite Commons. Updated: 2024-05-16. Indexed: 2024-08-19.
Download link:
https://figshare.com/articles/dataset/Machine_Learning_Models_for_Predicting_Infant_Undernutrition_in_Rural_Rajasthan_India_Code_and_Dataset/25833919
Resource description:
Machine Learning Models for Predicting Infant Undernutrition in Rural Rajasthan, India

Objective:
1. Develop machine learning models to predict the following health outcomes, and assess their performance in terms of the change from pre-test probability to post-test probability:
● Low birth weight, compared with births reported as normal.
● Severe underweight reported before 2 months, compared with infants reported as normal weight.
● Severe underweight reported before 6 months, compared with infants reported as normal weight.
● Severe acute malnutrition (SAM) in children found during household visits by KB monitors, compared with children found to be normal.
● Persistent severe underweight, compared with cases where severe underweight status resolved after treatment.
The change from pre-test to post-test probability is a key metric for public health machine learning models. It indicates how much the likelihood of a health outcome shifts following diagnostic testing or intervention, and so informs clinical decision-making and resource allocation.
2. Perform a comparative study of three machine learning models:
● Multivariate logistic regression
● Random forest
● Deep neural network

Datasets:
1. KhushiBaby_Cleaned_Anonymized_Dataset2.csv: Variables associated with pregnant women receiving antenatal care checkups from auxiliary nurse midwives (ANMs) during the Maternal and Child Health Nutrition (MCHN) session, including sociodemographic variables.
2. Master_children_dataset_Udaipur.csv: Variables associated with children receiving immunization services from ANMs during the MCHN session, including sociodemographic variables. The pregnancy ID is linked to the child if both mother and child are tracked in the Khushi Baby system.
3. all_scores2.csv: Data quality scores of ANMs, based on the quality of their work during the MCHN day and computed as a time series. The calculation is rule-based, as detailed in the publication: https://research.google/pubs/measuring-data-collection-diligence-for-community-healthcare/
4. Malnutrition.xlsx: Variables aggregated at the community level as a time series, used to understand seasonal trends, blind spots, and hot spots in malnutrition.
5. Rch_23_geo_mpi.csv: Multidimensional Poverty Index (MPI) scores at the most granular (village) level, determined according to the principles of the Global MPI. https://docs.google.com/document/d/1VIEIyRRc3F8wqPuKg-jjmQ5_wN7P51z8CpCR2Sjc6c8/edit
Data Dictionary & Keys - Values.xlsx contains the data dictionary, with keys and values, for all five datasets.

Scripts:
Each of the five health outcomes has its own Python script, allowing for precise debugging and review. The code is extensively commented for clarity.
1. 1_low_birth_weight_model.py: Script for developing the low birth weight prediction model.
2. 2_severe_underweight_at_<=2_months.py: Script for predicting severe underweight at <=2 months.
3. 3_severe_underweight_<=6_months.py: Script for predicting severe underweight at <=6 months.
4. 4_SAM_Household_Visit_By_KB_Monitors.py: Script for predicting severe acute malnutrition during household visits by KB monitors.
5. 5_Infant_severe_underweight_persisted_or_resolved.py: Script for predicting persistence or resolution of severe underweight in infants.
Additionally, there is an R script for plotting the Fagan nomogram, which visualizes predictive performance as the shift from pre-test to post-test probability.

Analysis Methodology:
1. Data wrangling covers preprocessing variables from the five datasets listed above and generating potential predictors for each health outcome based on existing literature. Exploratory data analysis has been omitted from this research.
2. A generalized linear model (GLM) is used to identify statistically significant variables (p-values < 0.05).
3. For logistic regression, the variables identified via GLM are checked against the model's assumptions; those violating the assumptions are excluded from the regression analysis.
4. Random forest and deep neural network models incorporate the variables identified via GLM along with mandatory variables, the ANM average data quality score, and the Multidimensional Poverty Index, as these are prominent factors in adverse health outcomes. These models, random forest in particular, are well suited to handling outliers and imbalanced datasets.
5. Following feature selection, the Synthetic Minority Over-sampling Technique (SMOTE) is applied for all three machine learning models to handle the highly imbalanced minority class. The models are then trained and tested on unseen observations.
6. Evaluation metrics for all three models are computed and compared, focusing on positive predictive value, sensitivity, and specificity.
7. The confusion matrix, ROC curve, and Fagan's nomogram are plotted for all five models.
8. SHAP values are calculated and visualized to interpret the models, followed by comparisons with the literature, leading to final interpretations.
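The pre-test to post-test probability shift that a Fagan nomogram visualizes can also be computed directly from a confusion matrix. The sketch below is illustrative only: it is not taken from the dataset's Python or R scripts, the function names are hypothetical, and the counts are made-up numbers rather than results from this study. It assumes the standard likelihood-ratio form of Bayes' rule (post-test odds = pre-test odds × likelihood ratio).

```python
def metrics_from_confusion(tp, fp, tn, fn):
    """Return sensitivity, specificity, and positive predictive value."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    return sensitivity, specificity, ppv


def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a binary test or classifier."""
    if not (0 < sensitivity < 1 and 0 < specificity < 1):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pretest_prob, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    if not 0 < pretest_prob < 1:
        raise ValueError("pretest_prob must lie strictly between 0 and 1")
    pretest_odds = pretest_prob / (1 - pretest_prob)
    post_odds = pretest_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Made-up counts for illustration: 1000 infants, 100 with the outcome.
tp, fn, tn, fp = 80, 20, 810, 90
sens, spec, ppv = metrics_from_confusion(tp, fp, tn, fn)
lr_pos, lr_neg = likelihood_ratios(sens, spec)

# Use the sample prevalence as the pre-test probability.
prevalence = (tp + fn) / (tp + fp + tn + fn)
prob_if_flagged = post_test_probability(prevalence, lr_pos)
prob_if_cleared = post_test_probability(prevalence, lr_neg)

print(f"sensitivity={sens:.2f} specificity={spec:.2f} PPV={ppv:.3f}")
print(f"post-test probability after a positive prediction: {prob_if_flagged:.3f}")
print(f"post-test probability after a negative prediction: {prob_if_cleared:.3f}")
```

With these illustrative counts, a positive prediction raises the probability of the outcome from 0.10 to roughly 0.47, and a negative prediction lowers it to roughly 0.02. When the pre-test probability is set to the sample prevalence, the post-test probability after a positive prediction coincides with the positive predictive value, which is a quick consistency check on the arithmetic.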
Provider:
figshare
Created:
2024-05-16



