Machine Learning Models for Predicting Infant Undernutrition in Rural Rajasthan, India Code and Dataset

Name: Machine Learning Models for Predicting Infant Undernutrition in Rural Rajasthan, India Code and Dataset
Creator: figshare
Published: 2025-06-01 06:08:54
License: not specified

Source: DataCite Commons. Updated 2025-06-01; indexed 2024-08-19.

Download link:

https://figshare.com/articles/dataset/Machine_Learning_Models_for_Predicting_Infant_Undernutrition_in_Rural_Rajasthan_India_Code_and_Dataset/25833919/1

Description:

Machine Learning Models for Predicting Infant Undernutrition in Rural Rajasthan, India

Objective:
1. Develop machine learning models to predict the following health outcomes and assess their performance from pre-test probability to post-test probability:
● Low birth weight, compared with births reported as normal.
● Severe underweight reported before 2 months, compared with infants reported as normal weight.
● Severe underweight reported before 6 months, compared with infants reported as normal weight.
● Severe acute malnutrition (SAM) in children found after household visits by KB monitors, compared with children found to be normal.
● Persistent severe underweight, compared with cases where the severe underweight status was resolved after treatment.
Pre-test to post-test probability serves as a key metric in public health machine learning models. It indicates how much the likelihood of a health outcome changes following diagnostic testing or intervention, and so informs clinical decision-making and resource allocation.
2. Perform a comparative study of three machine learning models:
● Multivariate logistic regression
● Random forest
● Deep neural network

Datasets:
1. KhushiBaby_Cleaned_Anonymized_Dataset2.csv: Variables associated with pregnant women receiving antenatal care checkups from auxiliary nurse midwives (ANMs) during the maternal and child health nutrition (MCHN) session, including sociodemographic variables.
2. Master_children_dataset_Udaipur.csv: Variables associated with children receiving immunization services from ANMs during the MCHN session, including sociodemographic variables. The pregnancy ID links a child to the mother when both are tracked in the Khushi Baby system.
3. all_scores2.csv: Data quality scores for ANMs, computed as a time series from their qualitative work on the MCHN day. The calculation is rule-based, as detailed in the publication: https://research.google/pubs/measuring-data-collection-diligence-for-community-healthcare/
4. Malnutrition.xlsx: Community-level variables aggregated as a time series to understand seasonal trends, blind spots, and hot spots in malnutrition.
5. Rch_23_geo_mpi.csv: Multidimensional Poverty Index (MPI) scores at the most granular (village) level, determined according to the principles of the Global MPI. https://docs.google.com/document/d/1VIEIyRRc3F8wqPuKg-jjmQ5_wN7P51z8CpCR2Sjc6c8/edit
Data Dictionary & Keys - Values.xlsx contains the data dictionary, with keys and values, for all five datasets.

Scripts:
Each of the five health outcomes has its own Python script, which allows precise debugging and review. The code is extensively commented for clarity.
1. 1_low_birth_weight_model.py: Script for developing the low birth weight prediction model.
2. 2_severe_underweight_at_<=2_months.py: Script for predicting severe underweight at <=2 months.
3. 3_severe_underweight_<=6_months.py: Script for predicting severe underweight at <=6 months.
4. 4_SAM_Household_Visit_By_KB_Monitors.py: Script for predicting severe acute malnutrition during household visits by KB monitors.
5. 5_Infant_severe_underweight_persisted_or_resolved.py: Script for predicting persistence or resolution of severe underweight in infants.
Additionally, there is an R script for plotting the Fagan nomogram, which visualizes predictive performance from pre-test to post-test probability.

Analysis Methodology:
1. Data wrangling covers preprocessing variables from the five datasets listed above and generating potential predictors for each health outcome based on existing literature. Exploratory data analysis has been omitted from this research.
2. A Generalized Linear Model (GLM) is used to identify statistically significant variables (p-value < 0.05).
3. For logistic regression, the variables identified via GLM are checked against the model's assumptions; those that violate the assumptions are excluded from the regression analysis.
4. The random forest and deep neural network models incorporate the variables identified via GLM along with two mandatory variables, the ANM average data quality score and the Multidimensional Poverty Index score, as these are prominent factors in adverse health outcomes. These models, random forest in particular, are well suited to handling outliers and imbalanced datasets.
5. Following feature selection, the Synthetic Minority Over-sampling Technique (SMOTE) is applied for all three machine learning models to handle the highly imbalanced minority class. The models are then trained, and tested on unseen observations.
6. Evaluation metrics for all three models are computed and compared, focusing on positive predictive value, sensitivity, and specificity.
7. A confusion matrix, ROC curve, and Fagan nomogram are plotted for all five models.
8. SHAP values are calculated and visualized to interpret the models, followed by comparisons with the literature, leading to final interpretations.
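
The pre-test to post-test probability assessment described above (the quantity a Fagan nomogram reads off graphically) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code shipped with the dataset: the function names and the confusion-matrix counts below are hypothetical, and the pre-test probability is simply taken to be the outcome prevalence in the test sample.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute metrics from the four cells of a 2x2 confusion matrix.

    Note: a specificity of exactly 1 (or 0) would cause a division by
    zero in the likelihood ratios; this sketch does not guard against it.
    """
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "lr_pos": sensitivity / (1 - specificity),  # positive likelihood ratio
        "lr_neg": (1 - sensitivity) / specificity,  # negative likelihood ratio
    }


def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability.

    Bayes' rule in odds form: post-test odds = pre-test odds * LR.
    """
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Hypothetical counts, for illustration only (not results from the study).
tp, fp, fn, tn = 80, 90, 20, 810
metrics = diagnostic_metrics(tp, fp, fn, tn)

# Assumption: pre-test probability = prevalence of the outcome in this sample.
pre_test = (tp + fn) / (tp + fp + fn + tn)

print("pre-test probability:", round(pre_test, 3))
print("post-test probability, model flags the case:",
      round(post_test_probability(pre_test, metrics["lr_pos"]), 3))
print("post-test probability, model does not flag the case:",
      round(post_test_probability(pre_test, metrics["lr_neg"]), 3))
```

When the pre-test probability equals the sample prevalence, the post-test probability after a positive prediction coincides with the positive predictive value, which is a convenient sanity check on the arithmetic.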

Provider: figshare

Created: 2024-05-16
