Predicting Diabetes From Tracking Medical Records
Source: DataCite Commons. Updated: 2026-03-20. Indexed: 2026-05-04.
Download link:
https://data.mendeley.com/datasets/nxnty5g7y6/1
Description:
This dataset contains 1,168 medical records designed for predicting the onset of diabetes based on routine diagnostic measurements. Each record includes eight clinical features commonly collected during standard health screenings, along with a binary outcome variable indicating whether the patient was diagnosed with diabetes.
Features:
Pregnancies: Number of times the patient has been pregnant
Glucose: Plasma glucose concentration from a 2-hour oral glucose tolerance test (mg/dL)
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skinfold thickness (mm)
Insulin: 2-hour serum insulin level (μU/mL)
BMI: Body mass index calculated as weight in kg / (height in m)²
DiabetesPedigreeFunction: A composite score reflecting the likelihood of diabetes based on family history
Age: Age of the patient in years
Outcome: Binary target variable (1 = diabetes diagnosed, 0 = no diabetes)
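The BMI definition in the feature list can be checked with a short calculation. This is a minimal sketch in Python; the function name `bmi` and the example weight and height are hypothetical and are not taken from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kg divided by the square of height in m."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / (height_m ** 2)


# Illustrative values only (not a record from the dataset).
example = bmi(70.0, 1.75)
print(round(example, 1))
```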
The dataset comprises 771 negative cases and 397 positive cases, representing a class imbalance ratio of approximately 66:34. Patient ages range from 21 to 81 years. Some feature columns contain zero values (e.g., Glucose, BloodPressure, SkinThickness, Insulin, BMI) that likely represent missing or unrecorded measurements rather than true biological zeros; researchers should account for this during preprocessing.
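One possible way to handle the zero values during preprocessing is to convert them to explicit missing markers before any imputation or modeling. The sketch below uses only the Python standard library. It assumes the exported CSV has a header row whose column names match the feature names listed above; the two sample rows are made up for illustration and are not records from the dataset.

```python
import csv
import io

# Columns where a 0 is assumed to mean "not recorded" (per the description).
# Pregnancies is deliberately excluded: 0 is a valid value there.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def load_records(csv_text):
    """Parse CSV text into dicts of floats, mapping suspicious zeros to None."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {name: float(value) for name, value in raw.items()}
        for name in ZERO_AS_MISSING:
            if row[name] == 0:
                row[name] = None  # treat as missing, not as a true measurement
        rows.append(row)
    return rows


def missing_counts(rows):
    """Count missing (None) entries for each column in ZERO_AS_MISSING."""
    return {name: sum(r[name] is None for r in rows) for name in ZERO_AS_MISSING}


# Hypothetical sample in the assumed column layout.
SAMPLE = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,120,70,0,0,28.5,0.35,33,0
0,0,64,22,90,0,0.50,45,1
"""

rows = load_records(SAMPLE)
print(missing_counts(rows))
```

Counting the missing entries per column first helps decide whether to impute a column (for example with its median) or to drop it; that choice is left to the analyst.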
This dataset is well suited for supervised binary classification tasks and can be used to benchmark machine learning models such as logistic regression, decision trees, random forests, gradient boosting, support vector machines, and neural networks. It is also appropriate for educational purposes in data science and healthcare analytics curricula, including exercises in exploratory data analysis, feature engineering, handling missing values, class imbalance techniques, and model evaluation.
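When evaluating models on this dataset, it may help to keep the class counts in mind. The sketch below derives two reference numbers from the stated counts (771 negative, 397 positive): the accuracy of a trivial classifier that always predicts the majority class, and inverse-frequency class weights of the form n_samples / (n_classes * class_count). The helper name `balanced_weights` is hypothetical, and this weighting is only one of several ways to address class imbalance.

```python
# Class counts as stated in the dataset description.
NEGATIVE = 771
POSITIVE = 397
total = NEGATIVE + POSITIVE

# Accuracy of always predicting the majority class; a model should beat this.
majority_baseline_accuracy = max(NEGATIVE, POSITIVE) / total


def balanced_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


weights = balanced_weights({0: NEGATIVE, 1: POSITIVE})

print(f"records: {total}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
print(f"class weights: {weights}")
```

Because roughly two thirds of the records are negative, plain accuracy can look acceptable for a model that rarely predicts diabetes, so metrics such as recall, precision, or ROC AUC are worth reporting alongside it.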
The data was prepared and exported from VertexMD, a local-first electronic health records application designed for personal medical record tracking and interoperability research.
Provider:
Mendeley Data
Created:
2026-03-20



