Multi-Class Chronic Disease Data Warehouse (healthcare)

Mendeley Data, indexed 2026-05-21

Download link:

https://data.mendeley.com/datasets/6vnkkf5hv3

Description:

This dataset is an integrated medical data warehouse built to support multi-class chronic disease prediction. It combines three publicly available healthcare datasets (diabetes, heart disease, and hypertension) sourced from Kaggle and unified using a Medallion Architecture (Bronze, Silver, Gold) implemented in Microsoft SQL Server. The final Gold-layer dataset contains 280,985 patient records and 38 features, with no missing values. Each record corresponds to a patient and includes both a binary classification (Normal/Abnormal) and an 8-class sublabel representing disease combinations, enabling co-morbidity analysis and predictive modeling.

The dataset is structured as a denormalized flat table derived from a star schema and captures patient profiles across five domains: demographic attributes (e.g., age, gender), anthropometric measures (e.g., BMI), lifestyle indicators (e.g., smoking, physical activity, stress), clinical measurements (e.g., glucose, HbA1c, cholesterol, blood pressure), and disease indicators. Features include both categorical and continuous variables, such as normalized age, lipid profiles, inflammatory markers (CRP), and cardiovascular metrics. Disease status is encoded through binary flags and a composite categorical sublabel capturing all possible combinations of diabetes (DI), heart disease (HT), and hypertension (HY).

The dataset was designed to address the limitations of single-disease modeling by enabling simultaneous prediction of multiple chronic conditions. It supports the study of shared risk factors and cross-disease interactions, providing a unified feature space for machine learning applications. This facilitates the development of clinical decision-support systems capable of early detection, risk stratification, and holistic patient assessment. Provided as a UTF-8 encoded CSV file, the dataset is compatible with major analytical platforms such as Python, R, and SQL tools.
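The stated properties of the Gold-layer file (280,985 records, 38 features, no missing values, UTF-8 CSV) can be checked after download. The following is a minimal sketch using only the Python standard library; it assumes the file has a single header row and that a missing value would appear as an empty cell, neither of which the description confirms. The function names are hypothetical.

```python
import csv

# Figures quoted in the dataset description.
EXPECTED_ROWS = 280_985
EXPECTED_COLUMNS = 38


def summarize_csv(path: str) -> dict:
    """Count data rows, columns, and empty cells in a UTF-8 CSV with a header row."""
    rows = 0
    empty_cells = 0
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # assumption: the first row holds column names
        for record in reader:
            rows += 1
            # Assumption: blank strings and short rows count as missing values.
            empty_cells += sum(1 for cell in record if cell.strip() == "")
            empty_cells += max(0, len(header) - len(record))
    return {"rows": rows, "columns": len(header), "empty_cells": empty_cells}


def matches_description(summary: dict) -> bool:
    """Compare a summary against the row, column, and completeness claims."""
    return (
        summary["rows"] == EXPECTED_ROWS
        and summary["columns"] == EXPECTED_COLUMNS
        and summary["empty_cells"] == 0
    )
```

Calling `matches_description(summarize_csv(path))` on the downloaded file (the file name is not given in the listing) returns `True` only if all three claims hold under the assumptions above.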
Ethical considerations are addressed through full anonymization of all source data; no personally identifiable information is included. Potential applications include multi-class classification, co-morbidity analysis, feature importance studies, and benchmarking of machine learning models. The dataset also serves as an educational resource for data warehousing and healthcare analytics. Keywords associated with the dataset include chronic disease classification, Medallion Architecture, clinical decision support, and machine learning-based healthcare analytics.
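To illustrate how three binary disease flags give rise to an 8-class sublabel and a binary Normal/Abnormal label, here is a minimal sketch. The codes DI, HT, and HY come from the description; the `"+"` separator, the `"NORMAL"` class name, and the function names are assumptions for illustration and may not match the label strings used in the actual file.

```python
from itertools import product

# Disease abbreviations as given in the dataset description.
DISEASE_CODES = ("DI", "HT", "HY")


def encode_sublabel(diabetes: int, heart_disease: int, hypertension: int) -> str:
    """Combine three binary flags into one composite sublabel string."""
    flags = (diabetes, heart_disease, hypertension)
    if any(flag not in (0, 1) for flag in flags):
        raise ValueError("each disease flag must be 0 or 1")
    present = [code for code, flag in zip(DISEASE_CODES, flags) if flag]
    # "NORMAL" is an assumed name for the class with no disease present.
    return "+".join(present) if present else "NORMAL"


def encode_binary_label(sublabel: str) -> str:
    """Collapse the composite sublabel into the Normal/Abnormal label."""
    return "Normal" if sublabel == "NORMAL" else "Abnormal"


# Enumerating every flag combination yields 2**3 = 8 distinct classes.
ALL_SUBLABELS = [encode_sublabel(*flags) for flags in product((0, 1), repeat=3)]
```

Under this encoding, a patient with diabetes and hypertension but no heart disease maps to `DI+HY` and is labeled Abnormal; only the all-zero combination is Normal.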

Created:

2026-04-27
