Indicators of Heart Disease (2022 UPDATE)
Source: www.kaggle.com · Updated 2023-10-12 · Indexed 2025-03-25
Download link:
https://www.kaggle.com/kamilpytlak/personal-key-indicators-of-heart-disease
Resource description:
# Key Indicators of Heart Disease
## 2022 annual CDC survey data of 400k+ adults related to their health status
### What subject does the dataset cover?
According to the [CDC](https://www.cdc.gov/heartdisease/risk_factors.htm), heart disease is a leading cause of death for people of most races in the U.S. (African Americans, American Indians and Alaska Natives, and whites). About half of all Americans (47%) have at least 1 of 3 major risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Other key indicators include diabetes status, obesity (high BMI), not getting enough physical activity, or drinking too much alcohol. Identifying and preventing the factors that have the greatest impact on heart disease is very important in healthcare. In turn, developments in computing allow the application of machine learning methods to detect "patterns" in the data that can predict a patient's condition.
### Where did the dataset come from and what processing has it undergone?
The dataset originally comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to collect data on the health status of U.S. residents. As described by the [CDC](https://www.cdc.gov/heartdisease/risk_factors.htm): "Established in 1984 with 15 states, BRFSS now collects data in all 50 states, the District of Columbia, and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world." The most recent dataset includes data from 2023. In this dataset, I noticed many factors (questions) that directly or indirectly influence heart disease, so I selected the most relevant variables from it. I am also sharing two versions of the most recent dataset: one with NaNs and one without them.
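As a rough illustration of how a version with NaNs relates to one without, the sketch below drops incomplete rows with pandas. This is only one plausible approach and not necessarily the one used for this dataset. Apart from "HadHeartAttack", the column names and all values are made up for the example.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame: only "HadHeartAttack" is a column name taken from the
# description; the other columns and every value are illustrative assumptions.
with_nans = pd.DataFrame(
    {
        "HadHeartAttack": ["No", "Yes", "No", "No"],
        "BMI": [27.4, np.nan, 31.0, 22.8],
        "SleepHours": [7.0, 6.0, np.nan, 8.0],
    }
)

# Share of missing values per column, worth checking before dropping anything.
missing_share = with_nans.isna().mean()

# Keep only rows with no missing value in any column.
without_nans = with_nans.dropna().reset_index(drop=True)

print(missing_share)
print(without_nans)
```

Dropping rows is the simplest option but discards every respondent with even one unanswered question, so the version with NaNs may be preferable when you plan to impute values yourself.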
### What can you do with this dataset?
As described above, the original dataset of nearly 300 variables was reduced to 40 variables. In addition to classical EDA, this dataset can be used to apply a number of machine learning methods, especially classifier models (logistic regression, SVM, random forest, etc.). You should treat the variable "HadHeartAttack" as binary ("Yes" - respondent had heart disease; "No" - respondent did not have heart disease). Note, however, that the classes are unbalanced, so fitting a model without accounting for the imbalance is not advisable. Adjusting the class weights or undersampling the majority class should yield much better results. Based on the dataset, I built a logistic regression model and embedded it in an application that might inspire you: https://share.streamlit.io/kamilpytlak/heart-condition-checker/main/app.py

Can you indicate which variables have a significant effect on the likelihood of heart disease?
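To make the point about class imbalance concrete, here is a minimal sketch that compares a plain logistic regression with one using scikit-learn's `class_weight="balanced"` option. It runs on synthetic data with a rare positive class, not on this dataset. The 6% positive rate and the single numeric feature are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (an assumption for illustration):
# a rare positive class and one feature that is shifted upward for positives.
rng = np.random.default_rng(0)
n = 5000
y = (rng.random(n) < 0.06).astype(int)
X = (rng.normal(size=n) + 1.0 * y).reshape(-1, 1)

# Stratify so that train and test keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Same model twice: once unweighted, once with classes reweighted
# inversely to their frequency in the training data.
plain = LogisticRegression().fit(X_train, y_train)
weighted = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

# Recall on the positive class shows how many true cases each model finds.
recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))

print(f"recall without class weights: {recall_plain:.2f}")
print(f"recall with balanced weights: {recall_weighted:.2f}")
```

The unweighted model tends to predict the majority class almost everywhere, so its recall on the rare class is low. Reweighting trades some precision for recall, which is why metrics other than accuracy are worth reporting on data like this.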
### What steps did you use to convert the dataset?
Check out this notebook in my GitHub repository: https://github.com/kamilpytlak/data-science-projects/blob/main/heart-disease-prediction/2022/notebooks/data_processing.ipynb
Provider:
www.kaggle.com



