Cardiovascular Disease Prediction Dataset
Source: Alibaba Cloud Tianchi (阿里云天池). Updated 2026-06-09; indexed 2025-10-18.
Download link:
https://tianchi.aliyun.com/dataset/212054
Resource description:
Dataset Overview
This dataset contains 70,000 patient medical records intended for cardiovascular disease risk prediction research. It is drawn from real medical examination data and covers clinical indicators and lifestyle factors closely related to cardiovascular health.
Key Feature Variables
The dataset includes 12 variables: the 11 input features listed below and one target variable.
Demographic Features:
age - Patient age (years)
gender - Gender (1: Female, 2: Male)
height - Height (cm)
weight - Weight (kg)
Clinical Measurement Indicators:
ap_hi - Systolic blood pressure (mmHg)
ap_lo - Diastolic blood pressure (mmHg)
cholesterol - Cholesterol level (1: Normal, 2: Elevated, 3: Very high)
gluc - Blood glucose level (1: Normal, 2: Elevated, 3: Very high)
Lifestyle Factors:
smoke - Smoking habit (0: Non-smoker, 1: Smoker)
alco - Alcohol consumption habit (0: Non-drinker, 1: Drinker)
active - Physical activity level (0: Inactive, 1: Active)
Target Variable
cardio - Cardiovascular disease diagnosis (0: No disease, 1: With disease)
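As a minimal sketch, the coding scheme listed above can be written down as a lookup table and used to check individual records. The column names and allowed codes come from the listing; the `CODED_COLUMNS` layout, the `validate_row` helper, and the sample rows are hypothetical, and the file's actual delimiter and column order are not stated in the source.

```python
# Allowed codes for the coded columns, taken from the feature list above.
# The dictionary layout and the helper name are illustrative assumptions.
CODED_COLUMNS = {
    "gender": {1, 2},          # 1: female, 2: male
    "cholesterol": {1, 2, 3},  # ordinal levels 1 to 3
    "gluc": {1, 2, 3},         # ordinal levels 1 to 3
    "smoke": {0, 1},
    "alco": {0, 1},
    "active": {0, 1},
    "cardio": {0, 1},          # target variable
}
NUMERIC_COLUMNS = ["age", "height", "weight", "ap_hi", "ap_lo"]


def validate_row(row):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name in NUMERIC_COLUMNS:
        value = row.get(name)
        if not isinstance(value, (int, float)) or isinstance(value, bool):
            problems.append(f"{name}: expected a number, got {value!r}")
    for name, allowed in CODED_COLUMNS.items():
        if row.get(name) not in allowed:
            problems.append(f"{name}: {row.get(name)!r} not in {sorted(allowed)}")
    return problems


# Made-up rows for illustration, not taken from the dataset.
good = {"age": 52, "gender": 1, "height": 165, "weight": 70.0,
        "ap_hi": 120, "ap_lo": 80, "cholesterol": 1, "gluc": 1,
        "smoke": 0, "alco": 0, "active": 1, "cardio": 0}
bad = dict(good, cholesterol=4, weight=None)

print(validate_row(good))
print(validate_row(bad))
```

The first call reports no problems; the second reports the missing weight and the out-of-range cholesterol code.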
Dataset Characteristics
- Data Scale: Approximately 70,000 records, suitable for machine learning modeling
- Class Balance: Near 1:1 ratio of positive and negative samples, avoiding class imbalance issues
- Feature Diversity: Contains numerical, categorical, and binary features
- Practical Significance: All features have clear clinical significance and medical interpretability
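The class-balance claim above is easy to verify once the `cardio` column is loaded. The following is a small sketch using made-up labels; `class_balance` is a hypothetical helper name, not part of any dataset tooling.

```python
from collections import Counter


def class_balance(labels):
    """Return (counts, positive_share) for a sequence of 0/1 cardio labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    positive_share = counts[1] / total if total else float("nan")
    return counts, positive_share


# Made-up labels for illustration; real values would come from the cardio column.
sample_labels = [0, 1, 1, 0, 1, 0, 0, 1]
counts, share = class_balance(sample_labels)
print(dict(counts), share)
```

A positive share close to 0.5 corresponds to the near 1:1 ratio described in the listing.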
Data Quality
- A small share of values is missing (<2%), which can be handled by imputation
- Some continuous variables (e.g., blood pressure) contain physiologically plausible outliers
- Some features are correlated; for example, systolic and diastolic blood pressure are highly correlated
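The three data-quality points above (missing values, blood pressure outliers, correlated features) can be sketched in a few lines of standard-library Python. The sample values are made up, the helper names are hypothetical, and the blood pressure ranges are illustrative assumptions rather than clinical thresholds or values from the source.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_bp_outliers(ap_hi, ap_lo, hi_range=(60, 250), lo_range=(30, 150)):
    """Flag records whose pressures fall outside the given ranges, or whose
    diastolic value is not below the systolic value.
    The default ranges are illustrative assumptions, not clinical guidance."""
    flags = []
    for hi, lo in zip(ap_hi, ap_lo):
        in_range = (hi_range[0] <= hi <= hi_range[1]
                    and lo_range[0] <= lo <= lo_range[1])
        flags.append(not in_range or lo >= hi)
    return flags


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


# Made-up readings for illustration; None marks a missing value.
ap_hi = [120, None, 140, 110, 16020, 130]
ap_lo = [80, 90, None, 70, 80, 85]

ap_hi_filled = impute_median(ap_hi)
ap_lo_filled = impute_median(ap_lo)
flags = flag_bp_outliers(ap_hi_filled, ap_lo_filled)

# Keep only unflagged records before measuring the correlation.
kept = [(h, l) for h, l, bad in zip(ap_hi_filled, ap_lo_filled, flags) if not bad]
r = pearson([h for h, _ in kept], [l for _, l in kept])
print(flags, round(r, 3))
```

The median is used for imputation here because it is less sensitive than the mean to extreme readings such as the made-up 16020 entry.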
Application Value
This dataset is highly suitable for:
- Development and comparison of binary classification prediction models
- Feature importance analysis and interpretable AI research
- Construction of medical risk prediction models
- Case studies of machine learning applications in healthcare
Challenging Tasks
- Predicting the risk probability of an individual developing cardiovascular disease
- Identifying the most important risk factors
- Developing accurate and interpretable prediction models
- Handling common outliers and missing values in medical data
This dataset has become one of the commonly used benchmark datasets in machine learning competitions and academic research due to its moderate scale, rich features, and clear practical significance.
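To make the risk-probability task listed above concrete, here is a minimal sketch of logistic regression fitted by batch gradient descent in plain Python. It is one possible baseline, not a method prescribed by the dataset's authors. The training rows are made up, only two of the listed features (`age`, `ap_hi`) are used, and all function names and hyperparameters are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_scaler(rows):
    """Return per-column means and standard deviations."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = []
    for c, m in zip(cols, means):
        s = (sum((v - m) ** 2 for v in c) / len(c)) ** 0.5
        stds.append(s if s > 0 else 1.0)  # guard against constant columns
    return means, stds


def scale(row, means, stds):
    """Standardize one row with previously fitted means and deviations."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Predicted probability of cardio == 1 for one standardized row."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Made-up training rows [age in years, ap_hi in mmHg] and cardio labels.
rows = [[40, 110], [45, 120], [50, 115], [55, 150],
        [60, 140], [62, 160], [48, 145], [58, 118]]
labels = [0, 0, 0, 1, 1, 1, 1, 0]

means, stds = fit_scaler(rows)
X = [scale(r, means, stds) for r in rows]
w, b = fit_logistic(X, labels)

p_high = predict_proba(scale([62, 160], means, stds), w, b)
p_low = predict_proba(scale([40, 110], means, stds), w, b)
print(f"weights={w}, bias={b:.3f}")
print(f"p_high={p_high:.3f}, p_low={p_low:.3f}")
```

Because the inputs are standardized, the magnitudes of the fitted weights give a rough first ranking of the features, which touches on the risk-factor identification task; on the real data a held-out evaluation and a more careful importance analysis would be needed.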
Provider:
Alibaba Cloud Tianchi (阿里云天池)
Created:
2025-10-14
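Collected Summary
Dataset Introduction

Background and Challenges
Background Overview
This dataset contains 70,000 patient medical records with 12 variables (11 input features and one target) covering demographics, clinical indicators, and lifestyle factors related to cardiovascular health, and is suited to cardiovascular disease risk prediction research. Its moderate size, varied features, and balanced classes make it suitable for machine learning modeling and applied research in healthcare.
The above content was collected and summarized by 遇见数据集.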



