electricsheepafrica/heart-disease-cleveland

Name: electricsheepafrica/heart-disease-cleveland
Creator: electricsheepafrica
Published: 2026-04-08 23:20:26
License: No description available

Hugging Face. Updated 2026-04-08; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/electricsheepafrica/heart-disease-cleveland

Description:

---
license: unknown
task_categories:
- tabular-classification
tags:
- healthcare
- heart-disease
- cardiology
- clinical
- binary-classification
- medical
pretty_name: Heart Disease Dataset (Cleveland UCI)
size_categories:
- n<1K
---

# Heart Disease Dataset (Cleveland UCI)

A cleaned and enriched version of the classic **UCI Cleveland Heart Disease Dataset**, one of the most widely used benchmarks in medical machine learning research. This dataset contains clinical and diagnostic measurements from **302 patients**, used to predict the presence or absence of heart disease.

## Dataset Summary

| Property | Value |
|---|---|
| Source | UCI Machine Learning Repository (Cleveland Clinic Foundation) |
| Kaggle Source | [johnsmith88/heart-disease-dataset](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset) |
| Task | Binary Classification |
| Samples | 302 (deduplicated from 1,025 raw rows) |
| Features | 13 input features + 1 target |
| Target | `target`: 0 = No Disease, 1 = Heart Disease |
| Missing Values | None (anomalous encodings flagged) |
| License | Unknown (original UCI data is publicly available for research) |

---

## Background

The original Cleveland Heart Disease data was collected at the **Cleveland Clinic Foundation** and donated to the UCI ML Repository by David W. Aha in 1988. It has been used in hundreds of peer-reviewed publications on cardiovascular risk prediction, explainable AI, and clinical decision support.

The Kaggle version of this dataset contained 1,025 rows: the original 302 unique records plus **723 duplicate rows**. This release removes all duplicates and restores the dataset to its canonical 302-sample form, with added human-readable label columns for each categorical feature.

---

## Cleaning & Processing

The following steps were applied to produce `heart_disease_cleaned.csv`:

1. **Deduplication**: Removed 723 duplicate rows, restoring the 302-sample Cleveland dataset.
2. **Anomaly flagging**: Two rows with `thal=0` and four rows with `ca=4` were flagged (boolean columns `thal_flag` and `ca_flag`). These values fall outside the standard Cleveland encoding and may represent missing data in the original collection. They are retained with flags rather than silently dropped.
3. **Label enrichment**: Human-readable label columns were added for all categorical features (see schema below).
4. **Column reordering**: Columns are arranged semantically: demographics → cardiac symptoms → test results → target.

---

## Schema

### Numeric / Encoded Features

| Column | Type | Description |
|---|---|---|
| `age` | int | Age of the patient in years |
| `sex` | int | Sex: 0 = Female, 1 = Male |
| `cp` | int | Chest pain type: 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-Anginal Pain, 3 = Asymptomatic |
| `trestbps` | int | Resting blood pressure (mm Hg) on admission |
| `chol` | int | Serum cholesterol (mg/dl) |
| `fbs` | int | Fasting blood sugar > 120 mg/dl: 0 = No, 1 = Yes |
| `restecg` | int | Resting ECG results: 0 = Normal, 1 = ST-T Wave Abnormality, 2 = Left Ventricular Hypertrophy |
| `thalach` | int | Maximum heart rate achieved (bpm) |
| `exang` | int | Exercise-induced angina: 0 = No, 1 = Yes |
| `oldpeak` | float | ST depression induced by exercise relative to rest |
| `slope` | int | Slope of peak exercise ST segment: 0 = Upsloping, 1 = Flat, 2 = Downsloping |
| `ca` | int | Number of major vessels colored by fluoroscopy (0–3); value 4 flagged as anomalous |
| `thal` | int | Thalassemia: 1 = Normal, 2 = Fixed Defect, 3 = Reversible Defect; value 0 flagged as anomalous |
| `target` | int | **Diagnosis of heart disease: 0 = No Disease, 1 = Heart Disease** |

### Added Label Columns

| Column | Description |
|---|---|
| `sex_label` | Human-readable sex label |
| `cp_label` | Human-readable chest pain type |
| `fbs_label` | Human-readable fasting blood sugar |
| `restecg_label` | Human-readable resting ECG result |
| `exang_label` | Human-readable exercise-induced angina |
| `slope_label` | Human-readable ST slope |
| `thal_label` | Human-readable thalassemia result |
| `target_label` | Human-readable diagnosis |
| `thal_flag` | `True` if `thal=0` (encoding anomaly) |
| `ca_flag` | `True` if `ca=4` (encoding anomaly) |

---

## Class Distribution

| Class | Count | Percentage |
|---|---|---|
| Heart Disease (1) | 164 | 54.3% |
| No Disease (0) | 138 | 45.7% |

The classes are approximately balanced, so standard binary classification methods can be applied without aggressive resampling.

---

## Patient Demographics

| Feature | Value |
|---|---|
| Age range | 29 – 77 years |
| Mean age | 54.4 years |
| Median age | 55.5 years |
| Male patients | 206 (68.2%) |
| Female patients | 96 (31.8%) |

---

## Usage

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub and convert the train split to pandas
ds = load_dataset("electricsheepafrica/heart-disease-cleveland")
df = ds["train"].to_pandas()
print(df["target_label"].value_counts())
```

Or with pandas directly:

```python
import pandas as pd

df = pd.read_csv("heart_disease_cleaned.csv")

# Features and target
X = df[["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal"]]
y = df["target"]
```

---

## Example: Quick Baseline

```python
from datasets import load_dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Load the single "train" split and convert it to a pandas DataFrame
ds = load_dataset("electricsheepafrica/heart-disease-cleveland", split="train")
df = ds.to_pandas()

# Use only the 13 encoded input features; label and flag columns are excluded
feature_cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
X = df[feature_cols]
y = df["target"]

# 5-fold cross-validated accuracy of a random forest baseline
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

---

## Known Issues & Limitations

- **Encoding anomalies**: 2 rows have `thal=0` and 4 rows have `ca=4`, which are outside the standard Cleveland encoding range. These are flagged with the `thal_flag` and `ca_flag` boolean columns.
- **Small sample size**: 302 samples may limit generalization. Consider augmentation or transfer learning for production clinical models.
- **Single-site data**: All patient records were collected at the Cleveland Clinic Foundation. External validity for other populations should be evaluated carefully.
- **No temporal data**: This is a cross-sectional snapshot; no longitudinal follow-up is included.

---

## Citation

If you use this dataset, please cite the original UCI source:

```
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64(5), 304-310.
```

```
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml
```

---

## Collection

This dataset is part of the **Electric Sheep Africa: Healthcare Collection**, a curated set of medical and clinical datasets for research and education.

Explore the full collection: [electricsheepafrica](https://huggingface.co/electricsheepafrica)
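
---

## Example: Handling Flagged Encodings

The `thal_flag` and `ca_flag` columns mark values that may stand for missing data, so one option is to treat them as missing before modeling. The sketch below is a minimal illustration of that idea, not part of the published cleaning pipeline. The helper name `mask_flagged_encodings` is hypothetical, the toy frame is hand-made rather than drawn from the dataset, and the code assumes the flag columns load as booleans.

```python
import pandas as pd


def mask_flagged_encodings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which flagged `thal` and `ca` values are set to NaN.

    Hypothetical helper; assumes `thal_flag` and `ca_flag` are boolean columns.
    """
    out = df.copy()
    # Cast to float so the columns can hold NaN, then blank out flagged rows.
    out["thal"] = out["thal"].astype("float64").mask(out["thal_flag"])
    out["ca"] = out["ca"].astype("float64").mask(out["ca_flag"])
    return out


# Toy frame for illustration only; these are not rows from the dataset.
toy = pd.DataFrame(
    {
        "thal": [1, 0, 3, 2],
        "ca": [0, 2, 4, 1],
        "thal_flag": [False, True, False, False],
        "ca_flag": [False, False, True, False],
    }
)

masked = mask_flagged_encodings(toy)
# Count how many values were converted to missing in each column.
print(masked[["thal", "ca"]].isna().sum())
```

Whether to drop, impute, or keep the flagged rows is a modeling decision. With only six affected rows out of 302, comparing results with and without them is a cheap sanity check.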

Provider:

electricsheepafrica
