electricsheepafrica/heart-disease-cleveland
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/heart-disease-cleveland
Description:
---
license: unknown
task_categories:
- tabular-classification
tags:
- healthcare
- heart-disease
- cardiology
- clinical
- binary-classification
- medical
pretty_name: Heart Disease Dataset (Cleveland UCI)
size_categories:
- n<1K
---
# Heart Disease Dataset (Cleveland UCI)
A cleaned and enriched version of the classic **UCI Cleveland Heart Disease Dataset**, one of the most widely used benchmarks in medical machine learning research. It contains clinical and diagnostic measurements for **302 patients**, intended for predicting the presence or absence of heart disease.
## Dataset Summary
| Property | Value |
|---|---|
| Source | UCI Machine Learning Repository — Cleveland Clinic Foundation |
| Kaggle Source | [johnsmith88/heart-disease-dataset](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset) |
| Task | Binary Classification |
| Samples | 302 (deduplicated from 1,025 raw rows) |
| Features | 13 input features + 1 target |
| Target | `target` — 0 = No Disease, 1 = Heart Disease |
| Missing Values | None (anomalous encodings flagged) |
| License | Unknown (original UCI data is publicly available for research) |
---
## Background
The original Cleveland Heart Disease data was collected at the **Cleveland Clinic Foundation** and donated to the UCI ML Repository by David W. Aha in 1988. It has been used in hundreds of peer-reviewed publications on cardiovascular risk prediction, explainable AI, and clinical decision support.
The Kaggle version of this dataset contained 1,025 rows: 302 unique records inflated with **723 duplicate rows**. This release removes all duplicates, leaving the 302 unique records, and adds a human-readable label column for each categorical feature.
---
## Cleaning & Processing
The following steps were applied to produce `heart_disease_cleaned.csv`:
1. **Deduplication** — Removed 723 duplicate rows, restoring the 302-sample Cleveland dataset.
2. **Anomaly flagging** — Two rows with `thal=0` and four rows with `ca=4` were flagged (boolean columns `thal_flag` and `ca_flag`). These values fall outside the standard Cleveland encoding and may represent missing data in the original collection. They are retained with flags rather than silently dropped.
3. **Label enrichment** — Human-readable label columns were added for all categorical features (see schema below).
4. **Column reordering** — Columns are arranged semantically: demographics → cardiac symptoms → test results → target.
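Steps 1 and 2 above can be sketched in pandas as follows. This is an illustrative reconstruction, not the script used to build the release: the helper name `clean_heart_frame` and the toy input rows are assumptions, while the column names and flag rules follow the description above.

```python
import pandas as pd


def clean_heart_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate rows and flag out-of-range `thal` / `ca` codes.

    Hypothetical helper: a sketch of steps 1 and 2, not the original script.
    """
    # Step 1: drop exact duplicate rows, keeping the first occurrence.
    cleaned = raw.drop_duplicates().reset_index(drop=True)
    # Step 2: flag anomalous encodings instead of dropping the rows.
    cleaned["thal_flag"] = cleaned["thal"].eq(0)
    cleaned["ca_flag"] = cleaned["ca"].eq(4)
    return cleaned


# Toy input (made-up values): rows 0 and 1 are identical, row 2 has ca=4,
# and row 3 has thal=0.
raw = pd.DataFrame(
    {
        "age": [63, 63, 58, 45],
        "ca": [0, 0, 4, 1],
        "thal": [2, 2, 3, 0],
        "target": [1, 1, 0, 1],
    }
)
cleaned = clean_heart_frame(raw)
print(cleaned)
```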
---
## Schema
### Numeric / Encoded Features
| Column | Type | Description |
|---|---|---|
| `age` | int | Age of the patient in years |
| `sex` | int | Sex: 0 = Female, 1 = Male |
| `cp` | int | Chest pain type: 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-Anginal Pain, 3 = Asymptomatic |
| `trestbps` | int | Resting blood pressure (mm Hg) on admission |
| `chol` | int | Serum cholesterol (mg/dl) |
| `fbs` | int | Fasting blood sugar > 120 mg/dl: 0 = No, 1 = Yes |
| `restecg` | int | Resting ECG results: 0 = Normal, 1 = ST-T Wave Abnormality, 2 = Left Ventricular Hypertrophy |
| `thalach` | int | Maximum heart rate achieved (bpm) |
| `exang` | int | Exercise-induced angina: 0 = No, 1 = Yes |
| `oldpeak` | float | ST depression induced by exercise relative to rest |
| `slope` | int | Slope of peak exercise ST segment: 0 = Upsloping, 1 = Flat, 2 = Downsloping |
| `ca` | int | Number of major vessels colored by fluoroscopy (0–3); value 4 flagged as anomalous |
| `thal` | int | Thallium stress test result, often labeled "thalassemia" in copies of this dataset: 1 = Normal, 2 = Fixed Defect, 3 = Reversible Defect; value 0 flagged as anomalous |
| `target` | int | **Diagnosis of heart disease: 0 = No Disease, 1 = Heart Disease** |
### Added Label Columns
| Column | Description |
|---|---|
| `sex_label` | Human-readable sex label |
| `cp_label` | Human-readable chest pain type |
| `fbs_label` | Human-readable fasting blood sugar |
| `restecg_label` | Human-readable resting ECG result |
| `exang_label` | Human-readable exercise-induced angina |
| `slope_label` | Human-readable ST slope |
| `thal_label` | Human-readable thalassemia result |
| `target_label` | Human-readable diagnosis |
| `thal_flag` | `True` if `thal=0` (encoding anomaly) |
| `ca_flag` | `True` if `ca=4` (encoding anomaly) |
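
A label column can be derived from its encoded counterpart with a plain dictionary lookup. The sketch below assumes the label strings match the wording in the schema table; the exact strings stored in the released file may differ, and the mapping names `SEX_LABELS` and `CP_LABELS` are illustrative. Codes missing from a mapping become `NaN` under `Series.map`, which makes unexpected values easy to spot.

```python
import pandas as pd

# Code-to-label mappings taken from the schema table above.
# The exact label strings in the released CSV are an assumption.
SEX_LABELS = {0: "Female", 1: "Male"}
CP_LABELS = {
    0: "Typical Angina",
    1: "Atypical Angina",
    2: "Non-Anginal Pain",
    3: "Asymptomatic",
}

# Toy rows with made-up values.
df = pd.DataFrame({"sex": [1, 0, 1], "cp": [0, 2, 3]})
df["sex_label"] = df["sex"].map(SEX_LABELS)
df["cp_label"] = df["cp"].map(CP_LABELS)
print(df)
```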
---
## Class Distribution
| Class | Count | Percentage |
|---|---|---|
| Heart Disease (1) | 164 | 54.3% |
| No Disease (0) | 138 | 45.7% |
The dataset is approximately balanced, making it suitable for standard binary classification without aggressive resampling.
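
As a quick check on the table, the percentages and the accuracy of a trivial majority-class predictor can be recomputed from the counts alone. The snippet uses only the two counts listed above; the variable names are illustrative. Any model evaluated on this data should be compared against that trivial baseline rather than against 50%.

```python
# Class counts from the table above.
counts = {"Heart Disease (1)": 164, "No Disease (0)": 138}
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Accuracy of a classifier that always predicts the majority class.
majority_baseline = max(counts.values()) / total
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```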
---
## Patient Demographics
| Feature | Value |
|---|---|
| Age range | 29 – 77 years |
| Mean age | 54.4 years |
| Median age | 55.5 years |
| Male patients | 206 (68.2%) |
| Female patients | 96 (31.8%) |
---
## Usage
```python
from datasets import load_dataset
ds = load_dataset("electricsheepafrica/heart-disease-cleveland")
df = ds["train"].to_pandas()
print(df["target_label"].value_counts())
```
Or with pandas directly:
```python
import pandas as pd

df = pd.read_csv("heart_disease_cleaned.csv")

# Features and target
X = df[["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal"]]
y = df["target"]
```
---
## Example: Quick Baseline
```python
from datasets import load_dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Load the "train" split and convert it to a pandas DataFrame.
ds = load_dataset("electricsheepafrica/heart-disease-cleveland", split="train")
df = ds.to_pandas()

# Use only the 13 encoded input features; the *_label and *_flag columns are left out.
feature_cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
X = df[feature_cols]
y = df["target"]

# 5-fold cross-validation; scikit-learn stratifies the folds for classifiers.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```
---
## Known Issues & Limitations
- **Encoding anomalies**: 2 rows have `thal=0` and 4 rows have `ca=4`, which are outside the standard Cleveland encoding range. These are flagged with `thal_flag` and `ca_flag` boolean columns.
- **Small sample size**: 302 samples may limit generalization. Consider augmentation or transfer learning for production clinical models.
- **Single-site data**: All records come from patients at the Cleveland Clinic Foundation. External validity to other populations should be evaluated carefully.
- **No temporal data**: This is a cross-sectional snapshot; no longitudinal follow-up is included.
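
The flag columns make it straightforward to test how sensitive a result is to the anomalous rows. The sketch below shows two common treatments on a toy frame with made-up values: dropping flagged rows, or keeping them and treating the anomalous codes as missing. Neither is prescribed by this dataset; on the real data, the first option would remove at most 6 rows.

```python
import pandas as pd

# Toy frame with the flag columns described above (values are made up).
df = pd.DataFrame(
    {
        "ca": [0, 4, 2, 1],
        "thal": [2, 3, 0, 1],
        "ca_flag": [False, True, False, False],
        "thal_flag": [False, False, True, False],
        "target": [1, 0, 1, 0],
    }
)

flagged = df["ca_flag"] | df["thal_flag"]

# Option A: drop every row that carries either flag.
df_strict = df.loc[~flagged].reset_index(drop=True)

# Option B: keep the rows but replace the anomalous codes with NaN.
df_masked = df.copy()
df_masked["ca"] = df_masked["ca"].astype("float64").mask(df_masked["ca_flag"])
df_masked["thal"] = df_masked["thal"].astype("float64").mask(df_masked["thal_flag"])

print(f"all rows: {len(df)}, unflagged rows: {len(df_strict)}")
print(df_masked[["ca", "thal"]])
```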
---
## Citation
If you use this dataset, please cite the original UCI source:
```
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989).
International application of a new probability algorithm for the diagnosis of coronary artery disease.
American Journal of Cardiology, 64(5), 304-310.
```
```
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California,
School of Information and Computer Science. http://archive.ics.uci.edu/ml
```
---
## Collection
This dataset is part of the **Electric Sheep Africa — Healthcare Collection**, a curated set of medical and clinical datasets for research and education.
Explore the full collection: [electricsheepafrica](https://huggingface.co/electricsheepafrica)
Provider:
electricsheepafrica



