aiacademy-kg/kg_health_dataset
Source: Hugging Face · Updated 2026-03-12 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/aiacademy-kg/kg_health_dataset
Description:
---
dataset_info:
  features:
  - name: patient_id
    dtype: large_string
  - name: region
    dtype: large_string
  - name: district
    dtype: large_string
  - name: gender
    dtype: large_string
  - name: age
    dtype: int64
  - name: income_monthly
    dtype: int64
  - name: hospital_type
    dtype: large_string
  - name: wait_time_days
    dtype: int64
  - name: treatment_cost_usd
    dtype: int64
  - name: doctor_visits_year
    dtype: int64
  - name: diagnosis
    dtype: large_string
  - name: has_insurance
    dtype: bool
  - name: satisfaction
    dtype: int64
  splits:
  - name: train
    num_bytes: 71267261
    num_examples: 400000
  download_size: 9834228
  dataset_size: 71267261
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- ru
- ky
- en
tags:
- synthetic
- healthcare
- education
- statistics
- kyrgyzstan
size_categories:
- 100K<n<1M
---
# 🏥 KG Health 2023 — Synthetic Healthcare Dataset
> [!WARNING]
> **🇬🇧 SYNTHETIC DATA — FOR EDUCATIONAL PURPOSES ONLY**
> This dataset is **entirely synthetic** and was generated programmatically. It does **not** represent real patients, real medical records, or real statistics of any kind. Any resemblance to actual persons, institutions, or events is coincidental. Do not use for medical, policy, or research decisions.
>
> **🇷🇺 СИНТЕТИЧЕСКИЕ ДАННЫЕ — ТОЛЬКО ДЛЯ ОБУЧЕНИЯ**
> Этот датасет является **полностью синтетическим** и был сгенерирован программно. Он **не отражает** реальных пациентов, медицинских записей или статистики. Любое сходство с реальными людьми, учреждениями или событиями случайно. Не использовать для медицинских, политических или исследовательских решений.
>
> **🇰🇬 СИНТЕТИКАЛЫК МААЛЫМАТТАР — ТЕК ОКУУ МАКСАТЫНДА**
> Бул датасет **толугу менен синтетикалык** болуп саналат жана программалык жол менен түзүлгөн. Ал реалдуу бейтаптарды, медициналык жазууларды же эч кандай статистиканы **чагылдырбайт**. Реалдуу адамдарга, мекемелерге же окуяларга окшоштук кокустан болуп саналат.
---
## Dataset Summary
A synthetic dataset of **400,000 patient records** simulating a healthcare registry in Kyrgyzstan. It was created for teaching statistics and exploratory data analysis — specifically to practice concepts such as sampling, central tendency, categorical encoding, and data cleaning on a realistic, messy dataset with intentional anomalies.
The dataset contains dirty data, outliers, encoding errors, and hidden group patterns, which students are expected to discover and interpret.
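One way to start looking for such group patterns is to compare the mean and the median of a numeric column within each group: a large gap between the two usually points to skew or outliers. The sketch below uses only the Python standard library and a tiny invented sample; the helper name `summarize_by_group` and all numbers are illustrative assumptions, not values from the dataset.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_group(records, group_key, value_key):
    """Return {group: (mean, median, count)} of value_key per group_key."""
    groups = defaultdict(list)
    for row in records:
        groups[row[group_key]].append(row[value_key])
    return {g: (mean(v), median(v), len(v)) for g, v in groups.items()}


# Invented rows for illustration only; these are NOT real dataset values.
sample = [
    {"region": "A", "income_monthly": 10_000},
    {"region": "A", "income_monthly": 12_000},
    {"region": "A", "income_monthly": 200_000},  # one extreme value
    {"region": "B", "income_monthly": 15_000},
    {"region": "B", "income_monthly": 17_000},
]

summary = summarize_by_group(sample, "region", "income_monthly")
for group, (avg, med, n) in sorted(summary.items()):
    # A mean far above the median suggests a right-skewed group.
    print(f"{group}: mean={avg:.0f} median={med:.0f} n={n}")
```

With pandas, the same summary could be obtained with `df.groupby("region")["income_monthly"].agg(["mean", "median", "count"])`.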
---
## Dataset Structure
| Column | Description | Type |
|---|---|---|
| `patient_id` | Unique patient identifier (`KG-XXXXXX`) | string |
| `region` | Administrative region of Kyrgyzstan (9 values: 7 regions plus the cities of Bishkek and Osh) | nominal categorical |
| `district` | District within the region (36 districts) | nominal categorical |
| `gender` | Patient gender — contains intentional dirty values (~2%) | nominal categorical |
| `age` | Patient age in years — contains intentional outliers (~0.5%) | integer |
| `income_monthly` | Monthly income in Kyrgyzstani som — heavy right skew | integer |
| `hospital_type` | Type of medical facility: `государственная` (public), `частная` (private), `НКО/миссия` (NGO/mission) | nominal categorical |
| `wait_time_days` | Days waited before receiving medical care | integer |
| `treatment_cost_usd` | Cost of treatment in USD | integer |
| `doctor_visits_year` | Number of doctor visits in the past year | integer |
| `diagnosis` | Diagnosis category: `профилактика` (preventive care), `острое` (acute), `хроническое` (chronic), `травма` (injury), `неотложное` (emergency) | ordinal categorical |
| `has_insurance` | Whether the patient has health insurance | boolean |
| `satisfaction` | Patient satisfaction score, 1–5 | ordinal integer |
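As a minimal sketch, the train split can probably be loaded with the Hugging Face `datasets` library and checked against the columns listed above. The repository id is taken from the download link; the helper names `load_train_split` and `missing_columns` are hypothetical, and the loader assumes network access and an installed `datasets` package.

```python
REPO_ID = "aiacademy-kg/kg_health_dataset"

# Column names and dtypes as declared in the card's metadata.
EXPECTED_SCHEMA = {
    "patient_id": "large_string",
    "region": "large_string",
    "district": "large_string",
    "gender": "large_string",
    "age": "int64",
    "income_monthly": "int64",
    "hospital_type": "large_string",
    "wait_time_days": "int64",
    "treatment_cost_usd": "int64",
    "doctor_visits_year": "int64",
    "diagnosis": "large_string",
    "has_insurance": "bool",
    "satisfaction": "int64",
}


def load_train_split():
    """Download the train split (needs network and `pip install datasets`)."""
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset(REPO_ID, split="train")


def missing_columns(column_names):
    """Return the expected columns that are absent from column_names."""
    present = set(column_names)
    return [name for name in EXPECTED_SCHEMA if name not in present]
```

A typical session would call `ds = load_train_split()`, confirm that `missing_columns(ds.column_names)` is empty, and then convert to a DataFrame with `ds.to_pandas()` for analysis.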
---
## Regions Included
`Бишкек` · `Чуй` · `Ош (город)` · `Ошская обл.` · `Джалал-Абад` · `Иссык-Куль` · `Нарын` · `Талас` · `Баткен`
---
## Key Properties
- **400,000 rows**, 13 columns
- Realistic skewed income distribution (lognormal with heavy right tail)
- Gender pay gap embedded in income generation (~24%)
- Regional disparity in wait times (up to 13× difference between regions)
- Intentional dirty data in `gender` (10 variants of the same value) and `age` (impossible values: 999, 150, -1)
- Hidden anomalies for students to discover through analysis
- Diagnosis probabilities vary by age group and gender
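The dirty `age` and `gender` values described above could be handled along the following lines. This is a sketch under stated assumptions: the plausible age range of 0 to 120 is a choice made here, and the alias table is hypothetical, because the card does not list the actual spelling variants. Students would build the real table from the distinct values they find in the column.

```python
# Assumption: ages outside 0..120 are treated as invalid. This catches the
# impossible values named in the card (999, 150, -1).
MIN_AGE, MAX_AGE = 0, 120

# Hypothetical alias table; the real variants must be read from the data.
GENDER_ALIASES = {
    "m": "male", "male": "male", "м": "male", "муж": "male",
    "f": "female", "female": "female", "ж": "female", "жен": "female",
}


def clean_age(age):
    """Return age if it lies in the plausible range, else None."""
    return age if MIN_AGE <= age <= MAX_AGE else None


def clean_gender(raw):
    """Map a raw spelling to 'male'/'female', or None if unrecognized."""
    if raw is None:
        return None
    key = raw.strip().lower().rstrip(".")
    return GENDER_ALIASES.get(key)


# Invented rows for illustration only; these are NOT real dataset values.
rows = [
    {"age": 34, "gender": " М "},
    {"age": 999, "gender": "female"},
    {"age": -1, "gender": "Жен."},
    {"age": 150, "gender": "???"},
]
cleaned = [
    {"age": clean_age(r["age"]), "gender": clean_gender(r["gender"])}
    for r in rows
]
bad_age_share = sum(c["age"] is None for c in cleaned) / len(cleaned)
print(cleaned, bad_age_share)
```

Returning `None` rather than dropping rows keeps the flagged records visible, so the share of dirty values can be measured before deciding how to treat them.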
---
## License
CC0 1.0 — Public Domain. Free to use for any educational purpose.
Provider:
aiacademy-kg



