
aiacademy-kg/kg_health_dataset

Hugging Face · Updated 2026-03-12 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/aiacademy-kg/kg_health_dataset
Resource description:

---
dataset_info:
  features:
  - name: patient_id
    dtype: large_string
  - name: region
    dtype: large_string
  - name: district
    dtype: large_string
  - name: gender
    dtype: large_string
  - name: age
    dtype: int64
  - name: income_monthly
    dtype: int64
  - name: hospital_type
    dtype: large_string
  - name: wait_time_days
    dtype: int64
  - name: treatment_cost_usd
    dtype: int64
  - name: doctor_visits_year
    dtype: int64
  - name: diagnosis
    dtype: large_string
  - name: has_insurance
    dtype: bool
  - name: satisfaction
    dtype: int64
  splits:
  - name: train
    num_bytes: 71267261
    num_examples: 400000
  download_size: 9834228
  dataset_size: 71267261
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- ru
- ky
- en
tags:
- synthetic
- healthcare
- education
- statistics
- kyrgyzstan
size_categories:
- 100K<n<1M
---

# 🏥 KG Health 2023 — Synthetic Healthcare Dataset

> [!WARNING]
> **🇬🇧 SYNTHETIC DATA — FOR EDUCATIONAL PURPOSES ONLY**
> This dataset is **entirely synthetic** and was generated programmatically. It does **not** represent real patients, real medical records, or real statistics of any kind. Any resemblance to actual persons, institutions, or events is coincidental. Do not use for medical, policy, or research decisions.
>
> **🇷🇺 СИНТЕТИЧЕСКИЕ ДАННЫЕ — ТОЛЬКО ДЛЯ ОБУЧЕНИЯ**
> Этот датасет является **полностью синтетическим** и был сгенерирован программно. Он **не отражает** реальных пациентов, медицинских записей или статистики. Любое сходство с реальными людьми, учреждениями или событиями случайно. Не использовать для медицинских, политических или исследовательских решений.
>
> **🇰🇬 СИНТЕТИКАЛЫК МААЛЫМАТТАР — ТЕК ОКУУ МАКСАТЫНДА**
> Бул датасет **толугу менен синтетикалык** болуп саналат жана программалык жол менен түзүлгөн. Ал реалдуу бейтаптарды, медициналык жазууларды же эч кандай статистиканы **чагылдырбайт**. Реалдуу адамдарга, мекемелерге же окуяларга окшоштук кокустан болуп саналат.

---

## Dataset Summary

A synthetic dataset of **400,000 patient records** simulating a healthcare registry in Kyrgyzstan. It was created for teaching statistics and exploratory data analysis, specifically to practice concepts such as sampling, central tendency, categorical encoding, and data cleaning on a realistic, messy dataset with intentional anomalies.

The dataset contains dirty data, outliers, encoding errors, and hidden group patterns, which students are expected to discover and interpret.

---

## Dataset Structure

| Column | Description | Type |
|---|---|---|
| `patient_id` | Unique patient identifier (`KG-XXXXXX`) | string |
| `region` | Administrative region of Kyrgyzstan (9 regions) | nominal categorical |
| `district` | District within the region (36 districts) | nominal categorical |
| `gender` | Patient gender; contains intentional dirty values (~2%) | nominal categorical |
| `age` | Patient age in years; contains intentional outliers (~0.5%) | integer |
| `income_monthly` | Monthly income in Kyrgyzstani som; heavy right skew | integer |
| `hospital_type` | Type of medical facility: `государственная`, `частная`, `НКО/миссия` | nominal categorical |
| `wait_time_days` | Days waited before receiving medical care | integer |
| `treatment_cost_usd` | Cost of treatment in USD | integer |
| `doctor_visits_year` | Number of doctor visits in the past year | integer |
| `diagnosis` | Diagnosis category: `профилактика`, `острое`, `хроническое`, `травма`, `неотложное` | ordinal categorical |
| `has_insurance` | Whether the patient has health insurance | boolean |
| `satisfaction` | Patient satisfaction score, 1–5 | ordinal integer |

---

## Regions Included

`Бишкек` · `Чуй` · `Ош (город)` · `Ошская обл.` · `Джалал-Абад` · `Иссык-Куль` · `Нарын` · `Талас` · `Баткен`

---

## Key Properties

- **400,000 rows**, 13 columns
- Realistic skewed income distribution (lognormal with heavy right tail)
- Gender pay gap embedded in income generation (~24%)
- Regional disparity in wait times (up to 13× difference between regions)
- Intentional dirty data in `gender` (10 variants of the same value) and `age` (impossible values: 999, 150, -1)
- Hidden anomalies for students to discover through analysis
- Diagnosis probabilities vary by age group and gender

---

## License

CC0 1.0 — Public Domain. Free to use for any educational purpose.
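
As one possible starting point for the cleaning exercises the card describes, the sketch below uses only the Python standard library on a handful of toy rows. The alias table `GENDER_ALIASES`, the 0 to 120 age range, and the lognormal parameters are assumptions made for illustration: the card does not list the actual dirty `gender` variants or the generator's settings, so those would need to be discovered from the data itself.

```python
import random
import statistics
from collections import Counter

# Assumed plausible age range; the card only names 999, 150 and -1 as impossible.
MIN_AGE, MAX_AGE = 0, 120

# Hypothetical alias table: the card does not list the real dirty variants,
# so inspect the Counter output on the actual column and extend this by hand.
GENDER_ALIASES = {
    "m": "male", "male": "male", "м": "male", "муж": "male",
    "f": "female", "female": "female", "ж": "female", "жен": "female",
}


def normalize_gender(raw):
    """Map a raw label to a canonical one, or None if it is not recognized."""
    key = raw.strip().casefold().rstrip(".")
    return GENDER_ALIASES.get(key)


def is_plausible_age(age):
    """True when the age falls inside the assumed valid range."""
    return MIN_AGE <= age <= MAX_AGE


# Toy records standing in for rows of the real dataset.
rows = [
    {"gender": " M ", "age": 34},
    {"gender": "жен.", "age": 999},
    {"gender": "Female", "age": -1},
    {"gender": "муж", "age": 150},
    {"gender": "f", "age": 61},
]

# Step 1: count raw variants to see how dirty the column is.
raw_counts = Counter(row["gender"] for row in rows)

# Step 2: normalize labels and drop rows with impossible ages.
clean = [
    {"gender": normalize_gender(row["gender"]), "age": row["age"]}
    for row in rows
    if is_plausible_age(row["age"])
]

# Step 3: compare mean and median on a right-skewed (lognormal) sample.
# The parameters 10.0 and 1.0 are arbitrary, not the dataset's real ones.
rng = random.Random(0)
incomes = [rng.lognormvariate(10.0, 1.0) for _ in range(10_000)]
mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(raw_counts.most_common())
print(clean)
print(mean_income > median_income)
```

On the real data the same three steps would run over the `gender`, `age`, and `income_monthly` columns. A mean that sits well above the median is the usual sign of a heavy right tail, which is why the median is often the safer summary for income.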
Provider:
aiacademy-kg