singhankit16/ICD-10-LLM-generated-Synthetic-Circulatory-System-I00-I99
Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/singhankit16/ICD-10-LLM-generated-Synthetic-Circulatory-System-I00-I99
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- text-classification
- text-generation
tags:
- medical
- icd-10
- clinical-notes
- medgemma
- synthetic
- healthcare
- diagnosis-coding
- cardiology
size_categories:
- 1K<n<10K
pretty_name: Diseases of the Circulatory System I00-I99
---
# MedGemma ICD-10 Clinical Notes Dataset — Circulatory System
Synthetic clinical notes generated by **MedGemma-4B-IT** for fine-tuning ICD-10-CM diagnosis code prediction models. Focused on **Chapter 9: Diseases of the Circulatory System (I00-I99)**.
## Dataset Summary
| Split | Examples | Unique ICD-10 Codes |
|-------|----------|---------------------|
| Train | 6,275 | 1,255 |
Each example is a synthetic clinical note, written to resemble real documentation, paired with its ICD-10-CM diagnosis code and formatted as a chat conversation for instruction fine-tuning.
## How It Was Generated
Clinical notes were generated with **MedGemma-4B-IT** (Google's medical LLM), loaded locally with 4-bit NF4 quantization. This is a form of **self-distillation** when the notes are used to fine-tune MedGemma itself. For each of the 1,255 billable I-codes in ICD-10-CM 2026, the model generated 5 clinical notes (6,275 in total) with:
- **10 prompt templates** — varying documentation styles (SOAP notes, H&P, progress notes, consultation reports, brief assessments)
- **Randomized demographics** — patient ages 18-89, male/female
- **Weighted clinical settings** — cardiology outpatient clinic, emergency department, cardiac catheterization lab, inpatient cardiac unit, primary care office, vascular surgery clinic, cardiac rehabilitation center
- **No data leakage** — the model was explicitly instructed to never mention ICD codes or state the exact diagnosis name, only describe the clinical presentation
Generation took ~80 hours on a single NVIDIA RTX 5070 (12 GB VRAM).
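
The randomization described above can be sketched as follows. This is a minimal illustration, not the author's generation script: the card does not publish the ten templates or the setting weights, so `TEMPLATE`, `STYLES`, the weights in `SETTINGS`, and the helper `build_prompts` are all hypothetical stand-ins.

```python
import random

# Hypothetical stand-ins: the card names five note styles and seven settings
# but does not publish the ten actual templates or the setting weights.
STYLES = ["SOAP note", "history and physical", "progress note",
          "consultation report", "brief assessment"]
SETTINGS = {
    "cardiology outpatient clinic": 3,
    "emergency department": 3,
    "inpatient cardiac unit": 2,
    "primary care office": 2,
    "cardiac catheterization lab": 1,
    "vascular surgery clinic": 1,
    "cardiac rehabilitation center": 1,
}
TEMPLATE = (
    "Write a {style} for a {age}-year-old {sex} patient seen in the {setting}. "
    "The underlying condition is: {description}. "
    "Describe only the clinical presentation. Do not mention ICD codes "
    "or state the exact diagnosis name."
)
NOTES_PER_CODE = 5


def build_prompts(description, rng):
    """Return NOTES_PER_CODE prompts with randomized style, demographics, and setting."""
    prompts = []
    for _ in range(NOTES_PER_CODE):
        # Weighted draw: settings with larger weights are sampled more often.
        setting = rng.choices(list(SETTINGS), weights=list(SETTINGS.values()), k=1)[0]
        prompts.append(TEMPLATE.format(
            style=rng.choice(STYLES),
            age=rng.randint(18, 89),  # inclusive on both ends
            sex=rng.choice(["male", "female"]),
            setting=setting,
            description=description,
        ))
    return prompts


rng = random.Random(0)  # seeded so the draw is reproducible
prompts = build_prompts(
    "Atherosclerotic heart disease of native coronary artery without angina pectoris",
    rng,
)
print(len(prompts))
```

Each prompt would then be passed to the model once, giving five notes per code.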
## Data Format
Each record contains:
```json
{
"messages": [
{
"role": "user",
"content": "Given the following clinical note, predict the ICD-10-CM diagnosis code:\n\n<clinical note text>"
},
{
"role": "assistant",
"content": "ICD-10-CM Code: I25.10\nDiagnosis: Atherosclerotic heart disease of native coronary artery without angina pectoris\nShort: Athscl heart disease of native cor art w/o ang pctrs"
}
],
"code": "I2510",
"category": "Circulatory System",
"clinical_note": "<raw clinical note text>"
}
```
### Fields
| Field | Description |
|-------|-------------|
| `messages` | Chat-format conversation (user prompt + assistant target) ready for instruction fine-tuning |
| `code` | Raw ICD-10-CM code without the dot separator (e.g. `I2510` for I25.10) |
| `category` | ICD-10 chapter — always "Circulatory System" for this dataset |
| `clinical_note` | The generated clinical note (same text embedded in the user message) |
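
To make the relationship between the fields concrete, here is a small sketch that assembles one record in the layout shown above. The helper names `dot_code` and `build_record` are illustrative and do not come from the dataset's tooling.

```python
def dot_code(raw):
    """Insert the dot after the third character: 'I2510' -> 'I25.10'."""
    return raw if len(raw) <= 3 else f"{raw[:3]}.{raw[3:]}"


def build_record(raw_code, long_desc, short_desc, note):
    """Assemble one record: chat messages plus the flat metadata fields."""
    user = ("Given the following clinical note, predict the ICD-10-CM "
            f"diagnosis code:\n\n{note}")
    assistant = (f"ICD-10-CM Code: {dot_code(raw_code)}\n"
                 f"Diagnosis: {long_desc}\n"
                 f"Short: {short_desc}")
    return {
        "messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ],
        "code": raw_code,
        "category": "Circulatory System",
        "clinical_note": note,
    }


record = build_record(
    "I2510",
    "Atherosclerotic heart disease of native coronary artery without angina pectoris",
    "Athscl heart disease of native cor art w/o ang pctrs",
    "<clinical note text>",
)
print(record["messages"][1]["content"])
```

Note that `code` stores the undotted form while the assistant message uses the dotted form, so any evaluation code needs to normalize one to the other.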
## Clinical Note Statistics
| Metric | Value |
|--------|-------|
| Average length | 2,021 characters (~280 words) |
| Minimum length | 1,202 characters (~170 words) |
| Maximum length | 2,549 characters (~360 words) |
| Note styles | SOAP, H&P, progress, consultation, assessment |
| Augmentation | 5 notes per ICD-10 code |
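
Statistics of this kind can be recomputed from the `clinical_note` field. The sketch below assumes words are counted by splitting on whitespace, which may differ from how the approximate word counts in the table were derived; `note_stats` is a hypothetical helper.

```python
from statistics import mean


def note_stats(notes):
    """Compute character and word length statistics for a list of notes."""
    char_lens = [len(n) for n in notes]
    word_lens = [len(n.split()) for n in notes]  # whitespace tokenization
    return {
        "avg_chars": mean(char_lens),
        "min_chars": min(char_lens),
        "max_chars": max(char_lens),
        "avg_words": mean(word_lens),
    }


# Two toy notes stand in for the real `clinical_note` values.
sample = ["Patient reports chest pain.", "No acute distress noted on exam today."]
stats = note_stats(sample)
print(stats)
```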
## ICD-10 Coverage
- **Chapter**: 9 — Diseases of the Circulatory System
- **Code range**: I00–I99
- **Total billable codes**: 1,255
- **Source**: CMS ICD-10-CM 2026 code descriptions (`icd10cm_order_2026.txt`)
Covers conditions including:
- Acute rheumatic fever (I00-I02)
- Chronic rheumatic heart diseases (I05-I09)
- Hypertensive diseases (I10-I16)
- Ischemic heart diseases (I20-I25)
- Pulmonary heart disease (I26-I28)
- Other forms of heart disease (I30-I52)
- Cerebrovascular diseases (I60-I69)
- Diseases of arteries, arterioles & capillaries (I70-I79)
- Diseases of veins & lymphatics (I80-I89)
- Other circulatory disorders (I95-I99)
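
The billable code list can be derived from the CMS order file named above. The sketch below assumes the fixed-width layout commonly documented for that file (order number in columns 1-5, code in 7-13, billable flag in 15, short description in 17-76, long description from 78); verify this against the CMS file documentation before relying on it. It also assumes every code beginning with `I` belongs to Chapter 9. The helper `make_line` exists only to build demo rows.

```python
def parse_order_line(line):
    """Split one fixed-width line into (code, billable, short_desc, long_desc)."""
    code = line[6:13].strip()
    billable = line[14:15] == "1"  # "1" = valid for submission, "0" = header
    short_desc = line[16:76].strip()
    long_desc = line[77:].strip()
    return code, billable, short_desc, long_desc


def billable_circulatory_codes(lines):
    """Yield (code, short_desc, long_desc) for billable codes starting with 'I'."""
    for line in lines:
        code, billable, short_desc, long_desc = parse_order_line(line)
        if billable and code.startswith("I"):
            yield code, short_desc, long_desc


def make_line(order, code, flag, short, long_desc):
    """Demo-only helper that formats a row in the assumed layout."""
    return f"{order:05d} {code:<7} {flag} {short:<60} {long_desc}"


demo = [
    make_line(1, "I25", 0, "Chronic ischemic heart disease",
              "Chronic ischemic heart disease"),
    make_line(2, "I2510", 1, "Athscl heart disease of native cor art w/o ang pctrs",
              "Atherosclerotic heart disease of native coronary artery "
              "without angina pectoris"),
    make_line(3, "J00", 1, "Acute nasopharyngitis [common cold]",
              "Acute nasopharyngitis [common cold]"),
]
codes = list(billable_circulatory_codes(demo))
print(codes)
```

With the real file, `billable_circulatory_codes(open("icd10cm_order_2026.txt", encoding="utf-8"))` would be expected to yield the 1,255 codes reported here if the assumptions hold.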
## Intended Use
- **Fine-tuning** medical LLMs for automated ICD-10 diagnosis coding
- **Benchmarking** clinical NLP models on structured code prediction
- **Research** into synthetic medical data generation and self-distillation
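
For benchmarking, one simple metric is exact-match accuracy on the predicted code. The sketch below assumes model outputs follow the `ICD-10-CM Code: ...` pattern of the assistant target; `extract_code` and `exact_match_accuracy` are illustrative names, and other metrics (for example, category-level match on the first three characters) may be more appropriate for some studies.

```python
import re

# Matches the first line of the assistant target, e.g. "ICD-10-CM Code: I25.10".
CODE_LINE = re.compile(r"ICD-10-CM Code:\s*([A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?)")


def extract_code(text):
    """Return the predicted code without the dot, or None if no code is found."""
    match = CODE_LINE.search(text)
    return match.group(1).replace(".", "") if match else None


def exact_match_accuracy(outputs, gold_codes):
    """Fraction of outputs whose extracted code equals the gold `code` field."""
    hits = sum(extract_code(o) == g for o, g in zip(outputs, gold_codes))
    return hits / len(gold_codes)


outputs = [
    "ICD-10-CM Code: I25.10\nDiagnosis: Atherosclerotic heart disease",
    "ICD-10-CM Code: I11.9\nDiagnosis: Hypertensive heart disease",
]
gold = ["I2510", "I10"]
print(exact_match_accuracy(outputs, gold))
```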
## Limitations
- **Synthetic data** — generated by an LLM, not sourced from real clinical records
- **Single chapter** — covers only Circulatory System (I00-I99), not the full ICD-10-CM
- **Single diagnosis** — each note maps to one code; real encounters often have multiple diagnoses
- **No validation by medical professionals** — notes may contain clinical inaccuracies
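
Because the notes are unvalidated, a quick automated screen can at least catch notes that mention an I-code or the diagnosis name verbatim, which the generation prompt was meant to prevent. The sketch below is a heuristic only: the regular expression and the `leak_flags` helper are assumptions, and it does not detect paraphrased diagnoses or clinical inaccuracies.

```python
import re

# Looks for I-chapter codes in dotted or undotted form, e.g. "I25.10" or "I2510".
I_CODE = re.compile(r"\bI\d{2}(?:\.?[0-9A-Z]{1,4})?\b")


def leak_flags(note, diagnosis_name):
    """Flag a note that contains an I-code or the exact diagnosis name."""
    return {
        "mentions_code": bool(I_CODE.search(note)),
        "mentions_diagnosis": diagnosis_name.lower() in note.lower(),
    }


flags = leak_flags("Assessment: I25.10, stable on current therapy.",
                   "Atherosclerotic heart disease")
print(flags)
```

Notes flagged by either check could be reviewed manually or dropped before fine-tuning.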
## Loading the Dataset
```python
from datasets import load_dataset

# Load from a local copy of the JSON file
dataset = load_dataset("json", data_files="train_data.json")

# Or load from the Hugging Face Hub
# dataset = load_dataset("singhankit16/ICD-10-LLM-generated-Synthetic-Circulatory-System-I00-I99")
```
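
The dataset ships with a single train split, so an evaluation set has to be carved out by the user. One option, sketched below under the assumption that records are plain dictionaries with a `code` field, is to hold out one note per code; `holdout_per_code` is a hypothetical helper, not part of the `datasets` library.

```python
import random
from collections import defaultdict


def holdout_per_code(records, rng, n_holdout=1):
    """Split records so each code contributes n_holdout notes to the held-out set."""
    by_code = defaultdict(list)
    for rec in records:
        by_code[rec["code"]].append(rec)
    train, heldout = [], []
    for code in sorted(by_code):  # sorted for a deterministic order
        group = by_code[code][:]
        rng.shuffle(group)
        heldout.extend(group[:n_holdout])
        train.extend(group[n_holdout:])
    return train, heldout


# Toy records: two codes with five notes each, mirroring the 5-per-code layout.
records = [{"code": c, "clinical_note": f"note {i}"}
           for c in ("I10", "I2510") for i in range(5)]
train, heldout = holdout_per_code(records, random.Random(0))
print(len(train), len(heldout))
```

Applied to the full dataset, this would give 1,255 held-out notes and 5,020 training notes. Every held-out code also appears in training, so the split measures generalization to new notes, not to unseen codes.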
## Citation
If you use this dataset, please cite the repository:
```bibtex
@misc{medgemma_icd10_finetuning,
title={Fine-Tuning MedGemma-4B for ICD-10 Diagnosis Coding},
author={singhak-abbvie},
year={2026}
}
```
## Disclaimer
This dataset is for **research and educational purposes only**. It is not intended for clinical use without proper validation. Always consult certified medical coders and healthcare professionals for production ICD-10 coding.
Provider:
singhankit16



