
singhankit16/ICD-10-LLM-generated-Synthetic-Circulatory-System-I00-I99

Hugging Face | Updated 2026-04-16 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/singhankit16/ICD-10-LLM-generated-Synthetic-Circulatory-System-I00-I99
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- text-classification
- text-generation
tags:
- medical
- icd-10
- clinical-notes
- medgemma
- synthetic
- healthcare
- diagnosis-coding
- cardiology
size_categories:
- 1K<n<10K
pretty_name: Disease of Circulatory System I00-I99
---

# MedGemma ICD-10 Clinical Notes Dataset: Circulatory System

Synthetic clinical notes generated by **MedGemma-4B-IT** for fine-tuning ICD-10-CM diagnosis code prediction models. Focused on **Chapter 9: Diseases of the Circulatory System (I00-I99)**.

## Dataset Summary

| Split | Examples | Unique ICD-10 Codes |
|-------|----------|---------------------|
| Train | 6,275 | 1,255 |

Each example is a realistic clinical note paired with its ICD-10-CM diagnosis code, formatted as a chat conversation for instruction fine-tuning.

## How It Was Generated

Clinical notes were generated using **MedGemma-4B-IT** (Google's medical LLM) loaded locally with 4-bit NF4 quantization; the approach is a form of **self-distillation**. For each of the 1,255 billable I-codes in ICD-10-CM 2026, the model generated 5 clinical notes with:

- **10 prompt templates**: varying documentation styles (SOAP notes, H&P, progress notes, consultation reports, brief assessments)
- **Randomized demographics**: patient ages 18-89, male/female
- **Weighted clinical settings**: cardiology outpatient clinic, emergency department, cardiac catheterization lab, inpatient cardiac unit, primary care office, vascular surgery clinic, cardiac rehabilitation center
- **No data leakage**: the model was explicitly instructed to never mention ICD codes or state the exact diagnosis name, only describe the clinical presentation

Generation took ~80 hours on a single NVIDIA RTX 5070 (12GB VRAM).
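The sampling scheme described above (style, demographics, weighted setting, no-leakage instruction) can be sketched roughly as follows. This is an illustration only, not the author's generation script: the prompt wording, the `build_prompts` helper, and the setting weights are all assumptions, since the card lists the styles and settings but not the actual templates or weights.

```python
import random

# Documentation styles named in the card. The real pipeline used 10 prompt
# templates; their text is not published, so one stub per style is assumed.
STYLES = [
    "SOAP note",
    "history and physical (H&P)",
    "progress note",
    "consultation report",
    "brief assessment",
]

# Clinical settings named in the card. The weights are hypothetical.
SETTING_WEIGHTS = {
    "cardiology outpatient clinic": 3,
    "emergency department": 3,
    "cardiac catheterization lab": 1,
    "inpatient cardiac unit": 2,
    "primary care office": 2,
    "vascular surgery clinic": 1,
    "cardiac rehabilitation center": 1,
}

NOTES_PER_CODE = 5  # the card states 5 notes per ICD-10 code


def build_prompts(description, rng, n=NOTES_PER_CODE):
    """Build n randomized generation prompts for one code description."""
    settings = list(SETTING_WEIGHTS)
    weights = list(SETTING_WEIGHTS.values())
    prompts = []
    for _ in range(n):
        style = rng.choice(STYLES)
        age = rng.randint(18, 89)  # inclusive on both ends
        sex = rng.choice(["male", "female"])
        setting = rng.choices(settings, weights=weights, k=1)[0]
        prompts.append(
            f"Write a {style} for a {age}-year-old {sex} patient seen in the {setting}. "
            f"The underlying condition is: {description}. "
            "Describe only the clinical presentation; never mention ICD codes "
            "or state the exact diagnosis name."
        )
    return prompts


rng = random.Random(0)  # fixed seed so the sketch is reproducible
prompts = build_prompts(
    "Atherosclerotic heart disease of native coronary artery without angina pectoris",
    rng,
)
print(len(prompts))
```

Running this over all 1,255 code descriptions would yield 6,275 prompts, which matches the example count in the summary table.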
## Data Format

Each record contains:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Given the following clinical note, predict the ICD-10-CM diagnosis code:\n\n<clinical note text>"
    },
    {
      "role": "assistant",
      "content": "ICD-10-CM Code: I25.10\nDiagnosis: Atherosclerotic heart disease of native coronary artery without angina pectoris\nShort: Athscl heart disease of native cor art w/o ang pctrs"
    }
  ],
  "code": "I2510",
  "category": "Circulatory System",
  "clinical_note": "<raw clinical note text>"
}
```

### Fields

| Field | Description |
|-------|-------------|
| `messages` | Chat-format conversation (user prompt + assistant target) ready for instruction fine-tuning |
| `code` | Raw ICD-10-CM code (without dot separator) |
| `category` | ICD-10 chapter; always "Circulatory System" for this dataset |
| `clinical_note` | The generated clinical note (same text embedded in the user message) |

## Clinical Note Statistics

| Metric | Value |
|--------|-------|
| Average length | 2,021 characters (~280 words) |
| Minimum length | 1,202 characters (~170 words) |
| Maximum length | 2,549 characters (~360 words) |
| Note styles | SOAP, H&P, progress, consultation, assessment |
| Augmentation | 5 notes per ICD-10 code |

## ICD-10 Coverage

- **Chapter**: 9, Diseases of the Circulatory System
- **Code range**: I00–I99
- **Total billable codes**: 1,255
- **Source**: CMS ICD-10-CM 2026 code descriptions (`icd10cm_order_2026.txt`)

Covers conditions including:

- Acute rheumatic fever (I00-I02)
- Chronic rheumatic heart diseases (I05-I09)
- Hypertensive diseases (I10-I16)
- Ischemic heart diseases (I20-I25)
- Pulmonary heart disease (I26-I28)
- Other forms of heart disease (I30-I52)
- Cerebrovascular diseases (I60-I69)
- Diseases of arteries, arterioles & capillaries (I70-I79)
- Diseases of veins & lymphatics (I80-I89)
- Other circulatory disorders (I95-I99)

## Intended Use

- **Fine-tuning** medical LLMs for automated ICD-10 diagnosis coding
- **Benchmarking** clinical NLP models on structured code prediction
- **Research** into synthetic medical data generation and self-distillation

## Limitations

- **Synthetic data**: generated by an LLM, not sourced from real clinical records
- **Single chapter**: covers only Circulatory System (I00-I99), not the full ICD-10-CM
- **Single diagnosis**: each note maps to one code; real encounters often have multiple diagnoses
- **No validation by medical professionals**: notes may contain clinical inaccuracies

## Loading the Dataset

```python
from datasets import load_dataset

# Load from local directory
dataset = load_dataset("json", data_files="train_data.json")

# Or load from Hugging Face Hub (after upload)
# dataset = load_dataset("YOUR_USERNAME/medgemma-icd10-circulatory")
```

## Citation

If you use this dataset, please cite the repository:

```bibtex
@misc{medgemma_icd10_finetuning,
  title={Fine-Tuning MedGemma-4B for ICD-10 Diagnosis Coding},
  author={singhak-abbvie},
  year={2026}
}
```

## Disclaimer

This dataset is for **research and educational purposes only**. It is not intended for clinical use without proper validation. Always consult certified medical coders and healthcare professionals for production ICD-10 coding.
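As a practical footnote to the Data Format section, the record layout lends itself to a quick consistency check before fine-tuning. The sketch below is not part of the dataset tooling: `check_record` is a hypothetical helper, and the sample record reuses the card's example with a placeholder note rather than real dataset text. It assumes every assistant target starts with an `ICD-10-CM Code:` line, as in the card's example.

```python
import re

# Matches the first line of the assistant target, e.g. "ICD-10-CM Code: I25.10".
CODE_LINE = re.compile(r"^ICD-10-CM Code:\s*(\S+)", re.MULTILINE)


def check_record(record):
    """Return a list of consistency problems (empty if the record looks fine)."""
    user, assistant = record["messages"]
    problems = []
    match = CODE_LINE.search(assistant["content"])
    if match is None:
        problems.append("assistant message has no 'ICD-10-CM Code:' line")
    elif match.group(1).replace(".", "") != record["code"]:
        # The `code` field stores the code without the dot separator.
        problems.append("dotted code in the target does not match the `code` field")
    if record["clinical_note"] not in user["content"]:
        problems.append("clinical note is not embedded in the user message")
    return problems


# Placeholder note text, not taken from the dataset.
note = "Placeholder clinical note text."
example = {
    "messages": [
        {
            "role": "user",
            "content": "Given the following clinical note, predict the "
                       "ICD-10-CM diagnosis code:\n\n" + note,
        },
        {
            "role": "assistant",
            "content": "ICD-10-CM Code: I25.10\n"
                       "Diagnosis: Atherosclerotic heart disease of native "
                       "coronary artery without angina pectoris\n"
                       "Short: Athscl heart disease of native cor art w/o ang pctrs",
        },
    ],
    "code": "I2510",
    "category": "Circulatory System",
    "clinical_note": note,
}

print(check_record(example))
```

Applied to a loaded split, the same function could be mapped over every row to flag records whose chat target and raw fields disagree.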
Provider:
singhankit16