
Synthetic Synthea patient datasets for lung cancer risk prediction machine learning

Source: Mendeley Data. Updated: 2024-01-31. Indexed: 2024-06-26.
Download link:
https://data.mendeley.com/datasets/b24cb4nn8h
Description:
These synthetic patient datasets were created for machine learning (ML) studies of lung cancer risk prediction and for simulation studies of learning health systems (LHS).

1. In subfolder "unconverted": Five populations of 30K patients each were generated with the Synthea patient generator. About 1,100 lung cancer patients and 3,000 control patients (without lung cancer) were selected, and their electronic health records (EHR) were processed into data table files ready for machine learning with common algorithms such as XGBoost.

2. In root directory: The five 30K-patient datasets were combined sequentially to form five datasets of increasing size, from 30K to 150K patients. The combined datasets were resampled to keep all lung cancer patients plus about 3x as many control patients. In these ML-ready table files, the continuous numeric values were also converted to categorical values.

Because Synthea patients closely resemble real patients, the Synthea patient data can be used to develop and test ML algorithms and pipelines, and to train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns.

The first LHS simulation study, titled "Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data", has been published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).
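The combine, resample, and convert steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the populations are scaled down to 3,000 toy records each, and the field names (`patient_id`, `age`, `lung_cancer`, `age_group`), the helper functions, the case rate, and the age cut points are all assumptions made for the example.

```python
import random


def make_population(n, case_rate, seed, start_id=0):
    """Generate a toy population (hypothetical stand-in for Synthea output)."""
    rng = random.Random(seed)
    return [
        {
            "patient_id": start_id + i,
            "age": rng.randint(30, 90),
            "lung_cancer": rng.random() < case_rate,
        }
        for i in range(n)
    ]


def resample(patients, control_ratio=3, seed=0):
    """Keep every lung cancer case plus control_ratio controls per case."""
    cases = [p for p in patients if p["lung_cancer"]]
    controls = [p for p in patients if not p["lung_cancer"]]
    k = min(len(controls), control_ratio * len(cases))
    rng = random.Random(seed)
    return cases + rng.sample(controls, k)


def bin_age(age):
    """Convert a continuous age to a categorical label (illustrative cut points)."""
    if age < 50:
        return "<50"
    if age < 65:
        return "50-64"
    if age < 80:
        return "65-79"
    return "80+"


def build_datasets(populations, control_ratio=3):
    """Combine populations cumulatively, then resample and bin each combination."""
    datasets = []
    combined = []
    for pop in populations:
        combined = combined + pop  # population 1, then 1+2, then 1+2+3, ...
        sample = resample(combined, control_ratio)
        datasets.append([{**p, "age_group": bin_age(p["age"])} for p in sample])
    return datasets


# Five toy populations, scaled down from 30K to 3,000 patients each.
populations = [
    make_population(3000, case_rate=0.04, seed=s, start_id=s * 3000)
    for s in range(5)
]
datasets = build_datasets(populations)

for i, d in enumerate(datasets, start=1):
    n_cases = sum(p["lung_cancer"] for p in d)
    print(f"dataset {i}: {len(d)} rows, {n_cases} cases, {len(d) - n_cases} controls")
```

Each successive dataset draws on one more population than the previous one, so the number of retained cases grows while the control-to-case ratio stays near 3:1.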

Created: 2024-01-31