CK4Gen, High Utility Synthetic Survival Datasets
Source: Figshare | Updated: 2024-11-05 | Indexed: 2026-04-08
Download link:
https://figshare.com/articles/dataset/CK4Gen_High_Utility_Synthetic_Survival_Datasets/27611388/1
Description:
<b>###===###</b><br>
<b>Overview:</b><br>
This repository provides high-utility synthetic survival datasets generated with the CK4Gen framework, optimised to retain critical clinical characteristics for use in research and education. Each synthetic dataset is derived from a carefully curated ground truth dataset that was processed with standardised variable definitions and analytical approaches, giving a consistent baseline for survival analysis.<br>
<b>###===###</b><br>
<b>Description:</b><br>
The repository includes synthetic versions of four widely used, publicly accessible survival analysis datasets. Each is anchored in a foundational study and derived from an established variation of that study's data, which we treat as the ground truth.<br>
<b>#---</b><br>
<b>GBSG2</b>: Based on Schumacher <i>et al.</i> [1]. The study evaluated the effects of hormonal treatment and chemotherapy duration in node-positive breast cancer patients, tracking recurrence-free and overall survival among 686 women over a median of 5 years. Our synthetic version is derived from a variation of the GBSG2 dataset available in the <i>lifelines</i> package [2], formatted to match the descriptions in Sauerbrei <i>et al.</i> [3], which we treat as the ground truth.<br>
<b>ACTG320</b>: Based on Hammer <i>et al.</i> [4]. The study investigated the impact of adding the protease inhibitor indinavir to a standard two-drug regimen for HIV-1 treatment. The original clinical trial involved 1,151 patients with prior zidovudine exposure and low CD4 cell counts, tracking outcomes over a median follow-up of 38 weeks. Our synthetic version is derived from a variation of the ACTG320 dataset available in the <i>sksurv</i> package [5], which we treat as the ground truth.<br>
<b>WHAS500</b>: Based on Goldberg <i>et al.</i> [6]. The study followed 500 patients to investigate survival rates following acute myocardial infarction (MI), capturing a range of factors influencing MI incidence and outcomes. Our synthetic version is derived from a variation available in the <i>sksurv</i> package [5], which we treat as the ground truth.<br>
<b>FLChain</b>: Based on Dispenzieri <i>et al.</i> [7]. The study assessed the prognostic relevance of serum immunoglobulin free light chains (FLCs) for overall survival in a large cohort of 15,859 participants. Our synthetic version is derived from a variation available in the <i>sksurv</i> package [5], which we treat as the ground truth.<br>
<b>###===###</b><br>
<b>Notes:</b><br>
An in-depth discussion of these datasets and their generation process is available in our paper: https://arxiv.org/abs/2410.16872<br><br>
Kuo, et al. "CK4Gen: A Knowledge Distillation Framework for Generating High-Utility Synthetic Survival Datasets in Healthcare." <i>arXiv preprint arXiv:2410.16872</i> (2024).<br>
<b>###===###</b><br>
<b>References:</b><br>
[1]: Schumacher, et al. “Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group”, <i>Journal of Clinical Oncology</i>, 1994.<br>
[2]: Davidson-Pilon, “lifelines: Survival Analysis in Python”, <i>Journal of Open Source Software</i>, 2019.<br>
[3]: Sauerbrei, et al. “Modelling the effects of standard prognostic factors in node-positive breast cancer”, <i>British Journal of Cancer</i>, 1999.<br>
[4]: Hammer, et al. “A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less”, <i>New England Journal of Medicine</i>, 1997.<br>
[5]: Pölsterl, “scikit-survival: A library for time-to-event analysis built on top of scikit-learn”, <i>Journal of Machine Learning Research</i>, 2020.<br>
[6]: Goldberg, et al. “Incidence and case fatality rates of acute myocardial infarction (1975–1984): the Worcester Heart Attack Study”, <i>American Heart Journal</i>, 1988.<br>
[7]: Dispenzieri, et al. “Use of nonclonal serum immunoglobulin free light chains to predict overall survival in the general population”, <i>Mayo Clinic Proceedings</i>, 2012.
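Because each synthetic dataset is meant to preserve the survival characteristics of its ground truth, one simple sanity check a reader might run is to compare Kaplan-Meier survival curves computed on both. The sketch below is a minimal pure-Python Kaplan-Meier estimator, not the evaluation procedure from the CK4Gen paper. The function name `kaplan_meier` and the toy durations and event flags are illustrative assumptions and are not taken from any of the four datasets.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time of each subject.
    events: 1 if the event was observed at that time, 0 if censored.
    """
    # Number of observed events at each time point.
    observed = Counter(t for t, e in zip(durations, events) if e)
    # Number of subjects leaving the risk set (event or censoring) at each time.
    exits = Counter(durations)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = observed.get(t, 0)
        if d:
            # Product-limit update: multiply by the fraction surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Subjects with an event or censoring at t are no longer at risk.
        at_risk -= exits[t]
    return curve


# Toy, hypothetical data for illustration only.
toy_durations = [5, 8, 8, 12, 15, 20]
toy_events = [1, 1, 0, 1, 0, 1]
toy_curve = kaplan_meier(toy_durations, toy_events)

for time, prob in toy_curve:
    print(f"t={time}: S(t)={prob:.3f}")
```

In practice the same function could be applied to the duration and event columns of a synthetic dataset and of its ground truth, and the two curves compared side by side. The column names depend on the individual files and are not assumed here.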
Provider:
Kuo, Nicholas
Created:
2024-11-05



