SIMpat: a synthetic benchmark for similarity metrics on patient representations
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/10830065
Description:
Introduction
We used Synthea to generate six cohorts of patients, each with one specified disease. Please refer to the Synthea documentation for details of the generation process.
We selected six diseases that Synthea can generate and that a medical professional deemed “different enough”. The goal of this simulation is to find a metric that can differentiate between patients.
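One possible way to quantify how well a metric "differentiates between patients" is to compare the average distance between patients of different cohorts with the average distance between patients of the same cohort. The sketch below is only an illustration under that assumption: the `separation_ratio` function, the toy 2-D points and the Euclidean distance are hypothetical and are not taken from the benchmark itself.

```python
from itertools import combinations


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def separation_ratio(points, labels, distance):
    """Mean inter-cohort distance divided by mean intra-cohort distance.

    Values well above 1 suggest the metric keeps cohorts apart.
    This score is an assumption for illustration, not the benchmark's own.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = distance(points[i], points[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return (sum(inter) / len(inter)) / (sum(intra) / len(intra))


# Hypothetical patient representations: two patients per cohort.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = ["hypertension", "hypertension", "dialysis", "dialysis"]

ratio = separation_ratio(points, labels, euclidean)
print(round(ratio, 2))
```

Any of the 12 metrics could be passed in place of `euclidean`, provided it is wrapped as a function of two patient representations.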
The six conditions are:
Cerebral Palsy (SNOMED-CT code : 128188000 - Cerebral palsy (disorder))
Colorectal Cancer (SNOMED-CT code : 93761005 - Primary malignant neoplasm of colon (disorder))
Dialysis (SNOMED-CT code : 265764009 - Renal dialysis (procedure))
Hypertension (SNOMED-CT code : 59621000 - Essential hypertension (disorder))
Breast Cancer (SNOMED-CT code : 254837009 - Malignant neoplasm of breast (disorder))
Prostate Cancer (SNOMED-CT code : 126906006 - Neoplasm of prostate (disorder))
NB:
Dialysis is not a disorder but a procedure; it is used here as a proxy for renal issues.
Synthea has no module that generates prostate cancer in men in general, only one for prostate cancer in veterans, so that module is used here (all men with prostate cancer are veterans).
We use these cohorts to compare the ability of 12 different distance metrics to separate patients.
These 12 metrics are split into three groups:
Semantic-based metrics:
AvgEmb* encodes a text by averaging the pre-trained word embeddings of all the words present in it.
BERT* uses a bidirectional transformer-based neural model to solve the task of masked language modeling.
Universal Sentence Encoder (USE)* uses transformer-based encoders to encode sentences into embedding vectors.
Embeddings from Language Models (ELMo)* uses bidirectional LSTM-based encoders to encode a sentence into a fixed-size representation.
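To make the AvgEmb idea concrete, here is a minimal sketch of averaging word vectors and comparing two texts. The 3-dimensional vectors in `TOY_EMBEDDINGS` are invented for illustration (real pre-trained embeddings have hundreds of dimensions), and cosine distance is our assumption for the comparison step, not necessarily what the benchmark uses.

```python
import math

# Hypothetical toy word vectors, NOT real pre-trained embeddings.
TOY_EMBEDDINGS = {
    "renal": [0.9, 0.1, 0.0],
    "dialysis": [0.8, 0.2, 0.1],
    "essential": [0.1, 0.7, 0.2],
    "hypertension": [0.2, 0.9, 0.1],
}


def avg_embedding(text, embeddings):
    """Average the vectors of all known words in `text`; unknown words are skipped."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("no known words in text")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


dialysis_vec = avg_embedding("Renal dialysis", TOY_EMBEDDINGS)
hypertension_vec = avg_embedding("Essential hypertension", TOY_EMBEDDINGS)

# A text compared with itself is at distance ~0; different texts are further apart.
print(cosine_distance(dialysis_vec, dialysis_vec))
print(cosine_distance(dialysis_vec, hypertension_vec))
```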
Graph-based metrics:
DeepWalk* uses random walks to generate sequences of vertices (vertex sentences), which are subsequently fed to a skip-gram model to learn the embeddings corresponding to the vertices.
Node2Vec* uses biased random walks to optimize a neighborhood-preserving objective function, such that nodes which are highly interconnected and nodes with similar roles in the graph are closer in the embedding space.
LINE* tries to directly optimize the vertex embeddings based on one-hop and two-hop random walk probabilities.
HARP* proposes a meta-strategy for embedding the vertices of a graph such that they preserve the higher-order structural features.
Bags of findings^
Average Links^
Average Links Weighted by Information Content (IC)^
Path Distance weighted by IC^
Metrics followed by a * are taken from this paper and can be downloaded here
Metrics followed by a ^ were developed by Jean-Virgile Voegeli (SIMED)
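The DeepWalk entry above mentions "vertex sentences" produced by random walks. The sketch below shows only that first step, on a hypothetical toy graph whose node names are invented for illustration. Feeding the walks to a skip-gram model is left out, and the parameters `walks_per_node`, `length` and `seed` are our own naming.

```python
import random

# Hypothetical toy graph as adjacency lists; not the graph used in the benchmark.
GRAPH = {
    "disorder": ["neoplasm", "hypertension"],
    "neoplasm": ["disorder", "breast_cancer", "colon_cancer"],
    "hypertension": ["disorder"],
    "breast_cancer": ["neoplasm"],
    "colon_cancer": ["neoplasm"],
}


def random_walk(graph, start, length, rng):
    """Uniform random walk of at most `length` vertices starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph[walk[-1]]
        if not neighbours:  # dead end: stop early
            break
        walk.append(rng.choice(neighbours))
    return walk


def generate_walks(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every vertex, in shuffled order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, length, rng))
    return walks


walks = generate_walks(GRAPH, walks_per_node=2, length=4)
print(len(walks))  # 2 walks for each of the 5 vertices
```

Each walk plays the role of a sentence and each vertex the role of a word, which is what allows a word-embedding model to be reused on a graph.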
Descriptive analysis of the sample
We first look at the cohorts created by Synthea. The cohorts were created using the seed 123456789 for reproducibility.
For this first experiment, Synthea was asked to generate 100 living individuals for each disease. We asked Synthea to keep only 10 years of history. Each individual was set to be between 18 and 80 years old. Except for sex-specific diseases such as breast cancer and prostate cancer, all cohorts contain both male and female individuals. We used the default location, which is Massachusetts.
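The settings above map onto options of the Synthea command line roughly as sketched below. This is an assumption about how such a run could be invoked, not the exact command used for this dataset: the helper `build_synthea_command` and the keep-module path are hypothetical, and the attempt limit of 10,000 is our reading of the description. The function only builds the argument list and does not launch Synthea.

```python
def build_synthea_command(keep_module_path, gender=None, seed=123456789,
                          population=100, min_age=18, max_age=80,
                          state="Massachusetts"):
    """Assemble a Synthea invocation as an argument list (nothing is executed)."""
    cmd = [
        "./run_synthea",
        "-s", str(seed),                  # random seed for reproducibility
        "-p", str(population),            # target number of living patients
        "-a", f"{min_age}-{max_age}",     # age range
        "-k", keep_module_path,           # keep only patients matching this module
        "--exporter.years_of_history=10", # keep 10 years of history
        # Assumption: the "10,000 tries" correspond to this setting.
        "--generate.max_attempts_to_keep_patient=10000",
    ]
    if gender is not None:
        cmd += ["-g", gender]             # "M" or "F" for sex-specific cohorts
    cmd.append(state)
    return cmd


# Hypothetical keep-module file selecting patients with a given SNOMED-CT code.
cmd = build_synthea_command("keep_modules/breast_cancer.json", gender="F")
print(" ".join(cmd))
```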
One important note on age: Synthea modules sometimes specify a minimum age of onset for a condition or disease. For example, colorectal cancer can only onset after 50 years of age, and prostate cancer after 60.
Each Synthea run was allowed up to 10,000 attempts. If, after 10,000 attempts, the software had not managed to generate a patient that fit the criterion (here, a specific SNOMED code), the run would fail. Synthea can also generate patients who die before the “run date”; when this happens, it simulates another patient.
This explains why we have cohorts of more than 100 individuals but fewer than 100 living individuals. In certain cases we can also have slightly more than 100 individuals, because the Synthea generator is multicore and patients are generated simultaneously.
Created:
2024-08-28



