Novel multi-omics deconfounding variational autoencoders can obtain meaningful disease subtyping
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10458940
Description:
TCGA pan-cancer mRNA and DNA methylation data augmented with artificial confounders, as used in "Novel multi-omics deconfounding variational autoencoders can obtain meaningful disease subtyping" by Zuqi Li and Sonja Katz (manuscript in preparation).
The following data curation steps were carried out:
Step 1. Download data from TCGA
R package `TCGAbiolinks`
2547 patients (after step 2) across 6 cancer types:
BRCA (731)
THCA (408)
BLCA (387)
LUSC (297)
HNSC (412)
KIRC (312)
mRNA expression profiles
DNA methylation (DNAm) profiles
Clinical data:
tumor stage: i, ia, ib, ii, iia, iib, iii, iiia, iiib, iiic, iv, iva, ivb, ivc, x
age at diagnosis
race: 'white', 'black or african american', 'asian', 'american indian or alaska native'
gender
Step 2. Removal criteria
Patients with
NA or 'not reported' clinical data
race 'american indian or alaska native'
tumor stage x
mRNA and DNAm probes with
0 variance across all included patients
not shared across all cancer types
missing values
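The removal criteria above can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline (the record only says the data were downloaded with the R package `TCGAbiolinks`). The field names in `CLINICAL_FIELDS` and the function names are hypothetical, and the stage value `"x"` is assumed to be stored exactly as written in the list above.

```python
import numpy as np

EXCLUDED_RACE = "american indian or alaska native"
# Assumed field names; the real clinical table may label these differently.
CLINICAL_FIELDS = ("tumor_stage", "age_at_diagnosis", "race", "gender")


def keep_patient(record):
    """Return True if a patient record passes the Step 2 patient criteria."""
    for field in CLINICAL_FIELDS:
        value = record.get(field)
        if value is None or value == "not reported":
            return False
        if isinstance(value, float) and np.isnan(value):
            return False
    return record["race"] != EXCLUDED_RACE and record["tumor_stage"] != "x"


def shared_features(names_per_cancer):
    """Return the feature names present in every cancer type, sorted."""
    sets = [set(names) for names in names_per_cancer.values()]
    return sorted(set.intersection(*sets))


def keep_feature_mask(x):
    """Boolean mask over columns: no missing values and non-zero variance."""
    x = np.asarray(x, dtype=float)
    has_missing = np.isnan(x).any(axis=0)
    # The variance of a column containing NaN is NaN, and NaN > 0 is False,
    # so such columns are excluded here as well as by has_missing.
    nonconstant = np.var(x, axis=0) > 0
    return ~has_missing & nonconstant
```

Patients would be filtered first, so that the variance and missing-value checks run only over the included patients, as the criteria state.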
Step 3. Encode clinical variables and save datasets
mRNA dataset: 2547 patients x 58,456 mRNAs
DNAm dataset: 2547 patients x 232,088 DNAm
clinic dataset: 2547 patients x 6 variables
1. patient ID
2. tumor stage: i, ia, ib (1); ii, iia, iib (2); iii, iiia, iiib, iiic (3); iv, iva, ivb, ivc (4)
3. age at diagnosis
4. race: asian(1), black or african american(2), white(3)
5. gender: female(0), male(1)
6. cancer type: BRCA(1), THCA(2), BLCA(3), LUSC(4), HNSC(5), KIRC(6)
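As an illustration of the encoding, the lookup tables below restate the codes listed for the clinic dataset. The stage mapping assumes that the 14 listed codes correspond, in order, to the 14 stages remaining after stage x is removed. The field names and the `encode_patient` helper are hypothetical.

```python
# Sub-stages collapse onto their main stage (assumed from the order of the codes).
STAGE_CODES = {
    "i": 1, "ia": 1, "ib": 1,
    "ii": 2, "iia": 2, "iib": 2,
    "iii": 3, "iiia": 3, "iiib": 3, "iiic": 3,
    "iv": 4, "iva": 4, "ivb": 4, "ivc": 4,
}
RACE_CODES = {"asian": 1, "black or african american": 2, "white": 3}
GENDER_CODES = {"female": 0, "male": 1}
CANCER_CODES = {"BRCA": 1, "THCA": 2, "BLCA": 3, "LUSC": 4, "HNSC": 5, "KIRC": 6}


def encode_patient(record):
    """Map one raw clinical record (a dict) to the six encoded variables."""
    return {
        "patient_id": record["patient_id"],
        "tumor_stage": STAGE_CODES[record["tumor_stage"]],
        "age_at_diagnosis": record["age_at_diagnosis"],
        "race": RACE_CODES[record["race"]],
        "gender": GENDER_CODES[record["gender"]],
        "cancer_type": CANCER_CODES[record["cancer_type"]],
    }
```

A record with a value outside these tables raises `KeyError`, which makes any patient that should have been removed in Step 2 easy to spot.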
Step 4. Pre-process the datasets
mRNA dataset: 'TCGA_mRNAs_processed.csv'
Take the 2000 mRNAs with highest variance
Rescale every feature to [0,1]
--> 2547 patients x 2000 mRNAs
DNAm dataset: 'TCGA_DNAm_processed.csv'
Take the 2000 DNAm with highest variance
Rescale every feature to [0,1]
--> 2547 patients x 2000 DNAm
clinic dataset: 'TCGA_clinic.csv'
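The two pre-processing operations (variance-based feature selection followed by min-max rescaling) can be sketched with NumPy as below. The function name and the choice of population variance are assumptions; the record does not state which variance estimator or tie-breaking rule was used.

```python
import numpy as np


def top_variance_minmax(x, n_features=2000):
    """Keep the n_features highest-variance columns, then rescale each to [0, 1].

    Returns the rescaled matrix and the indices of the selected columns.
    """
    x = np.asarray(x, dtype=float)
    variances = x.var(axis=0)
    # Indices of the highest-variance columns, returned in original column order.
    top = np.sort(np.argsort(variances)[::-1][:n_features])
    selected = x[:, top]
    col_min = selected.min(axis=0)
    col_range = selected.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # guard against division by zero
    return (selected - col_min) / col_range, top
```

Applied to a patients x features matrix with at least 2000 columns, this yields a patients x 2000 matrix, matching the shapes listed above.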
Step 5. Simulate confounders (instructions can be found in Methods section of manuscript)
Linear confounder:
'TCGA_confounder_linear.csv' - linear confounding classes
'TCGA_DNAm_confounded_linear.csv' - linearly confounded DNAm data
'TCGA_mRNA2_confounded_linear.csv' - linearly confounded mRNA data
Squared confounder:
'TCGA_confounder.csv' - squared confounding classes
'TCGA_DNAm_confounded.csv' - squared confounded DNAm data
'TCGA_mRNA2_confounded.csv' - squared confounded mRNA data
Categorical confounder:
'TCGA_confounder_categ2.csv' - categorical confounding classes
'TCGA_DNAm_confounded_categ2.csv' - categorically confounded DNAm data
'TCGA_mRNA2_confounded_categ2.csv' - categorically confounded mRNA data
Multiple confounders - combined effect (linear + squared + categorical):
'TCGA_confounder_multi.csv' - confounding classes for combined effect
'TCGA_DNAm_confounded_multi.csv' - DNAm data with combined effect
'TCGA_mRNA2_confounded_multi.csv' - mRNA data with combined effect
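The file names above follow a regular pattern, with the squared confounder using no suffix. The small helper below resolves the three files of one confounding scenario, which may be convenient when looping over scenarios. The scenario keys (`"linear"`, `"squared"`, `"categorical"`, `"multi"`) and the function name are hypothetical labels; only the file names come from the listing.

```python
from pathlib import Path

# Suffixes as they appear in the file listing; the squared confounder has none.
SCENARIO_SUFFIX = {
    "linear": "_linear",
    "squared": "",
    "categorical": "_categ2",
    "multi": "_multi",
}


def scenario_files(scenario, data_dir="."):
    """Return the confounder, DNAm, and mRNA file paths for one scenario."""
    suffix = SCENARIO_SUFFIX[scenario]
    root = Path(data_dir)
    return {
        "confounder": root / f"TCGA_confounder{suffix}.csv",
        "DNAm": root / f"TCGA_DNAm_confounded{suffix}.csv",
        "mRNA": root / f"TCGA_mRNA2_confounded{suffix}.csv",
    }
```

How each confounder was simulated is described in the Methods section of the manuscript and is not reproduced here.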
Created:
2024-01-19



