
Novel multi-omics deconfounding variational autoencoders can obtain meaningful disease subtyping

Source: NIAID Data Ecosystem, indexed 2026-05-01
Download link:
https://zenodo.org/record/10458940
Resource description:
TCGA pan-cancer mRNA and DNA data augmented with artificial confounders, utilised in "Novel multi-omics deconfounding variational autoencoders can obtain meaningful disease subtyping" by Zuqi Li and Sonja Katz (manuscript in preparation). The following data curation steps were carried out:

Step 1. Download data from TCGA with the R package `TCGAbiolinks`
- 2547 patients (after Step 2) with 6 cancer types: BRCA (731), THCA (408), BLCA (387), LUSC (297), HNSC (412), KIRC (312)
- mRNA expression profiles
- DNAm expression profiles
- Clinical data:
  - tumor stage: i, ia, ib, ii, iia, iib, iii, iiia, iiib, iiic, iv, iva, ivb, ivc, x
  - age at diagnosis
  - race: 'white', 'black or african american', 'asian', 'american indian or alaska native'
  - gender

Step 2. Removal criteria
- Patients with
  - NA or 'not reported' clinical data
  - race 'american indian or alaska native'
  - tumor stage x
- mRNA and DNAm probes
  - with 0 variance across all included patients
  - not shared across all cancer types
  - with missing values

Step 3. Encode clinical variables and save datasets
- mRNA dataset: 2547 patients x 58,456 mRNAs
- DNAm dataset: 2547 patients x 232,088 DNAm
- clinic dataset: 2547 patients x 6 variables
  1. patient ID
  2. tumor stage: i, ia, ib (1); ii, iia, iib (2); iii, iiia, iiib, iiic (3); iv, iva, ivb, ivc (4)
  3. age at diagnosis
  4. race: asian (1), black or african american (2), white (3)
  5. gender: female (0), male (1)
  6. cancer type: BRCA (1), THCA (2), BLCA (3), LUSC (4), HNSC (5), KIRC (6)

Step 4. Pre-process the datasets
- mRNA dataset: 'TCGA_mRNAs_processed.csv'
  - Take the 2000 mRNAs with highest variance
  - Rescale every feature to [0,1]
  - Result: 2547 patients x 2000 mRNAs
- DNAm dataset: 'TCGA_DNAm_processed.csv'
  - Take the 2000 DNAm with highest variance
  - Rescale every feature to [0,1]
  - Result: 2547 patients x 2000 DNAm
- clinic dataset: 'TCGA_clinic.csv'

Step 5. Simulate confounders (instructions can be found in the Methods section of the manuscript)
- Linear confounder
  - 'TCGA_confounder_linear.csv' - linear confounding classes
  - 'TCGA_DNAm_confounded_linear.csv' - linearly confounded DNAm data
  - 'TCGA_mRNA2_confounded_linear.csv' - linearly confounded mRNA data
- Squared confounder
  - 'TCGA_confounder.csv' - squared confounding classes
  - 'TCGA_DNAm_confounded.csv' - squared confounded DNAm data
  - 'TCGA_mRNA2_confounded.csv' - squared confounded mRNA data
- Categorical confounder
  - 'TCGA_confounder_categ2.csv' - categorical confounding classes
  - 'TCGA_DNAm_confounded_categ2.csv' - categorically confounded DNAm data
  - 'TCGA_mRNA2_confounded_categ2.csv' - categorically confounded mRNA data
- Multiple confounders - combined effect (linear + squared + categorical)
  - 'TCGA_confounder_multi.csv' - confounding classes for combined effect
  - 'TCGA_DNAm_confounded_multi.csv' - DNAm data with combined effect
  - 'TCGA_mRNA2_confounded_multi.csv' - mRNA data with combined effect
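
The stage encoding of Step 3 and the feature selection and rescaling of Step 4 can be illustrated with a minimal Python sketch. It is not the authors' code. The function names `encode_stage` and `top_variance_minmax` are hypothetical, the optional "stage " prefix on labels is an assumption about the raw clinical values, and "rescale to [0,1]" is assumed to mean per-feature min-max scaling. Population variance is used for ranking; sample variance would give the same ordering.

```python
from statistics import pvariance

# Stage labels and their codes as listed in Step 3.
# Stage 'x' is absent because Step 2 removes those patients.
STAGE_CODES = {
    "i": 1, "ia": 1, "ib": 1,
    "ii": 2, "iia": 2, "iib": 2,
    "iii": 3, "iiia": 3, "iiib": 3, "iiic": 3,
    "iv": 4, "iva": 4, "ivb": 4, "ivc": 4,
}


def encode_stage(label):
    """Map a stage label such as 'iiia' or 'Stage IIIA' to a code 1-4.

    Accepting a leading 'stage ' prefix is an assumption about the raw data.
    """
    key = label.lower().replace("stage", "").strip()
    return STAGE_CODES[key]


def top_variance_minmax(rows, k):
    """Keep the k highest-variance columns, then min-max scale each to [0, 1].

    `rows` is a list of patients, each a list of numeric feature values.
    Returns (kept_column_indices, scaled_rows).
    """
    columns = list(zip(*rows))
    # Rank column indices by variance, highest first.
    ranked = sorted(range(len(columns)),
                    key=lambda j: pvariance(columns[j]),
                    reverse=True)
    kept = sorted(ranked[:k])  # restore the original column order

    scaled_columns = []
    for j in kept:
        lo, hi = min(columns[j]), max(columns[j])
        span = hi - lo
        if span == 0:
            # Constant feature; Step 2 should already have removed these.
            scaled_columns.append([0.0 for _ in columns[j]])
        else:
            scaled_columns.append([(v - lo) / span for v in columns[j]])

    # Transpose back to one list per patient.
    return kept, [list(r) for r in zip(*scaled_columns)]
```

Under these assumptions, calling `top_variance_minmax(rows, 2000)` on the 2547 x 58,456 mRNA matrix would yield a 2547 x 2000 matrix with every feature in [0,1], matching the shape reported for 'TCGA_mRNAs_processed.csv'. For matrices of that size a vectorised library would be the practical choice; the pure-Python version only makes the order of operations explicit.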
Created:
2024-01-19