
Trans-ancestral rare variant association study with machine learning-based phenotyping for metabolic dysfunction-associated steatotic liver disease

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14804238
Description:
This repository contains the following files:
- Raw and processed summary statistics (see explanation below)
- Ensembl VEP annotations for single variant analyses
- PheWAS results for significant variants and genes
- PLINK and regenie scripts used to test genetic associations
- Jupyter notebook used to generate predicted phenotypes in the UK Biobank

Phenotype abbreviations in file names:
- True phenotypes: pdff (true PDFF), masld (true MASLD), and masld_nomri (true MASLD excluding participants with true PDFF in the UK Biobank)
- Predicted phenotypes: pred (predicted PDFF) and mp (predicted MASLD)

Please note that:
- Across files, the column ID is in the format chromosome:position:reference allele:effect allele. Effect sizes refer to the effect allele.
- For GWAS summary statistics only, positions are in GRCh37; for all other summary statistics, positions are in GRCh38. For the Supplementary Tables reporting GWAS results, we lifted GRCh37 positions to GRCh38 using triple-liftOver.
- All association analyses behind these summary statistics included age, sex, alcohol consumption in g/week, body mass index, and 10 principal components of ancestry as covariates.
- For true MASLD/PDFF analyses, we used a minor allele count (MAC) threshold of 5 in each individual dataset, but required a MAC of 10 for the meta-analysis. For predicted MASLD/PDFF analyses, we used a MAC of 10 because we only analyzed UK Biobank data.
- Due to All of Us policies, we cannot release detailed summary statistics for variants or genes tested in All of Us that include a participant count ≤ 20. We removed these variants and genes from the summary statistics here, including 16:66400873:G:T (CDH5) and 19:7629499:G:T (XAB2), which are reported in our manuscript. For those variants, please refer to the relevant Supplementary Tables, where we provide masked statistics.

We provide summary statistics in two formats:

Raw summary statistics (.tsv and .TBL files).
- These files are not filtered for MAC, and effect/reference alleles may differ between datasets. In particular, for METAL outputs, Allele1 and Allele2 are not always consistent.
- For meta-analyses of true phenotypes, we performed both sample size/direction-of-effect meta-analyses (which output Z scores) and standard error meta-analyses (which output betas and standard errors). We used the former to calculate p-values (because PDFF is continuous and MASLD is binary) and the latter to estimate meta-analyzed effect sizes for forest plots. This is discussed in greater detail in the Supplemental Methods.
- For gene-level testing, we used the most significant p-value across all tests from the UK Biobank and took the beta and standard error from the standard gene burden test (TEST == ADD). Note that regenie leaves beta and standard error blank for non-burden tests.

Processed summary statistics (.parquet files).
- These files are filtered for MAC ≥ 10, and alleles have been flipped where applicable. Ensembl VEP annotations are included.
- For single variant results, there are three columns called Type, Known, and Rare_QC. To select rare coding variants that passed QC, filter the file for Rare_QC == 1 and Type == Missense or PTV. Known == 1 indicates that the variant lies in a known MASLD-associated gene; such a variant may or may not be rare or coding.
- For true phenotype analyses, these files contain both sample size/direction-of-effect meta-analysis results (columns N, Z score, p, Het I^2, Het p) and results for individual datasets (suffixes: _1 = UK Biobank PDFF, _2 = UK Biobank MASLD, _3 = All of Us, _4 = Regeneron, _5 = Sema4).
- For predicted phenotype analyses, these files contain columns called True_Sig, Pheno_Sig, p_meta, p_min, and p_min_cohort. We used True_Sig and Pheno_Sig for post-hoc filtering of predicted phenotype associations. True_Sig indicates an at least nominally significant association with a true phenotype, while Pheno_Sig = 1 indicates consistent liver enzyme/metabolic dysfunction marker associations. p_meta is the p-value from the true phenotype meta-analyses, p_min is the minimum true phenotype p-value across datasets, and p_min_cohort is the dataset with the minimum p-value.
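
The ID format and the rare coding variant filter described above can be sketched in Python. This is a minimal illustration, not code from the repository: the helper names parse_variant_id and is_rare_coding_qc are hypothetical, the example rows are made up, and it assumes that Rare_QC is stored as the number 1 and that Type holds the literal strings "Missense" and "PTV" (the value "Synonymous" is used only as a placeholder for any other type).

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    effect: str


def parse_variant_id(variant_id: str) -> Variant:
    """Split an ID of the form chromosome:position:reference allele:effect allele."""
    chrom, pos, ref, effect = variant_id.split(":")
    return Variant(chrom, int(pos), ref, effect)


def is_rare_coding_qc(row: dict) -> bool:
    """True when Rare_QC == 1 and Type is Missense or PTV.

    Assumption: Rare_QC is numeric and Type is a plain string column.
    """
    return row.get("Rare_QC") == 1 and row.get("Type") in {"Missense", "PTV"}


# Hypothetical rows for illustration only; these are NOT values from the dataset.
rows = [
    {"ID": "1:1000:A:G", "Type": "Missense", "Known": 0, "Rare_QC": 1},
    {"ID": "2:2000:C:T", "Type": "Synonymous", "Known": 1, "Rare_QC": 1},
    {"ID": "3:3000:G:A", "Type": "PTV", "Known": 0, "Rare_QC": 0},
]

# Keep only rare coding variants that passed QC, then parse their IDs.
selected = [parse_variant_id(r["ID"]) for r in rows if is_rare_coding_qc(r)]
print(selected)
```

With the real .parquet files, rows in this shape could be obtained with, for example, pandas.read_parquet(path).to_dict("records"), provided pandas and a parquet engine are installed; the actual encoding of the Type and Rare_QC columns should be checked against the files before relying on this filter.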
Created:
2025-03-01