Comparing mixed models and Random Forest association tests using naturalGWAS and a Striped Bass SNP dataset

Mendeley Data2024-04-12 更新2024-06-28 收录

下载链接：

https://datadryad.org/stash/dataset/doi:10.5061/dryad.zw3r22872

下载链接

链接失效反馈

官方服务：

资源简介：

We created phenotypes based on empirical genotypes and location using the r package naturalGWAS. We created phenotypes drawn from 2, 5, 15, or 30 causal variants (SNPs that contribute to the phenotype), at effect sizes of 1, 10, 100, 1000, and 10000. We also generated phenotypes with GxE interactions set to 0.2 and 0.8, where 0.2 represents low amounts of GxE interactions and 0.8 represents high amounts. Gene-by-environment (GxE) interactions refer to interactions between the environment and an individual’s genotype where certain genotypes show greater environmental effects and lower heritability for a trait than others, in contrast to genetic influences or environmental influences each working in isolation (Hunter, 2005; Rutherford, 2000). All simulations were run with 40 confounders taken from the first 40 principal components of the dataset’s genotype matrix. Simulations fell into 6 different 'datasets' and were analyzed using 6 different association tests as outlined in the associated manuscript. Genomic inflation was measured and corrected for using an R script. Power, false discovery rates, and number of false positives were also calculated using an R script. R scripts and shell scripts are included in this Dryad repository and organized as follows: CFAll_m7miss70maf0-01_noRelMiss_LD0-8_HetDepFilt.recode.vcf - VCF containing the genotypes of samples included in 3-population datasets RangeLong_m7miss70maf0-01_AllFilters24Chroms_LD0-8.vcf - VCF containing the genotypes of samples included in 13-population datasets The zip files simulation_scripts.zip contains R scripts and shell scripts used in the above analyses are deposited for each of six 'datasets', as follows: Dataset Folder 3-population dataset, all samples and loci CF 3-population dataset, all samples and causal loci removed CF_noCausal 3-population dataset, admixed samples removed CF_noAdmixed 3-population datset, admixed samples and causal loci removed CF_noAdmixednoCausal 13-population dataset, all loci Range 13-population dataset, causal loci removed Range_noCausal Each folder contains the following files named as follows (with variations when dataset is included in the file name): Simulations.R - Calculates simulated phenotypes based on the provided genotype matrix. Code adapted and expanded from the naturalGWAS vignette found at https://github.com/bcm-uga/NaturalGWAS CATE.R - Implements confounder adjusted testing and estimation (CATE) using the R package cate v1.1. This script also creates the qvals files that subsequent scripts read in and append their results. MLM.R - Implements latent factor mixed model analysis (LFMM) using the R package lfmm v1.0 Oracle.R - Implements the 'ideal' mixed model analysis Oracle for comparison implemented in the R package naturalGWAS v0.1.0 URF.R - Implements Random orest using the R package randomforest v4.6-14 ZhaoRF.R - Inplements Random Forest using a method of correction outlined by Zhao et al. (2012). Linear regressions were implemented using code modified from Brieuc et al. 2018 (dryad doi: https://datadryad.org/stash/dataset/doi:10.5061/dryad.k55hh8f) gemma_README.txt - describes the pipeline for running command-line Gemma on a Linix platform. All associated scripts begin with the word 'gemma'. PowerCalc.R - Accepts files with the following columns: chromosome number, position on chromosome, p-values, and q-values. Given a list of causal loci, calculates power, false discovery rates, and number of false positives using sequential Bonferroni correction and FDR thresholds of 0.05 and 0.2. Reads in files according to rows in the file 'qvaluePowers_Template.csv' (provided) and records calculated values in each row. Also outputs graphs and heatplots of power, false discovery rates, and false positives. fst_calc_workshop.R - Calculates locus-specific FST of causal loci and kruskal-wallace tests comparing correlation of phenotype and location, extraLoci.R - creates a csv file that lists, for every causal locus, every locus within 30k basepairs of that locus. Loci are assumed to be named using the format 'chromosome#_position#'. The distance threshold can be easily modified. col_Names.txt - Text file listing all locus names, used to add column names to genotype matrices sample_names_X.csv - File containing sample names (SampleID) and their population group (Location). coords.tsv - File containing a longtitude and latitude for each individual sample. gen_map_maf0-01_HetDepth.tsv - File containing the chromosome position (#CHROM) and position within that chromosome (POS) of loci in the genotype matrix. The nullScripts folders contain shell scripts that implement the permutation of simulated phenotypes, destroying the association between phenotype and causal loci, training of new Random Forest models. This is repeated 1000 times per phenotype to create a null distribution of importance values, which are then exported for use in p-value and q-value calculation in the above Random Forest naturalGWAS scripts. In addition, the file simulationNames.txt contains a list of all unique simulation names, used in some R Scripts. Also included is the folder Combined_Graphs which contains the following files: CF_genomicControl.R - Calculates a lambda slope for all p-values, correct p-values, and then recalculates q-values based on the new values CFnoAdmix_genomicControl.R - Calculates a lambda slope for all p-values, correct p-values, and then recalculates q-values based on the new values CFnoAdmixCausal_genomicControl.R - Calculates a lambda slope for all p-values, correct p-values, and then recalculates q-values based on the new values CFnoCausal_genomicControl.R - Calculates a lambda slope for all p-values, correct p-values, and then recalculates q-values based on the new values Range_genomicControl.R - Calculates a lambda slope for all p-values, correct p-values, and then recalculates q-values based on the new values RangenoCausal_genomicControl.R - Calculates a lambda slope for all p-values, correct p-values, and then recalculates q-values based on the new values PowerCalc.R - Accepts files with the following columns: chromosome number, position, p-values, and q-values. Given a list of causal loci, calculates power, false discovery rates, and number of false positives using sequential Bonferroni correction and FDR thresholds of 0.05 and 0.2. Reads in files according to rows in the file 'qvaluePowers_Template.csv' (provided) and records calculated values in each row. Also outputs graphs and heatplots of power, false discovery rates, and false positives. figures.R uses a csv file containing power, FDR, and false positive data of simulations to get summary statistics and graphs. qvaluePowers_Template.csv - Copy of the template file used with PowerCalc.R. Column names are as follows: rowname = unique name for each row, Sim = unique name for each set of simulated phenotypes, Test = association test used to produce this row's results, Causal = number of causal loci for the simulated phenotypes, GxE = genotype x environment interaction value where 2 = 0.2 and 8 = 0.8, Effect = effect size used for simulated phenotypes, FP = number of false positives, Power = proportion of causal loci found, FDR = false discovery rate, Corr = method of multiple test correction used when calculating power and false positives (SB = sequential bonferroni, FDR0.05 = false discovery rate 0.5, FDR0.2 = false discovery rate 0.2), Replicate = letter used to identify replicates of causal-GxE-effect parameter value combinations (a-e). qvals_c15ge2e10000b.csv - Example 'qval' file used in several scripts. Column names are as follows: x = row names exported from R, chrom = chromosome on which the locus is located, pos = position on the chromosome in basepairs, cate_pval = p-values calculated by the association test CATE while measuring association of that row's locus to the simulated phenotype, cate_qval = q-values calculated by the association test CATE while measuring association of that row's locus to the simulated phenotype, mlm_pval = p-values calculated by the association test LFMM while measuring association of that row's locus to the simulated phenotype, mlm_qval = q-values calculated by the association test LFMM while measuring association of that row's locus to the simulated phenotype.

创建时间：

2023-06-28

5,000+

优质数据集

54 个

任务类型

进入经典数据集