
The genetic architecture of repeated local adaptation to climate in distantly related plants

NIAID Data Ecosystem, indexed 2026-05-02
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.15dv41p57
Resource description:
Closely related species often use the same genes to adapt to similar environments. However, we know little about why such genes possess increased adaptive potential, and whether this is conserved across deeper evolutionary lineages. Adaptation to climate presents a natural laboratory to test these ideas, as even distantly related species must contend with similar stresses. Here, we re-analyse genomic data from thousands of individuals from 25 plant species as diverged as lodgepole pine and Arabidopsis (~300 My). We test for genetic repeatability based on within-species associations between allele frequencies in genes and variation in 21 climate variables. Our results provide statistically significant evidence for genetic repeatability across deep time beyond what is expected by chance, identifying a suite of 108 gene families (orthogroups) and gene functions that repeatedly drive local adaptation to climate. This set includes many orthogroups with well-known functions in abiotic stress response. Using gene co-expression networks to quantify pleiotropy, we find that orthogroups with stronger evidence for repeatability exhibit greater network centrality and broader expression across tissues (i.e. higher pleiotropy), contrary to the “cost of complexity” theory. These gene families may be important in helping wild and crop species cope with future climate change, and they represent important candidates for future study.

Methods

Dataset selection criteria

We selected 29 datasets covering 25 species from 21 studies (see README) for which sequence data were available as whole genome shotgun data (WGS), pooled sequencing data (POOL), or sequence capture data (CAPTURE) (Table S1). Such datasets provide broad genomic representation and maximise the number of genes resolved for each species. We limited our study to datasets with at least five populations distributed along latitudinal, longitudinal, and/or altitudinal gradients.
The minimum number of populations was set to ensure a baseline of statistical power for Kendall’s τ correlations between environment and allele frequency (see below). The list of included datasets is not exhaustive; rather, it was chosen to reach approximately 20-30 species, a number for which power to detect repeated adaptation has been demonstrated for the methods used here. Datasets also had to provide locality information, either for individual samples or for groups of individuals. Species were selected to include both gymnosperms and angiosperms, which places the most recent common ancestor (MRCA) of all sampled species at approximately 300 million years ago.

SNP-calling

We used two SNP-calling pipelines depending on whether sequencing data came from individuals (WGS and CAPTURE) or from pools of individuals (POOL). These pipelines were necessarily different but were based on similar approaches to reduce bioinformatic discrepancies between the data types. For all data types, raw fastq data were retrieved from either the NCBI Sequence Read Archive (SRA) or the EBI European Nucleotide Archive (ENA). Accession codes for all data are provided in Table S1. The reference genome used for each species, or for a closely related species where no genome was available, is also listed in Table S1. SNP pipelines were designed to balance computational time against low false-positive rates.

The pipeline for individual-based data was as follows; selfing and outcrossing species were processed using the same pipeline. Raw fastq files were cleaned and trimmed for adapter sequences using fastp (v0.20.1) before being aligned against the reference genome using bwa-mem (v0.7.17-r1188). BAM files were generated, sorted, and indexed using samtools (v1.16.1), skipping alignments with MAPQ < 10 (-q 10).
We then collected quality metrics with Picard Tools (v2.26.3) based on alignment summary (CollectAlignmentSummaryMetrics), insert size (CollectInsertSizeMetrics), and coverage (CollectWgsMetricsWithNonZeroCoverage). We marked and removed duplicates using Picard’s MarkDuplicates and used AddOrReplaceReadGroups to amend read groups. Some datasets split the sequencing data for an individual sample across multiple technical replicates, so we merged BAM files within samples with Picard’s MergeSamFiles. We then realigned the cleaned, merged BAM files with RealignerTargetCreator and IndelRealigner from the Genome Analysis Toolkit (GATK v3.8) and repeated the aforementioned quality metrics on the final BAM files.

To identify single-nucleotide polymorphisms (SNPs), we generated genotype likelihoods using BCFtools’ mpileup, specifying a minimum mapping quality of 5 for an alignment to be used and retaining additional annotation information such as allelic depth (AD). Individual pileups were then converted into SNP VCFs using BCFtools’ call program. We set sample ploidy (-S) to match each species’ known ploidy and requested that genotype quality (GQ) be reported, while excluding any sample grouping information (-G -).

Finally, we filtered raw VCF files with VCFtools by removing sites with a quality value below 30 (--minQ 30), genotypes with a genotype quality below 20 (--minGQ 20), and genotypes with a read depth below 5 (--minDP 5), before retaining only biallelic sites (--max-alleles 2) with genotypes present in more than 70% of individuals (--max-missing 0.7). For downstream analyses, we performed additional filtering on the basis of minor allele frequency (maf) and minor allele count (mac), retaining only sites with maf > 0.05 and mac > 5, whichever was more stringent.

Pooled data were processed using a similarly structured workflow.
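The combined maf and mac filter for individual-based data can be illustrated with a minimal sketch. It assumes that "whichever was more stringent" means both thresholds are applied together, so the stricter one is binding for a given sample size. The function name `passes_maf_mac` and the example counts are hypothetical and are not taken from the dataset or its pipeline.

```python
def passes_maf_mac(alt_count, total_alleles, min_maf=0.05, min_mac=5):
    """Return True if a biallelic site passes both the maf and mac thresholds.

    alt_count     -- number of called copies of the alternate allele
    total_alleles -- number of called allele copies at the site
                     (e.g. 2 x number of genotyped diploid individuals)
    Assumption: both thresholds are strict (>), as written in the text.
    """
    if total_alleles <= 0 or not 0 <= alt_count <= total_alleles:
        return False
    # The minor allele is whichever of the two alleles is rarer.
    minor_count = min(alt_count, total_alleles - alt_count)
    maf = minor_count / total_alleles
    return maf > min_maf and minor_count > min_mac


# Hypothetical small dataset: 40 diploid individuals (80 allele copies).
# Here the mac threshold is the binding one.
small_fail = passes_maf_mac(5, 80)    # maf = 0.0625 passes, mac = 5 does not
small_pass = passes_maf_mac(6, 80)

# Hypothetical large dataset: 200 diploid individuals (400 allele copies).
# Here the maf threshold is the binding one.
large_fail = passes_maf_mac(15, 400)  # mac = 15 passes, maf = 0.0375 does not
large_pass = passes_maf_mac(21, 400)
```

Under this reading, the mac threshold protects small datasets from sites supported by only a handful of allele copies, while the maf threshold dominates once the number of sampled alleles exceeds roughly 100.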
For pooled data, raw fastq files were cleaned and trimmed with fastp and aligned to references with bwa-mem, using the additional flag to mark shorter split hits as secondary (-M). BAM files were generated, sorted, and indexed using samtools, skipping alignments with MAPQ < 20 (-q 20), and BED files were generated from indexed BAM files using BEDtools (v2.27.1). Duplicates were then marked and removed with MarkDuplicates before indel realignment was performed with GATK’s RealignerTargetCreator and IndelRealigner.

SNP-calling was then performed using mpileup followed by VarScan’s (v2.4.2) mpileup2cns program. Variants were called on the basis of a minimum read depth of 8 (--min-coverage 8), a p-value threshold of 0.05 (--p-value 0.05), a minimum frequency for calling homozygotes of 80% (--min-freq-for-hom 0.8), ignoring variants with >90% support on one strand (--strand-filter 1), a minimum base quality at a position of 20 (--min-avg-qual 20), and a minimum variant allele frequency of 0 (--min-var-freq 0).

To extract allele frequencies, pooled SNPs were converted to SNP tables using GATK (v4.1.0.0) VariantsToTable, splitting multi-allelic sites across multiple rows (--split-multi-allelic) and extracting the AF field. Final SNP tables for POOL data were filtered to remove indels, retain only biallelic sites, and apply a minor allele frequency cutoff of 0.05.
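The population allele frequencies produced by these pipelines are what enter the Kendall’s τ correlations with climate mentioned under the dataset selection criteria. The sketch below shows one way to see why a minimum of five populations gives a baseline of power: without ties, the smallest exact two-sided p-value attainable with n populations is 2/n!, which falls below 0.05 only from n = 5 onwards. This is an illustration under stated assumptions and not the authors' code: it uses the tau-a variant (the source does not say which variant was used), and the function names and the temperature and allele-frequency values are hypothetical.

```python
from itertools import combinations
from math import factorial


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def min_two_sided_p(n_populations):
    """Smallest exact two-sided p-value with n untied populations.

    Under the null all n! orderings are equally likely, and exactly one
    ordering gives tau = +1 and one gives tau = -1, hence 2 / n!.
    """
    return 2 / factorial(n_populations)


# Hypothetical example: one SNP scored in five populations.
mean_annual_temp = [2.1, 4.5, 6.0, 8.3, 10.2]   # climate variable per population
allele_freq = [0.10, 0.22, 0.18, 0.41, 0.55]    # allele frequency per population

tau = kendall_tau_a(mean_annual_temp, allele_freq)
p_floor_4 = min_two_sided_p(4)  # 2/24, about 0.083: cannot reach 0.05
p_floor_5 = min_two_sided_p(5)  # 2/120, about 0.017
```

In this example nine of the ten population pairs are concordant and one is discordant, so τ is 0.8. With four populations even a perfectly monotonic cline could not reach p < 0.05 in an exact test, whereas five populations allow it.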
Created:
2024-07-24