Whole genomes reveal evolutionary relationships and mechanisms underlying gene-tree discordance in Neodiprion sawflies
Source: NIAID Data Ecosystem. Indexed 2026-05-02.
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.bg79cnpf7
Description:
Rapidly evolving taxa are excellent models for understanding the mechanisms that give rise to biodiversity. However, developing an accurate historical framework for comparative analysis of such lineages remains a challenge due to ubiquitous incomplete lineage sorting and introgression. Here, we use a whole-genome alignment, multiple locus-sampling strategies, and summary-tree and SNP-based species-tree methods to infer a species tree for eastern North American Neodiprion species, a clade of pine-feeding sawflies (Order: Hymenoptera; Family: Diprionidae). We recovered a well-supported species tree that, except for three uncertain relationships, was robust to different strategies for analyzing whole-genome data. Nevertheless, underlying gene-tree discordance was high. To understand this genealogical variation, we used multiple linear regression to model site concordance factors estimated in 50-kb windows as a function of several genomic predictor variables. We found that site concordance factors tended to be higher in regions of the genome with more parsimony-informative sites, fewer singletons, less missing data, lower GC content, more genes, lower recombination rates, and lower D-statistics (less introgression). Together, these results suggest that incomplete lineage sorting, introgression, and genotyping error all shape the genomic landscape of gene-tree discordance in Neodiprion. More generally, our findings demonstrate how combining phylogenomic analysis with knowledge of local genomic features can reveal mechanisms that produce topological heterogeneity across genomes.
Methods
DNA was extracted from field-caught larvae. Then, Illumina libraries were prepared and sequenced on an Illumina NextSeq 500 with PE150 reads, which produced 14-27 million reads per individual.
To obtain a multi-genome alignment, we used a pseudo-reference-based approach, with an annotated, reference-quality N. lecontei genome (iyNeoLeco1.1 RefSeq GCF_021901455.1) serving as the reference. Briefly, we first used bowtie2 v2.4.1 to map reads from each species to the N. lecontei reference genome. To allow for divergence between reads and the N. lecontei reference, we initially allowed one mismatch in the seed and used the “local” mapping option in bowtie2. New variants (excluding indels) were incorporated using samtools v1.10 and bcftools v1.10.2. In a second round of mapping, this process was repeated using the first iteration of the genome for each species as the new reference genome. The third round of mapping removed the seed mismatch. The fourth and fifth iterations required end-to-end mapping. After the fifth iteration, we used a custom script to replace with an “N” any nucleotide that had a read depth of less than 4 or an excessively high mapping depth (highest 1% of depths for each species). All bioinformatics commands and scripts can be found on the LinnenLab GitHub page under the Herrig_etal_NeodiprionPhylogeny repository (https://github.com/LinnenLab/Herrig_etal_NeodiprionPhylogeny) and on Zenodo. This approach produced FASTA files for each species, with genome sequences in N. lecontei coordinates. All FASTA files are provided here.
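The depth-masking step can be illustrated with a minimal sketch. This is not the authors' custom script (that is in the GitHub repository); the function name mask_by_depth, its parameters, and the way the "highest 1%" cutoff is computed are assumptions made for illustration. Here the high cutoff is the depth just below the top 1% of sites, and only sites strictly above it are masked, so ties at the cutoff are kept.

```python
def mask_by_depth(sequence, depths, min_depth=4, top_fraction=0.01):
    """Replace low-depth and extreme-high-depth sites with 'N'.

    Hypothetical helper: `sequence` is a string and `depths` holds one
    integer read depth per site. The cutoff rule is an assumption.
    """
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    if not depths:
        return sequence

    ranked = sorted(depths)
    n_sites = len(ranked)
    # Number of sites that make up the assumed "highest 1%" of depths.
    n_top = int(n_sites * top_fraction)
    # Depth of the site just below that top group; strictly deeper sites are masked.
    high_cutoff = ranked[max(0, n_sites - n_top - 1)]

    masked = []
    for base, depth in zip(sequence, depths):
        if depth < min_depth or depth > high_cutoff:
            masked.append("N")
        else:
            masked.append(base)
    return "".join(masked)
```

For example, with 100 sites at depth 10 except one at depth 2 and one at depth 500, only those two sites would be replaced by "N" under these assumptions.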
We next used the FASTA files to produce additional datasets for analysis. First, we used bedtools v2.30.0 to divide the seven Neodiprion chromosomes into non-overlapping windows of 50 kb. Second, to approximate a dataset of protein-coding genes analogous to an RNAseq or exon-capture phylogenomic dataset, we used gffread v0.11.7 with the -w flag to write FASTA files with spliced exons for each transcript for each species using the NCBI Neodiprion lecontei Annotation Release (iyNeoLeco1.1 RefSeq GCF_021901455.1). Windowed and gene datasets are provided as nexus files that contain individual window/gene alignments. These can also be regenerated using scripts (available on GitHub at https://github.com/LinnenLab/Herrig_etal_NeodiprionPhylogeny and on Zenodo) that cut the genome into the desired loci (windows or genes) and convert these to nexus format. Third, we called single nucleotide polymorphisms (SNPs) across the entire genome using SNP-sites v2.5.1. We then filtered the data to exclude SNPs that were absent in more than 10% of species and sites with more than two alleles. In addition to analyzing all SNPs (which likely include tightly linked sites), we produced additional datasets with one SNP sampled every 1 kb, 5 kb, 10 kb, 50 kb, or 100 kb using SNP-sites; the more sparsely sampled datasets are on par with what might be generated via RADseq. We transformed each of the six datasets into nexus format. All six SNP nexus files are provided and can be used as input for SVDquartets.
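The windowing, SNP filtering, and SNP thinning described above can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline itself, which used bedtools and SNP-sites. The function names are hypothetical, any character other than A, C, G, or T is assumed to count as missing, and thinning is assumed to keep the first SNP in each consecutive block of the chosen spacing.

```python
VALID_BASES = set("ACGT")


def make_windows(chrom_length, window_size=50_000):
    """Return non-overlapping half-open (start, end) windows.

    The final window is shorter when the chromosome length is not a
    multiple of window_size.
    """
    return [
        (start, min(start + window_size, chrom_length))
        for start in range(0, chrom_length, window_size)
    ]


def keep_snp(column, max_missing=0.10):
    """Decide whether one alignment column (one base per species) is retained.

    Retained columns have at most `max_missing` missing calls and exactly
    two alleles. Non-ACGT characters are treated as missing (assumption).
    """
    calls = [base.upper() for base in column]
    n_missing = sum(base not in VALID_BASES for base in calls)
    alleles = {base for base in calls if base in VALID_BASES}
    return n_missing / len(calls) <= max_missing and len(alleles) == 2


def thin_snps(positions, spacing):
    """Keep the first SNP position in each consecutive block of `spacing` bp."""
    kept = []
    seen_blocks = set()
    for pos in sorted(positions):
        block = pos // spacing
        if block not in seen_blocks:
            seen_blocks.add(block)
            kept.append(pos)
    return kept
```

Calling thin_snps with a spacing of 1000, 5000, 10000, 50000, or 100000 would give progressively sparser position lists, analogous to the five thinned datasets.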
To investigate sources of phylogenetic discordance, we also generated estimates of site concordance factors in 50-kb windows and estimated seven genomic predictor variables for these same 50-kb windows: number of parsimony-informative sites, number of singletons, proportion of missing data, GC content, D-statistics, gene density, and recombination rate. This dataset is available as a csv file.
Created:
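Four of these predictors can be computed directly from a window alignment, as in the sketch below. It is illustrative only and may not match how the values in the csv file were produced. The function name window_summary is hypothetical, and the definitions are assumptions: a parsimony-informative site has at least two bases that each occur in at least two sequences, a singleton site is any other variable site, and every non-ACGT character is treated as missing.

```python
from collections import Counter

BASES = set("ACGT")


def window_summary(alignment):
    """Summarize one window alignment (list of equal-length sequences)."""
    seqs = [seq.upper() for seq in alignment]
    if not seqs:
        raise ValueError("alignment is empty")
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same length")

    informative = 0
    singletons = 0
    for column in zip(*seqs):
        counts = Counter(base for base in column if base in BASES)
        if len(counts) < 2:
            continue  # constant site, or no called bases at all
        # Bases present in at least two sequences.
        shared = sum(1 for n in counts.values() if n >= 2)
        if shared >= 2:
            informative += 1
        else:
            singletons += 1

    all_chars = "".join(seqs)
    total = len(all_chars)
    called = sum(all_chars.count(base) for base in "ACGT")
    gc = all_chars.count("G") + all_chars.count("C")
    return {
        "parsimony_informative_sites": informative,
        "singleton_sites": singletons,
        "proportion_missing": (total - called) / total if total else float("nan"),
        "gc_content": gc / called if called else float("nan"),
    }
```

Applying a function like this to each 50-kb window alignment would give one row of predictors per window, which could then be joined to the per-window site concordance factors for regression.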
2024-07-08



