A phylogenomic analysis of Genipa (Rubiaceae) using target sequence capture data

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.7wm37pw0g

Resource description:

The genus Genipa is a widespread, lowland, Neotropical lineage of trees in the coffee family, Rubiaceae. There is long-standing disagreement on the delimitation of species in the genus and how broadly Genipa is circumscribed. Here, we use genomic data to resolve the classification within Genipa. Using target sequence capture, we generated a high-resolution 245-locus dataset to produce a comprehensive species phylogeny under the multi-species coalescent model. The phylogenomic results strongly support Genipa spruceana, often synonymised with Genipa americana, as a distinct monophyletic species. Similarly, the monophyly of Genipa infundibuliformis, a recently recognized species, is also strongly supported. The phylogeny also shows three distinct, well-supported clades within the widespread species Genipa americana. These clades are interpreted as three independently evolving lineages, in contrast to the two varieties most commonly recognized in G. americana based on previous morphological studies.

Methods

Methodology

Total genomic DNA was extracted using the NucleoSpin Plant II Kit (Macherey-Nagel, Düren, Germany) or the DNeasy Plant Mini Kit (Qiagen, Hilden, Germany). The protocol followed the manufacturer’s instructions apart from the cell lysis time, which was increased to overnight to maximise DNA yield. DNA quality was assessed using a NanoDrop 2000 spectrophotometer, and DNA was quantified using the Qubit 2.0. The NanoDrop 2000 and Qubit 2.0 results were used to identify samples that needed concentrating by vacuum centrifugation. Gel electrophoresis was also carried out to assess DNA fragment size. When initial DNA quantity was low, multiple extraction rounds were pooled as necessary to meet the minimum concentration requirements of Rapid Genomics (Florida, USA), which performed target capture library preparation and sequencing. The DNA was mechanically sheared to a size of 200–500 base pairs (bp).
Illumina libraries were constructed, and barcode adapters for the Illumina sequencing platform were ligated to the libraries, which were then PCR-amplified using standard cycling protocols. Samples were pooled into 16 barcoded libraries in equimolar amounts, to a total of 500 ng for hybridization. Target enrichment was performed using the Angiosperms353 bait set (Johnson et al. 2019), which targets 353 putatively orthologous genes. After enrichment, samples were re-amplified for an additional 6–12 PCR cycles and sequenced on an Illumina NovaSeq 6000 with paired-end 250 bp reads. The Illumina raw read data was processed using the bioinformatic pipeline SECAPR 2.2.5 (Andermann et al. 2018). The pipeline was run on the Sigma2 High-Performance Computing cluster at NTNU, Norway. Raw sequence data was quality checked using FastQC (Andrews 2010) and MultiQC (Ewels et al. 2016) to gain an overview of sequence quality and determine cleaning parameters. Illumina adapters were removed and sequences were cleaned using fastp 0.23 (Chen et al. 2018). The fastp default settings implemented in SECAPR were: i) the read was cut if the accuracy between adapter and read Phred quality score was below 20; ii) the maximum percentage of low-quality nucleotides allowed was 40, and reads with a higher percentage of unqualified (low-quality) nucleotides were discarded; iii) the sliding window for quality trimming was 5 nucleotides; iv) reads were trimmed from the front and tail if the quality value was lower than 10; v) reads below a complexity threshold of 10 were removed; vi) poly repeats of length 7 at the end of a read were trimmed; vii) low-complexity filtering was enabled; and viii) length filtering was disabled. The quality of the cleaned reads was checked using FastQC, MultiQC and the plotting function in SECAPR. De novo contig assembly was performed on the cleaned reads using SPAdes 3.15.2 (Bankevich et al. 2012). Overlapping sequences were combined into contig sequences using k-mer values of 21, 33, 55, 77, 99, and 127.
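Setting ii) of the read-cleaning step above can be illustrated with a short Python sketch. It is not the fastp or SECAPR implementation. It assumes Phred+33 quality encoding and assumes that a base counts as unqualified when its Phred score is below 20; the function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_unqualified_filter(quality_string, min_phred=20,
                              max_unqualified_percent=40):
    """Return True if the read is kept.

    A read is discarded when more than `max_unqualified_percent` of its
    bases have a Phred score below `min_phred` (assumed threshold).
    """
    scores = phred_scores(quality_string)
    if not scores:
        return False  # empty reads are dropped in this sketch
    unqualified = sum(1 for q in scores if q < min_phred)
    return 100.0 * unqualified / len(scores) <= max_unqualified_percent
```

In this encoding the character `I` is Phred 40 and `#` is Phred 2, so a read with quality string `IIIIII####` has exactly 40% unqualified bases and is kept, while `IIIII#####` has 50% and is discarded.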
The minimum contig length was set to 200 bp; contigs under this threshold were discarded. Contigs belonging to target loci were identified by using BLASTn (Camacho et al. 2009) to match the contig sequences with a set of reference sequences for each locus. The reference sequences used were those of Gardenia philastrei Pierre ex Pit. (Davis, A.P. 4055, K) from the Royal Botanic Gardens, Kew PAFTOL project (Baker et al. 2022). A sequence match was identified if the sequence matched with at least 80% identity across at least 80% of the contig length. Loci with multiple contig matches were discarded as they may represent paralogous sequences. A multiple sequence alignment (MSA) was created from the contig data using MAFFT 7.490 (Katoh et al. 2019) for each locus that was recovered across at least three samples, with the addition of the “no trim” parameter to keep the full contig sequence length. In the next step, reference-based mapping was performed using the consensus sequence of each locus' MSA as a genus-specific reference library. This additional reference assembly generally leads to a more efficient and less biased retrieval of DNA reads across all samples for each locus (Andermann et al. 2018), as opposed to using the recovered contig sequences for each sample. The minimum coverage parameter was set at four reads. Consensus sequences were generated from the reads mapping to the genus-specific reference at each locus for each sample, and from these consensus sequences multiple sequence alignments were computed for each locus using MAFFT 7.490 (Katoh et al. 2019).

Phylogenetic Analysis

Two different phylogenetic methods were used. The first was ASTRAL-III (Zhang et al. 2018), which produces a species tree that shares the maximum number of quartet topologies with the input gene trees. The input gene trees were generated in IQ-TREE 2 (Minh et al. 2020).
A set of bootstrap consensus maximum-likelihood gene trees was created using 1000 bootstrap replicates with UFBoot2 (Hoang et al. 2018) and automatic substitution model selection with ModelFinder (Kalyaanamoorthy et al. 2017), both implemented in the IQ-TREE 2 software package. The tree was visualised using FigTree v.1.4.3 (Rambaut 2017). The second species phylogeny was produced using Bayesian inference with Species Tree And Classification Estimation, Yarely (STACEY; Jones 2017) in BEAST2 (Bouckaert et al. 2019) on the CIPRES Science Gateway web portal (Miller et al. 2012). This method simultaneously estimates gene trees and species trees using a birth-death-collapse model. The input data was a subset of six loci from the de novo contig assembly dataset. The subset was selected numerically: the first six loci in the de novo assembly dataset were used (5, 9, 20, 43, 55, and 62), except for locus 59, which was excluded from the analysis because it was present in only seven of the 29 samples. The XML input was generated in BEAUti 2.6 (Bouckaert et al. 2019). The samples were not preassigned to species and no partitions were selected. The following parameters and priors were selected: species tree model collapse height: 1e-5; strict clock model: each locus was set as relative to each other; JC69 substitution model; bdcGrowthRate: lognormal (M=5, S=2); collapseWeight: beta (alpha=2, beta=2); population prior: lognormal (M=-7, S=2); relativeDeathRate: beta (alpha=1, beta=1). The MCMC was run for 100 million generations, and Tracer v1.7.1 (Rambaut et al. 2018) was used to explore convergence of parameters. The species tree was generated using TreeAnnotator 2.6.3 (Drummond and Rambaut 2007) after discarding 10% as burn-in, and then visualised using FigTree v.1.4.3 (Rambaut 2017).
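The contig-to-locus matching rule described in the methods (a match requires at least 80% identity across at least 80% of the contig length, and loci with multiple contig matches are discarded as potential paralogs) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SECAPR implementation: the `Hit` record, its field names, and the function name are hypothetical, and coverage is assumed to be alignment length divided by contig length.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLASTn hit of a contig against a locus reference (hypothetical record)."""
    contig: str
    locus: str
    percent_identity: float
    alignment_length: int
    contig_length: int


def assign_contigs_to_loci(hits, min_identity=80.0, min_coverage=80.0):
    """Map each locus to its single matching contig.

    A hit counts as a match when identity and contig coverage both reach
    their thresholds. Loci matched by more than one contig are dropped,
    since they may represent paralogous sequences.
    """
    contigs_per_locus = defaultdict(set)
    for hit in hits:
        # Assumed definition: share of the contig covered by the alignment.
        coverage = 100.0 * hit.alignment_length / hit.contig_length
        if hit.percent_identity >= min_identity and coverage >= min_coverage:
            contigs_per_locus[hit.locus].add(hit.contig)
    return {
        locus: next(iter(contigs))
        for locus, contigs in contigs_per_locus.items()
        if len(contigs) == 1
    }
```

Given one passing hit for a locus, that locus is retained with its contig; a locus hit by two different passing contigs is removed entirely, and hits that fail either threshold are ignored.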

Created:

2024-12-24
