Phylogenomic insights into the relationship and the evolutionary history of planthoppers (Insecta: Hemiptera: Fulgoromorpha)
收藏NIAID Data Ecosystem2026-05-02 收录
下载链接:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.3ffbg79sc
下载链接
链接失效反馈官方服务:
资源简介:
Planthoppers (Hemiptera: Fulgoromorpha) are a species-rich and globally distributed insect clade with high economic, ecological, and evolutionary importance. However, the relationships among planthopper lineages and families remain unclear. Previous efforts based on inconsistent morphological traits, a few genes, or limited sampling often resulted in conflicting tree topologies. Here, we used genome-level data to assemble 1164 nuclear single-copy genes and 13 mitochondrial protein-coding genes for 149 planthopper species representing 19 out of 21 extant families. Additional markers were added from published mitogenomes, expanding our sampling to 285 species. These markers were used for Maximum Likelihood-based tree inference and dating analyses. The newly inferred phylogenies validated well-accepted relationships and recovered novel placements for several taxa, including the family Achilixiidae and species from the tribe Lyncidini and genus Madagascaritia in Dictyopharidae and Fulgoridae. Based on molecular and morphological evidence, we proposed taxonomic changes including the establishment of a new family Borysthenidae stat. rev. within Delphacoidea and a new superfamily Meenoploidea superfam. nov. with redefined Kinnaridae stat. rev. and Meenoplidae stat. rev.. The time analyses based on 57 nuclear markers and 30 fossils dated the origin of extant Fulgoromorpha back to Guadalupian, Permian (~263 Ma), close to the maximum constraint at 267.3 Ma, while applying an older root constraint resulted in an origin in Mississippian, Carboniferous (~332 Ma). While future sampling of unstudied fauna in unexplored regions or habitats may change the topology, the current phylogenomic analysis will serve as a solid foundation for research into planthopper ecology, evolution, and significance.
Methods
Sample processing and metagenomic sequencing
Individuals of 151 planthopper species from 19 families were sampled ethically following all applicable international and local permitting requirements from their natural habitats around the world (Table S1). The remaining two established families, Gengidae and Hypochthonellidae, known only from limited specimens collected in southern Africa and, to our knowledge, with no molecular data published, are not included due to sampling difficulties. For all individuals, we extracted DNA from either the abdomen or bacteriome – an organ hosting symbiotic bacteria, using various DNA extraction kits (Table S1). Metagenomic libraries of each individual were prepared with NEBNext Ultra II DNA Library Prep Kit (BioLabs, New England) and sequenced on Illumina HiSeq 2500 in a Rapid Run mode (2x250bp), HiSeq X, or NovaSeq 6000 S4 (2x150bp reads) (Table S1).
Single-copy nuclear gene assembly and multiple sequence alignment
We prepared the multiple sequence alignment (MSA) for phylogenetic analysis in the following steps: 1) the preparation of a custom single-copy gene reference database; 2) single-copy gene assembly from metagenomic data; 3) alignment and filtering. The analysis pipeline and the custom Python scripts can be found on the GitHub page (https://github.com/junchen-deng/phylogenomic_analysis_pipeline).
To enable read mapping and gene assembly from generally low-coverage genomic datasets, we first assembled our custom single-copy gene database for planthoppers using BUSCO v5.4.4 (Manni et al., 2021). The hemipteran ortholog database (hemiptera_odb10, v2020-08-05), including 2510 single-copy orthologs, was downloaded from BUSCO. Then, we extracted from GenBank genome assemblies of species covering a wide range of evolutionary distances to planthoppers. This included chromosome-level assemblies from pea aphid Acyrthosiphon pisum (Sternorrhyncha: Aphididae), glassy-winged sharpshooter Homalodisca vitripennis (Cicadomorpha: Clypeata: Membracoidea), sugarcane spittlebug Callitettix versicolor (Cicadomorpha: Clypeata: Cercopidae) and brown planthopper Nilaparvata lugens (Fulgoromorpha: Delphacoidea), and scaffold-level assembly from ‘Zanna’ intricata (synonym to Pyrops intricatus; Fulgoromorpha: Fulgoroidea) (Table S1). We then fed these assemblies to BUSCO and identified 2447, 2098, 2340, 2322, and 1168 complete single-copy genes, respectively. Finally, we combined these five datasets with busco_ancestral from the BUSCO hemiptera_odb10 dataset and generated the custom database – in amino acid – for the next step.
In the second step, we assembled the single-copy genes (orthologs) presented in the custom database using the metagenome of each species. We first cleaned Illumina paired-end reads in each metagenomic dataset by adapter trimming and quality control using trim_galore v0.6.4 (settings: --length 80 -q 30; https://github.com/FelixKrueger/TrimGalore). The quality of clean reads was confirmed by FastQC v0.11.9 (https://github.com/s-andrews/FastQC). Then, HybPiper v2.1.2 was used to assemble single-copy genes in each clean metagenomic dataset (Johnson et al., 2016). HybPiper first performed target enrichment by blasting and mapping Illumina reads against the custom database using DIAMOND v2.0.15 (setting: -t_aa custom_db --diamond --diamond_sensitivity sensitive) (Buchfink et al., 2021). The clustered reads for each single-copy gene were then assembled with SPAdes v3.15.5 implemented in Hybpiper (Prjibelski et al., 2020). Due to the low coverage of the host genome in our metagenomic dataset, we reduced the coverage cutoff for SPAdes assembly to four (setting: --cov_cutoff 4) to increase the contig length without losing much base-level accuracy. This allowed us to assemble all 2510 single-copy genes presented in the custom database from at least some samples.
In the final step, we implemented several rounds of filtering to obtain clean alignments for each ortholog. We first ran hybpiper paralog_retriever to detect and remove potential paralogs. This reduced the dataset to 2313 genes. Then, we ran Fast Statistical Alignment (FSA) with fsa v1.15.9 to generate alignments of amino acids for each gene including the target species and two outgroups (H. vitripennis and C. versicolor) (Table S1; Bradley et al., 2009); FSA allows more reliable identification of non-homologous sequence, which improves overall alignment quality in our case because our data contained some reads of other non-target eukaryotic sources (e.g., symbiotic fungi and insects other than planthoppers, such as putative parasitoids) in our metagenomic data when using a low coverage cutoff for spades assembly in the previous step. Next, trimAl v1.4 was used to trim the amino acid alignments (setting: -automated1) (Capella-Gutierrez et al., 2009). The nucleotide alignments were generated by reverse translating the trimmed amino acid alignments based on the column numbers output by trimAl v1.4 (setting: -colnumbering). Then, we ran other custom Python scripts to filter trimmed alignments by keeping genes that 1) are longer than 200 amino acids and 2) were reconstructed for at least 75% of all target species (i.e. 112 out of 149 species in our case) with sequence length above 25% of the gene length. This further reduced the dataset to 1508 genes. Finally, we ran FastTree v2.1.11 to generate a Maximum Likelihood tree for each gene (setting: -nt -gtr) (Price et al., 2010). We then manually examined each gene tree by looking for unusual topologies (e.g., long branches of a single, few, or a clade of species); long branches of a single or few species can result from short sequence length or contamination while long branches of a clade of species may indicate potential paralogs not detected by Hybpiper. The alignments of gene trees with unusual topologies were further examined manually. The problematic alignments were either cleaned by manually cutting out the troublesome regions or sequences or excluded from the subsequent analyses. At this stage, we also decided to exclude two species, Stenocranus major and Parahydriena hyalina acuta, which were placed repeatedly with outgroups due to significant metagenome contamination from the DNA of putative parasitoid insects. In the end, we reached a final list of 1164 single-copy orthologs from 149 species.
Mitogenome assembly, annotation, and mitochondrial marker extraction
The clean Illumina reads obtained in previous steps after adaptor trimming and quality control were assembled using MEGAHIT v1.1.3 with k-mer sizes from 99 to 255 (Li et al., 2016). We then used NanoTax.py (https://github.com/junchen-deng/NanoTax), which performs blast searches using assembled contigs against a custom database containing previously published mitogenomes, to identify mitochondrial contigs. Complete or partial mitogenomes of 149 out of 151 planthopper species were assembled from metagenomes. The mitogenomes of S. major and P. hyalina acuta from contaminated metagenomes were successfully assembled and verified while the mitogenomes of two other species, Bebaiotes sp. and Omolicna uhleri, failed to assemble (Table S1). The identified mitochondrial contigs were then annotated with a custom Python script modified from (Łukasik et al., 2019). The script first extracted all the Open Reading Frames (ORFs) and their amino acid sequences from a mitogenome. Then, the ORFs were searched recursively using HMMER v3.3.1 (Eddy, 2011), through custom databases containing manually curated sets of planthopper mitochondrial protein-coding and rRNA genes. The rRNA genes were searched with nhmmer implemented in HMMER v3.3.1 (Wheeler and Eddy, 2013), and tRNAs were identified with tRNAscan-SE v2.0.7 (Chan et al., 2021). All 13 mitochondrial protein-coding genes (PCGs) were annotated and extracted for each species. Additionally, we extracted mitochondrial markers from previously published mitogenomes of 134 planthopper species on GenBank (published before February 2024; Table S1), mostly from Eastern Palearctic and Oriental taxa not well-represented in the primary dataset. Mitogenomes of four outgroups from Cicadomorpha, including the keeled treehopper Entylia carinata (superfamily Membracoidea), the cicada Tettigades undata (superfamily Cicadoidea), the meadow spittlebug Philaenus spumarius (superfamily Cercopoidea), and the leafhopper Macrosteles quadrimaculatus (superfamily Membracoidea), were included in the annotation (Table S1). The amino acid alignment of each mitochondrial marker was inferred with Mafft v7.475 (settings: --localpair --maxiterate 1000; Katoh and Standley, 2013). The nucleotide alignments were generated by reverse translation of amino acid alignments as described in the previous steps.
Phylogenetic inference
We prepared four concatenated datasets, with both amino acid and nucleotide alignments included in each dataset, for the phylogenetic analyses. These were 1) 13 mitochondrial PCGs (length: 6.7 kbp) from 149 species sequenced in this study; and 2) the same mitochondrial genes from 282 species including mitogenomes downloaded from GenBank; 3) the 1164 nuclear markers (length: 1.13 Mbp) from 149 species sequenced in this study; and 4) the combined dataset of 13 mitochondrial and 1164 nuclear markers (length: 1.14 Mbp) from 285 species. We reasoned that the two mitochondrial datasets would allow us to assess how the increasing sampling improves the phylogeny and make a more direct comparison with published phylogenies that were mostly based on mitochondrial markers.
To evaluate whether our datasets have evolved under globally stationary, reversible, and homogeneous (SRH) conditions, we applied matched-pairs tests of homogeneity (Bowker’s test) implemented in SymTest v2.0.55 (https://github.com/ottmi/symtest) to test for potential compositional heterogeneity (Jermiin et al., 2004; Misof et al., 2014). Tests were applied on 1) the amino acid dataset, 2) the nucleotide dataset with all codon positions, and 3) the nucleotide dataset including first and second codon positions only. The results indicate that the amino acid dataset exhibited much less compositional heterogeneity compared to the nucleotide datasets, of which the dataset containing only the first and second codon positions showed less among-lineage heterogeneity than the one including all codon positions (Fig. S1). Additionally, the third codon position possibly suffers strong substitution saturation after >200 million years of evolution (Johnson et al., 2018). Taking together, we used the amino acid dataset and the nucleotide dataset including only the first and second codon positions for the subsequent phylogenetic tree inference.
All datasets were partitioned by genes. All phylogenies were inferred with the Maximum Likelihood (ML) approach in IQ-TREE2 v2.2.0.3 (Minh et al., 2020). For nucleotide datasets, all available DNA models were tested in IQ-TREE2. For amino acid datasets, selected protein models Q.insect, WAG, LG, and JTT (setting: -mset Q.insect, WAG, LG, JTT) were tested on nuclear datasets and the model mtInv (setting: -mset mtInv) was tested on mitochondrial datasets for computational efficiency. To decide on the partitioning scheme and the substitution model, we asked IQ-TREE2 to perform extended model selection on each gene with free rate heterogeneity followed by tree inference and subsequently merge two or more genes until the model fit does not increase any further (setting: -m MFP+MERGE; Chernomor et al., 2016; Kalyaanamoorthy et al., 2017). The best partitioning scheme and the best-fit models were chosen by the highest BIC (Bayesian Information Criterion) scores. Bootstrapping was conducted using the approximate likelihood ratio test (SH-aLRT) and ultrafast bootstrap methods with 1000 replicates as well as the approximate Bayes test (setting: -B 1000 --alrt 1000 --abayes; Anisimova et al., 2011; Anisimova and Gascuel, 2006; Hoang and Chernomor, 2017). All other setting options were kept default.
To account for the differences in the evolutionary history of genes, we inferred the species tree from gene trees using the Accurate Species Tree ALgorithm (ASTRAL-III) (Zhang, 2018). Specifically, we used Weighted ASTRAL (wASTRAL v1.15.2.3) to take into account phylogenetic uncertainty by integrating signals from branch length and branch support in gene trees (Zhang, 2022). We used the nucleotide alignment of each gene without the third codon position as the input to IQ-TREE2, which inferred each gene tree with 1000 replicates of ultrafast bootstrapping. Then, the combined tree file was analyzed in wASTRAL with 16 rounds of search and subsampling (setting: -R).
Divergence time estimation with fossils
We used the nuclear dataset comprising the first and second codon positions for the molecular dating analyses. From all 1164 nuclear markers, we filtered out genes with less than 143 out of 149 (i.e. <95%) species present. This results in 57 markers. Then, we concatenated these alignments into one dataset. In addition, we made five gene subsets out of the 57 genes, each of which contains 11 to 12 randomly assigned markers. Each subset was concatenated into a separate dataset. We then decided on the best partitioning scheme and substitution model for each dataset in IQ-TREE2 (setting: -m MF+MERGE).
For node calibration, we assigned 30 fossils as calibration points throughout the tree (Table S2). The taxonomic placement of each fossil was determined based on the apomorphy of the corresponding taxon. Fossil taxa were selected based on their verified taxonomic and chronostratigraphic placement, to cover as low classification levels as possible, especially for the lineages present in molecular analyses. The fossil age was used as the minimum constraint of the node. Maximum constraints were chosen based on the phylogenetic placement, estimated dates from previously published studies, and the fossil history of certain clades. For the root, the oldest fossils assigned to the extant lineages of Fulgoromorpha are Barremixius petrinus and Karebopodoides aptianus of family Cixiidae, dated to the Lower Cretaceous, Barremian (125.77 - 121.4 Ma; Luo et al., 2021; Maksoud et al., 2017). Thus, we placed the minimum constraint of the root at the calibrated age of the two fossils (125 Ma). For the maximum constraint of the root, we referred to the extinct superfamilies Coleoscytoidea, Surijokocixioidea, and Fulgoridioidea, for which the oldest fossils are known from the Permian, Roadian (273.15 - 266.5 Ma), the Permian, Wordian (267.3 - 264.12 Ma), and the Lower Jurassic, Hettangian (201.6 - 199.2 Ma), respectively (Bucher et al., 2024; Szwedo, 2018; Szwedo et al., 2004). It is generally accepted that Coleoscytoidea is a side group, distantly related to the modern crown Fulgoromorpha, while Surijokocixioidea and Fulgoridioidea are considered as stem groups to the modern crown group (Bourgoin and Szwedo, 2023, 2022; Szwedo, 2018). However, whether Surijokocixioidea or Fulgoridioidea is more closely related to the crown group is still under debate (Bourgoin and Szwedo, 2023; Brysz and Szwedo, 2019; Szwedo, 2017). We chose the maximum constraint of the root at the base of Wordian (267.3 Ma) where the oldest fossil of Surijokocixioidea resides. Johnson et al. (2018) estimated the root age of crown Fulgoromorpha at ~206 Ma with a 95% confidence interval (hereafter: CI) ranging from 178 to 241 Ma. Thus, our constraints can be considered relatively conservative (Table S2).
Regarding the maximum constraint for fossils in different modern families, the oldest fossils came from the Lower Cretaceous, Barremian (125.77 - 121.4 Ma) as mentioned above and all fossils from earlier periods, such as Jurassic (201.6 - 145 Ma), belong to the extinct, more ancient stem groups such as Fulgoridioidea. Thus, we placed the maximum constraint at the base of Jurassic (201.6 Ma) for fossils in families Cixiidae, Delphacidae, Kinnaridae, Achilidae, Derbidae, Fulgoridae, and Dictyopharidae. As for other relatively recently diverged families, since there are no clear patterns in fossil distribution, we referred to their host plants. Flowering plants, which are the main food source of all modern crown group planthoppers, first appeared in the fossil record in the early Cretaceous and began to diversify in the mid to later Cretaceous (Zuntini et al., 2024). It is very likely that the diversification of flowering plants drove the diversification of the derived planthopper families. Therefore, we placed the maximum constraint at the base of the Cretaceous (145 Ma) for families Tropiduchidae, Lophopidae, Nogodinidae, Issidae, Caliscelidae, and Ricaniidae (Table S2). The ages of all geologic periods were derived from the ICS International Chronostratigraphic Chart v2023/09 (http://www.stratigraphy.org/ICSchart/ChronostratChart2023-09.pdf).
We also applied a different set of maximum constraints based completely on previously inferred time trees. We referred to the work done by Johnson et al. (2018) where they produced a dated phylogeny for major Hemipteran clades including nine planthopper families. We chose the upper 95% confidence interval of the estimated age of the node that goes back one node deeper in the phylogeny as the maximum constraint for the target family. For example, the split between Acanaloniidae and Flatidae was dated to 76 Ma (95% CI: 58 - 96 Ma), and the split between the Acanaloniidae-Flatidae clade and Caliscelidae was dated to 90 Ma (95% CI: 70 - 109 Ma). Then, we chose 109 Ma as the maximum constraint for fossils in Flatidae and Acanaloniidae. Following this rule, we placed the maximum constraint at 241 Ma for fossils of Cixiidae, Delphacidae, Kinnaridae, Achilidae, Derbidae, Fulgoridae, and Dictyopharidae, 144 Ma for Tropiduchidae, and 123 Ma for Lophopidae, Nogodinidae, Issidae, Caliscelidae, and Ricaniidae. The maximum constraint of the root was set to 349 Ma as the upper 95% confidence interval of the estimated age of auchenorrhynchans (309 Ma, 95% CI: 275 - 349 Ma) (Table S2). The alternative constraints are more conservative for the root and relatively old families and less conservative for the derived families than our constraints based on fossil history and host plants.
The molecular dating analyses were performed in BEAST v2.7.6 (Bouckaert and Vaughan, 2019). We placed monophyly constraints on the nodes where both SH-aLRT and ultrafast bootstrap support were >95. For each dataset, the partition and its site model were set according to the best scheme found by IQ-TREE2 and were estimated later in BEAST. The rates of the branches were modeled under the optimized relaxed clock (Douglas et al., 2021). The uniform prior was applied to the root. For calibration points, we applied three different priors: uniform prior and log-normal prior with two different standard deviations (S = 1.0 or 0.5). For log-normal prior, we gave an offset equal to the minimum age and an adjusted mean (M) placing the maximum age at the 97.5% percentile. Other priors were kept default. Each dataset was run three times independently on each prior setting with a Markov Chain Monte Carlo (MCMC) chain length of at least 50 million to ensure good mixing. In addition, a run without sequence data (‘sample from prior’ in BEAST) was also performed for each dataset to examine the effective priors, compared to the specified priors. MCMC results were accessed with Tracer v1.7.2 (Rambaut et al., 2018). The verified log and tree files of the three independent runs using the same dataset were combined using LogCombiner v2.7.4 with 10% burnin. The maximum clade credibility tree of each dataset with mean node heights was summarised from the combined tree set by TreeAnnotator v2.7.4. All trees were visualized in FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/). The node ages were extracted using the treeio R package (Wang et al., 2019). The summary of node ages was illustrated using Processing 3 (http://www.processing.org). All figures were edited in Inkscape (https://inkscape.org).
创建时间:
2024-12-28



