Supporting data for: The de novo genome of the Black-necked Snakefly (Venustoraphidia nigricollis Albarda, 1891): A resource to study the evolution of living fossils

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.kwh70rz9h

Description:

Snakeflies (Raphidioptera) are the smallest order of holometabolous insects that have kept their distinct and name-giving appearance since the Mesozoic, probably since the Jurassic, and possibly even since their emergence in the Carboniferous, more than 300 million years ago. Despite their interesting nature and numerous publications on their morphology, taxonomy, systematics, and biogeography, snakeflies have never received much attention from the general public, and only a few studies have been devoted to their molecular biology. Because of this lack of molecular data, it is unknown whether the conserved morphology of these living fossils translates to conserved genomic structures. Here, we present the first genome of the species and of the entire order Raphidioptera. The final genome assembly has a total length of 669 Mbp and reaches high contiguity, with an N50 of 5.07 Mbp. Further quality controls also indicate high completeness and no meaningful contamination. The newly generated data were used in a large-scale phylogenetic analysis of snakeflies using shared orthologous sequences. Quartet score and gene-concordance analyses revealed high amounts of conflicting signal within this group, which may indicate substantial incomplete lineage sorting and introgression following their presumed re-radiation after the asteroid impact 66 million years ago. Overall, this reference genome will be a door-opening dataset for many future research applications, and we demonstrate its utility in a phylogenetic analysis that provides new insights into the evolution of this group of living fossils.

Methods

The de novo reference genome was sequenced with PacBio HiFi reads. All HiFi reads were assembled using hifiasm 0.16.1 (Cheng et al., 2021; Cheng et al., 2022). Raw primary contigs were filtered for contamination using blobtools 1.1.1 (Laetsch & Blaxter, 2017). The filtered contigs were then polished using all HiFi reads.
Polishing was done by first mapping the HiFi reads to the filtered contigs using minimap2 2.24 with the options "-a -x map-hifi". The mapping results were sorted by coordinates using samtools 1.15 with the options "-l 9 -O BAM". Duplicates were removed using picard 2.26.10 MarkDuplicates (https://github.com/broadinstitute/picard) with the option "--REMOVE_DUPLICATES". The assembly fasta file and the duplicate-filtered bam file were indexed with samtools faidx and samtools index, respectively. Variants were identified using DeepVariant 1.2 (https://github.com/google/deepvariant) with the option "--model_type=PACBIO". Heterozygous variants were filtered out with bcftools 1.15 (Danecek et al., 2021) using the command "view" with the options "-f 'PASS' -i 'GT="1/1"' --no-version -Oz". The compressed vcf file was then indexed using tabix from HTSlib 1.15 (Bonfield et al., 2021). Finally, bcftools consensus was used to generate the polished contigs from the filtered hifiasm contigs and the filtered variant set.

Repeats specific to V. nigricollis were identified using RepeatModeler 2.0.1 (Flynn et al., 2020) in combination with RepeatMasker 4.1.0 (www.repeatmasker.org/RepeatMasker/), RECON 1.08 (Bao & Eddy, 2002), RepeatScout 1.0.6 (Price et al., 2005), Tandem Repeats Finder 4.10 (Benson, 1999) and RMBlast 2.11.0+ (www.repeatmasker.org/rmblast/). RepeatModeler was run with the options "-pa 16 -LTRStruct". The resulting repeat families were combined with all Hexapoda repeat sequences from RepBase release 27.06 (Bao et al., 2015) and used as input for RepeatMasker 4.1.2 together with the options "-xsmall -no_is -e ncbi -pa 16 -s".

The soft-masked genome assembly was used for gene annotation as implemented in the BRAKER3 pipeline (Gabriel et al., 2023). This approach combines de novo gene calling, transcriptome-based gene annotation using the transcriptome of V. nigricollis (Vasilikopoulos et al., 2020), and homology-based gene annotation.
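
The variant filter used in the polishing step above keeps only calls marked PASS whose genotype is homozygous for the alternative allele (1/1), so that heterozygous sites do not alter the consensus. A minimal Python sketch of that selection logic is shown here for illustration; it is not the authors' code and does not replace bcftools. The function name `keep_homozygous_pass` and the example records are hypothetical, and the sketch assumes a single-sample, tab-separated VCF with unphased genotypes.

```python
from typing import Iterable, Iterator


def keep_homozygous_pass(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and records with FILTER == PASS and GT == 1/1.

    Assumption: single-sample VCF; only the exact string "1/1" is matched.
    """
    for raw in vcf_lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        fields = line.split("\t")
        filter_value = fields[6]              # 7th VCF column: FILTER
        format_keys = fields[8].split(":")    # 9th column: FORMAT keys
        sample_values = fields[9].split(":")  # 10th column: first sample
        if "GT" not in format_keys:
            continue
        genotype = sample_values[format_keys.index("GT")]
        if filter_value == "PASS" and genotype == "1/1":
            yield line


# Made-up records for illustration only; they are not taken from the dataset.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample",
    "ctg1\t10\t.\tA\tG\t50\tPASS\t.\tGT:GQ\t1/1:40",     # kept
    "ctg1\t20\t.\tC\tT\t50\tPASS\t.\tGT:GQ\t0/1:40",     # heterozygous, dropped
    "ctg1\t30\t.\tG\tA\t3\tRefCall\t.\tGT:GQ\t1/1:5",    # not PASS, dropped
]

kept_records = [
    line for line in keep_homozygous_pass(example_vcf) if not line.startswith("#")
]
print(len(kept_records), "record(s) retained")
```

In this toy input only the record at position 10 satisfies both conditions, which mirrors the intent of restricting the consensus step to confident homozygous differences between the reads and the contigs.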
For protein references, we combined the Arthropoda-specific protein collection from OrthoDB following the recommendations in the BRAKER user guide (www.github.com/Gaius-Augustus/BRAKER). The resulting proteome was tested for completeness using BUSCO v.5.4.75.3.1 (Manni et al., 2021) in "protein mode" and run against the insect-specific set of core genes. Functional annotation was done using InterProScan v5 (Jones et al., 2014).

Phylogenetic reconstruction was performed using the BUSCO-to-Phylogeny wrapper function (Schneider et al., 2021); the applied code is available at www.github.com/mag-wolf/BUSCO-to-Phylogeny. Publicly available transcriptome data (Table S1) of other Raphidioptera species were downloaded from NCBI SRA, and short reads were assembled using Trinity v2.8.5 (Grabherr et al., 2011) with default parameters. The resulting transcriptomes, as well as the genome assembly constructed here, were annotated using the BUSCO v5.4.3 (Manni et al., 2021) function in short mode and restricted to the insecta_odb10 dataset of OrthoDB (Kriventseva et al., 2019). We extracted single-copy orthologous sequences (SCOS) with no more than 25 % missing species, and the orthologous sequences were aligned using MAFFT v7.475 (Katoh & Standley, 2013) with 1000 iterative refinements. Alignments were trimmed using ClipKIT v1.1.3 (Steenwyk et al., 2020) in the "kpic-smart-gap" mode to allow for an additional smart-gap-based trimming. Based on the trimmed alignments, gene trees were constructed using IQ-TREE v2.1.2 (Minh et al., 2020) with 1000 bootstrap replicates each. We further filtered gene trees and alignments based on the maximum likelihood genetic distance calculated by IQ-TREE. To do this, we removed orthologs falling below the 5 % quantile or above the 95 % quantile to avoid including misalignments and sequences with too little information for a meaningful tree construction.

References

Baid G, Cook DE, Shafin K, Yun T, Llinares-López F, Berthet Q, Belyaeva A, Töpfer A, Wenger AM, Rowell WJ et al. 2023. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nature Biotechnology 41: 232–238.
Bao W, Kojima KK, Kohany O. 2015. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6: 11.
Bao Z, Eddy SR. 2002. Automated de novo identification of repeat sequence families in sequenced genomes. Genome research 12: 1269–1276.
Benson G. 1999. Tandem repeats finder. A program to analyze DNA sequences. Nucleic acids research 27: 573–580.
Bonfield JK, Marshall J, Danecek P, Li H, Ohan V, Whitwham A, Keane T, Davies RM. 2021. HTSlib. C library for reading/writing high-throughput sequencing data. GigaScience 10.
Cheng H, Concepcion GT, Feng X, Zhang H, Li H. 2021. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18: 170–175.
Cheng H, Jarvis ED, Fedrigo O, Koepfli K-P, Urban L, Gemmell NJ, Li H. 2022. Haplotype-resolved assembly of diploid genomes without parental data. Nature Biotechnology 40: 1332–1335.
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM et al. 2021. Twelve years of SAMtools and BCFtools. GigaScience 10.
Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. 2020. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences of the United States of America 117: 9451–9457.
Gabriel L, Brůna T, Hoff KJ, Ebel M, Lomsadze A, Borodovsky M, Stanke M. 2023. BRAKER3. Fully Automated Genome Annotation Using RNA-Seq and Protein Evidence with GeneMark-ETP, AUGUSTUS and TSEBRA. bioRxiv.
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q et al. 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29: 644–652.
Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G et al. 2014. InterProScan 5. Genome-scale protein function classification. Bioinformatics (Oxford, England) 30: 1236–1240.
Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7. Improvements in performance and usability. Molecular biology and evolution 30: 772–780.
Kriventseva EV, Kuznetsov D, Tegenfeldt F, Manni M, Dias R, Simão FA, Zdobnov EM. 2019. OrthoDB v10. Sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic acids research 47: D807–D811.
Laetsch DR, Blaxter ML. 2017. BlobTools. Interrogation of genome assemblies. F1000Research 6: 1287.
Li H. 2018. Minimap2. Pairwise alignment for nucleotide sequences. Bioinformatics (Oxford, England) 34: 3094–3100.
Manni M, Berkeley MR, Seppey M, Zdobnov EM. 2021. BUSCO. Assessing Genomic Data Quality and Beyond. Current protocols 1: e323.
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, Haeseler A von, Lanfear R. 2020. IQ-TREE 2. New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular biology and evolution 37: 1530–1534.
Price AL, Jones NC, Pevzner PA. 2005. De novo identification of repeat families in large genomes. Bioinformatics (Oxford, England) 21 Suppl 1: i351–8.
Schneider C, Woehle C, Greve C, D'Haese CA, Wolf M, Hiller M, Janke A, Bálint M, Huettel B. 2021. Two high-quality de novo genomes from single ethanol-preserved specimens of tiny metazoans (Collembola). GigaScience 10.
Steenwyk JL, Buida TJ, Li Y, Shen X-X, Rokas A. 2020. ClipKIT. A multiple sequence alignment trimming software for accurate phylogenomic inference. PLoS Biology 18: e3001007.
Vasilikopoulos A, Misof B, Meusemann K, Lieberz D, Flouri T, Beutel RG, Niehuis O, Wappler T, Rust J, Peters RS et al. 2020. An integrative phylogenomic approach to elucidate the evolutionary history and divergence times of Neuropterida (Insecta: Holometabola). BMC Evolutionary Biology 20: 64.

Created:

2023-12-05