A chromosome-level genome for the nudibranch gastropod Berghia stephanieae helps parse clade-specific gene expression in novel and conserved phenotypes

NIAID Data Ecosystem2026-05-01 收录

下载链接：

http://datadryad.org/dataset/doi%253A10.6076%252FD1BS33

下载链接

链接失效反馈

官方服务：

资源简介：

How novel phenotypes originate from conserved genes, processes, and tissues remains a major question in biology. Research that sets out to answer this question often focuses on the conserved genes and processes involved, an approach that explicitly excludes the impact of genetic elements that may be classified as clade-specific, even though many of these genes are known to be important for novel, or clade-restricted, phenotypes. This is especially true for understudied phyla such as mollusks, where limited genomic and functional biology resources for members of this phylum has long hindered assessments of genetic homology and function. To address this gap, we constructed a chromosome-level genome for the gastropod Berghia stephanieae (Valdés, 2005) to investigate the expression of clade-specific genes across both novel and conserved tissue types in this species. The final assembled and filtered Berghia genome is comparable to other high quality mollusk genomes in terms of size (1.05 Gb) and number of predicted genes (24,960 genes), and is highly contiguous. The proportion of upregulated, clade-specific genes varied across tissues, but with no clear trend between the proportion of clade-specific genes and the novelty of the tissue. However, more complex tissue like the brain had the highest total number of upregulated, clade-specific genes, though the ratio of upregulated clade-specific genes to the total number of upregulated genes was low. Our results, when combined with previous research on the impact of novel genes on phenotypic evolution, highlight the fact that the complexity of the novel tissue or behavior, the type of novelty, and the developmental timing of evolutionary modifications will all influence how novel and conserved genes interact to generate diversity. Methods Sample preparation and genome sequencing We isolated one Berghia juvenile from the Lyons lab culture prior to mating to minimize genomic contamination. While isolated, we fed the animal ~½ of a medium Exaiptasia diaphana (defined by Taraporevala et al. 2022) each day for 34 days. We then starved the animal for 44 days prior to shipping. To minimize residual food in the gut diverticula, cerata were removed with forceps and the remaining body was blotted on a kimwipe to remove excess water, then the animal was placed in a cryotube and flash frozen in liquid nitrogen and stored at -80 until shipping to Dovetail Genomics (now Catana Bio, Scotts Valley, CA). Technicians at Dovetail Genomics used an input of ~101mg into a slow CTAB protocol, which yielded 12.1µg using the Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA) High Sensitivity Kit. They then used a Mini Column for clean up and resuspended the pellet in 75 µl TE. They then quantified DNA samples using the Qubit. They constructed the PacBio SMRTbell library (~20kb) for PacBio Sequel using SMRTbell Express Template Prep Kit V 2.0 (PacBio, Menlo Park, CA, USA) using the manufacturer recommended protocol. They then bound the library to polymerase using the Sequel II Binding Kit 2.0 (PacBio) and loaded onto PacBio Sequel II (PacBio) on 8M SMRT cells with approximately 160x sequence coverage (SRA Accession TBD). Short read RNA sample collection and sequencing We obtained Berghia adult tissue samples, including the: (1) brain, (2 samples; SRA Accessions TBD) (2) oral tentacles (3 samples; SRR12072210, SRA Accessions TBD), (3) rhinophores (3 samples; SRR12072209, SRA Accessions TBD), (4) foot (2 samples; SRR12072206, SRA Accessions TBD), (5) tail (2 samples; SRR12072205, SRA Accessions TBD), and (6) proximal (3 samples; SRR12072208, SRA Accessions TBD) and (7) distal ceras (3 samples; SRR12072207, SRA Accessions TBD). We also obtained earlier stage transcriptome data from: (8) multiple embryonic stages (bulk sample of 500-600 individual embryos from each time point reared at 27°C (12, 24, 36, 48, 60 hpo and 4, 7, and 9 dpo; SRR12072213) and (9) juveniles 15 dpo at 27°C (500 individuals from 3 egg masses laid the same day; SRR12072212). We starved adults for ~1 week prior to removal of some tissues (all but the brain) to reduce symbiont presence and minimize contamination. We extracted total RNA from most adult tissues (minus the brain) using the RNeasy Kit (QIAGEN, Redwood City, CA) and submitted the extracted total RNA to Novogene Ltd. (Sacramento, CA) for quality assessment, library preparation and sequencing (Illumina NovaSeq 6000; 150bp paired-end reads). We prepared the adult brain total RNA using the Clontech SmartSeq v4 Ultra-Low Input RNA Kit (Takara). We prepared libraries with the Nextera XT DNA Library Preparation Kit and 96-Sample Index Kit (Illumina, San Diego, CA) and quantified them using Qubit (ThermoFisher Scientific, Waltham, MA) and assessed quality using a Bioanalyzer (Agilent, Santa Clara, CA). We sequenced the brain sample on the Illumina NextSeq 500 (75bp paired-end reads) at the Genomics Resource Laboratory, University of Massachusetts, Amherst. For the first two samples (bulk embryonic stages and juveniles), total RNA was extracted using TRIzol (Ambion) following the standard protocol, quality was assessed using Tapestation (Agilent) and sent to the IGM UCSD Genomic Center for library preparation (TruSeq mRNA stranded library) and sequencing (Illumina NovaSeq 6000; 150bp paired-end reads). Reference Transcriptome Construction Berghia samples used for reference transcriptome construction included samples: single samples from multiple adult tissues, selected at random, including the (1) brain (SRR12072211), (2) oral tentacle (SRR12072210), (3) rhinophore (SRR12072209), (4) foot (SRR12072206), (5) tail (SRR12072205), (6) proximal (SRR12072208) and (7) distal ceras (SRR12072207), as well as samples (8) embryos (SRR12072213), and (9) juveniles (SRR12072212). We merged all FASTQ output files for a subset of short read RNA-seq samples into two files (Read 1 and Read 2) for downstream analysis. We used default parameters for all programs unless otherwise specified. We trimmed and filtered reads using fastp (version 0.20.0; [1]), and assembled transcripts using Trinity (version 2.9.1; [2]). We predicted open reading frames (ORFs) with TransDecoder (version 5.5.0; [3]). Duplication levels were quite high (~56%), so we clustered predicted ORFs using CD-HIT-EST (version 4.8.1; [4, 5]) at 95% identity and word size of 11 (-c 0.95, -n 11). Post-clustering, we filtered transcripts with alien_index (https://github.com/josephryan/alien_index); based on an algorithm described in [6]). We constructed alien index databases using previously constructed metazoan and non-metazoan databases (obtained from http://ryanlab.whitney.ufl.edu/downloads/alien_index) and all “Symbiodinium” sequences present on UniProt [7] as of 31 March 2020. We removed all sequences with an alien index greater than 45 from the transcriptome. We then compiled full transcripts for each predicted ORF sequence remaining from the assembled transcriptome using a custom Python script (full_transcripts.py). We then scanned the transcriptome for vectors and possible contaminants via the NCBI VecScreen (https://www.ncbi.nlm.nih.gov/tools/vecscreen/). We removed vectors using a small script (trim_adapters.pl) available through the Trinity Community Codebase (https://github.com/trinityrnaseq/trinity_community_codebase). We removed or trimmed sequences containing contamination using the Contaminants.txt file provided by NCBI and a custom script (remove_contamination.py). Custom scripts are available at https://github.com/lyons-lab/berghia_reference_transcriptome). We assessed transcriptome quality across all steps using BUSCO (version 4.0.5; [8, 9]) scores by comparing assembled transcripts to the metazoa_odb10 and mollusca_odb10 databases. Long read RNA sample collection and sequencing We obtained Berghia adult tissue samples from animals starved for at least four weeks to minimize gut contaminants, including the (1) head (one animal), (2) oral tentacles (two animals), (3) rhinophores (three animals), (4) cerata (one animal), (5) mantle (one animal), and (6) homogenized mid-body tissue (one animal). We also collected two developmental samples, including (1) embryos from the trochophore (72 hours post oviposition; 300 animals) and eyed veliger stages (9-10 days post oviposition; 120 animals), and (2) post-metamorphic and post-feeding juveniles (34 animals). We extracted total RNA using the standard TRIzol Reagent (Life Technologies, Carlsbad, CA, USA) protocol, with some modifications: After the addition of chloroform, we centrifuged samples for 20 minutes at max speed (16,000 RCF) and precipitated samples in 100% isopropanol for ~1 hour at -20°C. We assessed total RNA sample quality on a 1% agarose gel and quantified the RNA in each sample with a Qubit 2.0 High Sensitivity kit (ThermoFisher Scientific, Waltham, MA). We then pooled developmental stages (embryos and juveniles; DEV), adult rhinophore and oral tentacle samples (RHOT), and adult mantle and cerata samples (MCE) in equivalent amounts. We sent these five total RNA samples (DEV, MCE, RHOT, head, mid-body) to the Roy J. Carver Biotechnology Center at the University of Illinois at Urbana-Champaign for IsoSeq library construction (5 libraries) and sequencing. They performed sequencing (on the five pooled libraries) on a single SMRT 8M cell with the PacBio Sequel II (PacBio, Menlo Park, CA) and a 30h movie. They then clustered the raw subreads using the IsoSeq v3 clustering workflow (https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md). Genome assembly and scaffolding Bioinformaticians at Dovetail used 167 gigabase-pairs of PacBio CLR reads as an input to WTDBG2 v2.5 [10] with genome size 2.0g, minimum read length 20000, and minimum alignment length 8192. Additionally, they enabled realignment with the -R option and set read type with the option -x sq. They then used BLASTn results of the WTDBG2 output assembly (asm.cns.fa) against the nt database as input for blobtools v1.1.1, and scaffolds identified as possible contamination were removed from the assembly (filtered.asm.cns.fa). Finally, they used purge_dups v1.2.3 [11] to remove haplotigs and contig overlaps (purged.fa). For scaffolding, Dovetail fixed chromatin in place with formaldehyde in the nucleus for extraction and analysis via Dovetail® Omni-C® proximity ligation. They then digested the fixed chromatin with DNAse I, repaired the chromatin ends and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter containing ends. After proximity ligation, they reversed crosslinks and purified the DNA. They treated purified DNA to remove biotin that was not internal to ligated fragments. They generated sequencing libraries using NEBNext Ultra enzymes and Illumina-compatible adapters. They then isolated biotin-containing fragments using streptavidin beads before PCR enrichment of each library. Technicians then sequenced the library on an Illumina HiSeqX platform to produce approximately 30x sequence coverage. They then used HiRise MQ>50 reads for scaffolding (see "read-pair" above for figures). Bioinformaticians at Dovetail used input de novo assembly and Dovetail Omni-C library reads as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies [12]. They then aligned Dovetail Omni-C library sequences to the draft input assembly using BWA with default parameters [13, 14]. They then analyzed separations of Dovetail Omni-C read pairs mapped within draft scaffolds by Hi-Rise to produce a likelihood model for genomic distance between read pairs, and used the model to identify and break putative misjoins, to score prospective joins, and make joins above a threshold. We initially filtered the Berghia stephanieae genome with purge_dups v1.2.5 [11] to automatically identify and remove haplotigs and contig/scaffold overlaps from heterozygous sites. Following duplicate purging, we assessed completeness with BUSCO v5.1.2 [9] by comparing to metazoa_odb10 and mollusca_odb10. BUSCO further used the programs HMMER v3.1 [15] and MetaEuk v4.a0f584d [16] for gene prediction and analysis. We then used Nucleotide-Nucleotide BLAST 2.11.0+ [17, 18] to compare our scaffolds to the nt database (downloaded April 2021), and mapped the original PacBio reads used for assembly via minimap2 v2.18-r1015 [19]. With these results, we used BlobToolKit (Challis et al. 2020)(blobtools2 filter option) to remove additional scaffolds considered contamination. Scaffold selection for removal was based on BLASTn hits (we removed no-hit and bacterial contamination) and a minimum size threshold (150 kb). This size threshold was selected because it was the point at which the removal of sequences would not change the BUSCO score, as determined by the use of BlobToolKit Viewer v1.1 [20]. The removed sequences contained differences in GC content and coverage among those that were retained in the final annotated genome. We also performed a Nucleotide-Nucleotide BLAST 2.11.0+ to compare the removed sequences with the final genome to assess duplication rates. Of those sequences removed from the final genome, 92.8% hit to one of the final 18 scaffolds (98.7% of which with an e-value of 0.0), and 1.6% of removed sequences were obvious contaminants. Gene prediction and annotation We analyzed filtered genome scaffolds with RepeatModeler v2.0.1 [21], which used Tandem Repeat Finder (TRF) v4.09 [22], RECON v1.08 [23], RepeatScout v1.0.6 [24], and RepeatMasker v4.1.2 (https://www.repeatmasker.org), to construct a de novo repeat library for Berghia stephanieae. This included the ---LTRStruct flag to run an LTR Structural Analysis, which used GenomeTools v1.6.1 (http://genometools.org), LTR_Retriever v2.9.0 [25], Ninja v1.10.2 (https://github.com/ninja-build/ninja), MAFFT v7.480 [26], and CD-HIT v4.8.1 [4, 5]. We then used this species-specific library to detect repeat sequences (via both soft and hardmasking) with RepeatMasker in the Berghia genome. Following repeat masking, we mapped all short RNA-seq reads (unfiltered) to the hardmasked version of the genome using two separate read mapping software programs, HiSat2 v2.2.1 [27] and STAR v2.7.9a [28]. This was intended to account for mapping bias in order to maximize the possibility for support in our gene annotations. For long read (IsoSeq) mapping, we first obtained Full Length Non-Concatemer (FLNC) reads from step three of the IsoSeq v3 workflow. These FLNC reads were mapped directly to the non-repeatmasked genome using minimap2 v2.22-r1105-dirty [19] with the recommended options according to PacBio (https://github.com/Magdoll/cDNA_Cupcake/wiki/Best-practice-for-aligning-Iso-Seq-to-reference-genome:-minimap2,-deSALT,-GMAP,-STAR,-BLAT). Post-mapping, sam output files were reformatted into bam files and indexed using samtools v1.11[29]. We used BRAKER v2.1.6 [30–32] for preliminary gene prediction of the Berghia genome, which uses GeneMark-EP+ v4 [33, 34], DIAMOND v2.0.8 [35], spaln v2.4.3 [36, 37], and Augustus v3.4.0 [38]. We used long and short read RNA-seq mapping results used as expression support input. We also used a protein hints file generated by combining the mollusca_odb10 database with B.stephanieae sequences identified as BUSCO hits from our initial mollusca_odb10 BUSCO run. We ran BRAKER with the additional flags --etpmode, --gff3 and --softmasking. After initial gene prediction, we generated a filtered predicted gene set using a script included with the BRAKER installation (selectSupportedSubsets.py) and the --anySupport flag to only include genes at least partially supported by hints. Both unfiltered and filtered gene prediction results are provided in Dryad (Dryad DOI TBD), but we only used the filtered set (braker_annotations_anysupport.gff3) in subsequent functional annotation and clade-specific gene analyses. For functional annotation of predicted genes, we used Protein-Protein BLAST 2.11.0+ (BLASTP) and InterProScan v.5.52-86.0. For BLASTP analyses, we used an e-value cutoff of 1e-3 with -max_target_seqs of one against three databases: (1) UniProtKB/Swiss-Prot, (2) RefSeq, and (3) Trembl (all downloaded April 2021). We then combined the hits to all three databases in a single blast annotation file. For InterProScan analyses, we used the default parameters with some additional flags, including: -goterms to look up gene ontology, -dp to disable pre-calculated match lookup, and -t p to indicate protein sequences. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) Expanse at the San Diego Supercomputer Center (SDSC) through allocation IDs TG-BIO210019 and TG-BIO210138 [39]. Assessment of clade-specific genes and their expression To determine the distribution of clade-specific genes for Berghia, we created a proteome dataset containing 47 metazoan species using both published genomes and transcriptomes (Additional file 4: Table S3). Our final dataset included 36 molluscs, including fourteen nudibranchs (five from Anthobranchia and nine from Cladobranchia). We downloaded predicted proteomes from genome datasets from MolluscDB [40]. We downloaded the transcriptomes from the NCBI Sequence Read Archive (SRA), which we then filtered with fastp v0.20.0 [1] and assembled with Trinity v2.9.1; [2]. We predicted ORFs using Transdecoder v5.5.0 (https://github.com/TransDecoder/TransDecoder), with default parameters. Our final proteomes ranged from 17,606 to 72,541 proteins (x̅ = 33,691; Additional file 4: Table S3). We identified orthologous gene families among our metazoan proteomes using the OrthoFinder package (Emms and Kelly, 2019) with default parameters and a user generated species tree as input (Additional File 11). Our user generated species tree topology was based on recent metazoan phylogenies [41–44], the MolluscDB phylogeny provided on the website [40], and recent Mollusca [45] and nudibranch [46–48] phylogenetic analyses. We then analyzed orthologous groups using the program KinFin v1.0.3 [49] to determine which predicted genes in Berghia stephanieae are clade-specific (meaning that they only cluster with sequences from a particular clade). We used the default parameters, with additional flags (--infer_singletons --plot_tree -r phylum,class,order,superfamily). To determine the expression patterns of clade-specific genes, we mapped our short read RNAseq data (from the brain, oral tentacles, rhinophores, foot, tail, and proximal and distal ceras) to the Berghia genome using STAR v2.7.9a [28] with default parameters plus additional flags (--readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopassMode Basic -sjdbGTFfeatureExon 'CDS'). We counted reads using the command htseq-count from the HTSeq framework v1.99.2 [50], which is a Python package for analysis of high-throughput sequencing data. We analyzed counts using the DESeq function from DESeq2 v1.26.0 [51] to perform differential analysis, and generated the results using the results function.

创建时间：

2023-12-22