Exploring the genotype-to-phenotype map using quantifiable patterns in metazoan genomic and morphological data

NIAID Data Ecosystem, indexed 2026-05-10

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.7h44j1050


Resource description:

A prevailing problem in evolutionary biology is elucidating the “genotype-phenotype map” that characterizes how genomic activities regulate different aspects of organismal morphology and their variability in both space and time. Here, we explore potential causality between genome content and both morphological complexity and disparity by compiling the regulatory components (i.e., transcription factors, RNA-binding proteins, and microRNA families) as well as a representative set of non-regulatory “housekeeping genes” in 32 species belonging to a wide variety of animal phyla, altogether encapsulating a number of varying morphological, ecological, and genomic characteristics. A principal component analysis of these four non-overlapping genomic components from each of these 32 species in relation to their last common ancestor revealed that no relationship exists between genome space and disparity, as changes to animal body plans appear to be largely the result of changes to the gene regulatory networks that govern animal development rather than gaining or losing specific sets of regulatory genes. However, using both phylogenetically correlated as well as phylogenetically uncorrelated statistical tests, we find a strong relationship between the loss of all considered gene types in some parasitic taxa, an exacerbation of a trend that characterizes animal genomes in general. We also find a strong and likely causal relationship between microRNA innovations and organismal complexity. While this analysis of genomic features suggests how complexity and disparity are each encoded in the genome, further analysis of the regulatory networks in which they participate should provide a more comprehensive description of how organisms diversify their morphologies through time.

Materials and Methods

Gene Complements

The transcription factors for human were taken from Messina et al. (2004), and those for the fruit fly D. melanogaster were taken from FlyBase (https://flybase.org/reports/FBgg0000745.htm) (Supp. File 1; all supplemental files can be found at https://doi.org/10.5061/dryad.7h44j1050). The regulatory RNA-binding proteins for human were taken from Gerstberger et al. (2014) and for fruit fly from Gamberi et al. (2006) (Supp. File 2). Regulatory RNA-binding proteins were discerned from non-regulatory RNA-binding proteins by examining the “GO: Biological Function” for each gene and compiling those with clear regulatory roles in either co-transcriptional or post-transcriptional processing in either human or fruit fly (see Supp. File 2). Octopus transcription factors were annotated by searching the proteome with the hmmscan program from HMMER v3.3.2, using the --cut_ga option, against the list of transcription factor Pfam models (Mistry et al. 2020). The set of DNA repair proteins started from the curated lists of Wood et al. (2001, 2005) (Supp. File 3).

To compile the ancestral repertoire of regulatory and DNA repair proteins present in the bilaterian last common ancestor versus those that evolved afterwards in either the human, fruit fly, or (for the transcription factors) the octopus lineage, orthologues of each human gene were first searched using Ensembl’s “orthologue” function in three other vertebrate species, the mouse Mus musculus, the chicken Gallus gallus, and the spotted gar Lepisosteus oculatus, and any paralogues were identified using Ensembl’s “paralogue” function (see Supp. Fig. 1 for a methods flowchart). Each gene and all of its paralogues were then mapped to one of the four gnathostome-specific sub-genomes derived from one of the ~17 ancestral chordate linkage groups (Simakov et al. 2020; Lamb 2021; Nakatani et al. 2021) (Supp. File 2).
This allowed us to distinguish paralogues generated by the two whole-genome duplication events in early vertebrate history (Dehal and Boore 2005) from more ancient gene duplications potentially shared with one or more invertebrate taxa. To search for paralogues in the invertebrate taxa, we used a combination of reciprocal blastp (using the default settings at NCBI) (Tatusov et al. 1997) and phylogenetic analysis (with sequences aligned using MUSCLE and the phylogeny generated by neighbor-joining on uncorrected distances, using the default settings in MacVector v. 18.2.5) against the publicly available predicted proteins from the sequenced genomes (the assembly accession for each species is given in Supp. File 4). In addition, we searched all available non-bilaterian taxa for each of our considered sequences using reciprocal blastp. This allowed us to polarize potential gains within the Bilateria versus losses in one or more bilaterian sub-taxa.

The predicted proteins for all but three taxa were taken directly from GenBank. The predicted proteins from the acoel H. miamia were taken from Ensembl (https://metazoa.ensembl.org/Hofstenia_miamia/Info/Index) and searched using the default blastp settings. The predicted proteins from the polyclad flatworm P. crozieri (Leite et al. 2022) and the xenoturbellid X. bocki (Schiffer et al. 2023) were searched using BLAT. Multiple hits that were not clearly delineated as proteins derived from alternative splicing (e.g., those labeled isoform X1, X2, etc.) were searched against the assembled genome using the default settings in blastn. If a blastp search recovered two or more results, each result was first checked against the genome using tblastn with the default settings to make sure that each protein sequence was encoded by a separate locus. We then confirmed using reciprocal blastp that each of these protein sequences was equally related to the single query sequence.
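The reciprocal-search logic described above can be illustrated with a minimal sketch. It assumes hit tables already reduced to (query, subject, bit score) rows; the identifiers and scores below are hypothetical toy values, not data from this study, and the sketch does not reproduce the tblastn locus check or the phylogenetic confirmation.

```python
def best_hit(rows):
    """Map each query to its single best-scoring subject (highest bit score)."""
    best = {}
    for query, subject, bitscore in rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_loci(forward_rows, reverse_rows):
    """For each query, keep the forward hits whose best reverse hit is that query.

    Several subjects may be retained for one query, which corresponds to
    lineage-specific duplicates that are all related to the same query gene.
    """
    back = best_hit(reverse_rows)
    loci = {}
    for query, subject, _ in forward_rows:
        kept = loci.setdefault(query, [])
        if back.get(subject) == query and subject not in kept:
            kept.append(subject)
    return loci


# Hypothetical toy tables: (query, subject, bit score).
forward = [
    ("human_A", "inv_1", 410.0),
    ("human_A", "inv_2", 395.0),
    ("human_B", "inv_3", 220.0),
]
reverse = [
    ("inv_1", "human_A", 405.0),
    ("inv_2", "human_A", 390.0),
    ("inv_3", "human_C", 230.0),  # inv_3 is closer to another human gene
    ("inv_3", "human_B", 215.0),
]

loci = reciprocal_loci(forward, reverse)
locus_counts = {gene: len(hits) for gene, hits in loci.items()}
print(locus_counts)
```

In this toy case, two invertebrate sequences are retained for human_A and none for human_B, which mirrors the per-gene, per-taxon locus counts recorded in the study.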
The number of loci was then recorded for each gene for each taxon. All sequences for all invertebrate taxa except D. melanogaster (whose sequences are available at FlyBase) are available as supplemental materials.

To test the robustness of our orthology assignments, we compared the chromosomal locations of orthologues between vertebrates and an invertebrate, as it has recently been shown how the 17 chordate linkage groups relate to the 24 bilaterian linkage groups present in the bilaterian last common ancestor, and how these 24 bilaterian linkage groups are distributed across the chromosomes of select invertebrates including the sea urchin Lytechinus variegatus and the abalone Haliotis asinina (Simakov et al. 2022). Using tblastn, we searched the genome of L. variegatus using the protein sequences curated from the sea urchin S. purpuratus and catalogued the chromosomal location of the 435 putative orthologues whose location was specific to a single chordate linkage group (Supp. File 5). We repeated this exercise for H. asinina using the protein sequences curated from the abalone H. rufescens and catalogued 446 putative orthologues, again specific to a single chordate linkage group (Supp. File 6). Aside from the 7 orthologues present on chromosome 6 in L. variegatus and on chromosome 14 in H. asinina, which correspond to an ancestral linkage group that was lost in chordates but whose genes were dispersed throughout the chordate genome (Simakov et al. 2022), only two genes had inconsistent locations between the chordate linkage group and the bilaterian linkage group of both L. variegatus and H. asinina (Fig. 1, boxed genes) and thus represent potentially misidentified orthologues.
However, the remaining genes all demonstrate a one-to-one correspondence between the chordate linkage group and the bilaterian linkage group in at least one of the two analyzed non-chordate species. This near-complete correspondence between our results using reciprocal blast and those using synteny strongly supports the methodology underlying our orthology assignments.

The gains and losses of 765 miRNA families found in at least one of the 32 queried species were taken directly from MirGeneDB v3.0 (Clark et al. 2024) and are summarized in Supplemental File 7. Finally, the curated 954-gene BUSCO data set (Simão et al. 2015; Manni et al. 2021) was assembled for each of the 32 sampled species (BUSCO v5.4.3, Metazoa node, Supp. File 3). Twelve of these BUSCO genes overlapped either the curated set of transcription factors or the curated set of regulatory RNA-binding proteins and were removed from the BUSCO file, generating a set of 945 non-regulatory and non-redundant housekeeping genes. An additional 30 genes that encode DNA repair proteins were also removed for the analyses shown in Fig. 2D so that no gene was in both the BUSCO set and any one of the other three curated gene sets.

Genome space

The fraction of gain and loss of each of the four gene sets for each of the 32 terminal taxa was calculated with respect to the bilaterian last common ancestor. These values are given in Supplemental File 8 and were used for the principal components analysis. An alternative ancestor was also explored, one where the xenacoelomorphs (Xenoturbella and the acoel H. miamia) are nested within the deuterostomes and allied with the echinoderms and hemichordates. A principal components analysis based on this alternative ancestor found near-identical results (Supp. Fig. 2), and thus we show only the results with xenacoelomorphs reconstructed as basal bilaterians (see the text for further discussion).
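As a rough illustration of this step, the sketch below computes gain and loss fractions relative to an ancestral gene set and then ordinates a taxon-by-variable matrix by eigendecomposition of its covariance matrix. The denominator used for the fractions (the ancestral set size) is an assumption, since the exact definition is given only in the supplemental files, and the 32 x 8 matrix is a random placeholder rather than the values in Supplemental File 8. The function names are hypothetical and this is not the authors' script.

```python
import numpy as np


def gain_loss_fractions(ancestral, present):
    """Return (gain, loss) as fractions of the ancestral set size.

    Assumption: both fractions are scaled by the number of genes
    reconstructed for the ancestor.
    """
    ancestral, present = set(ancestral), set(present)
    n = len(ancestral)
    gain = len(present - ancestral) / n   # genes absent from the ancestor
    loss = len(ancestral - present) / n   # ancestral genes no longer found
    return gain, loss


def pca_from_covariance(X, n_components=3):
    """PCA via eigendecomposition of the covariance matrix (rows = taxa)."""
    Xc = X - X.mean(axis=0)                       # center each variable
    cov = np.cov(Xc, rowvar=False)                # variables x variables
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = 100.0 * eigvals / eigvals.sum()   # percent variance per axis
    scores = Xc @ eigvecs[:, :n_components]       # taxon scores on leading axes
    return scores, eigvecs[:, :n_components], explained


# Toy gene sets (hypothetical identifiers).
ancestor = {"g1", "g2", "g3", "g4"}
print(gain_loss_fractions(ancestor, {"g1", "g2", "g3", "g4", "g5"}))
print(gain_loss_fractions(ancestor, {"g1", "g2"}))

# Random placeholder for 32 taxa x 8 variables (gain and loss of four gene sets).
rng = np.random.default_rng(0)
X = rng.random((32, 8))
scores, loadings, explained = pca_from_covariance(X)
print(scores.shape, loadings.shape)
```

Working from the covariance matrix rather than the correlation matrix keeps the variables on their original scale, which is reasonable here because all eight are fractions.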
Principal components analyses were performed on the covariance matrix for the eight genomic variables using the ‘pcacov’ function in Matlab v.2024a (Mathworks Inc.) (all scripts are available as supplemental materials). Principal component scores for the first three components were calculated for each taxon from the first three eigenvectors and used in subsequent analyses. Variable loadings and percent of total variance explained are given in Table 2. Pearson product-moment correlations and linear regressions of principal components and original variables were calculated using SAS v.9.4 (SAS Institute).

Morphospace

The original Deline et al. (2018) metazoan morphological dataset consists of 212 extant terminal taxa from 34 animal phyla coded with 1,767 discrete characters. The operational taxonomic units varied in rank from family to phylum, but most represented Linnean orders. Deline et al. (2018) mapped character contingencies to analytically differentiate between missing, absent, and non-applicable characteristics. We culled the full morphological dataset to include only the operational taxonomic units corresponding to the 32 genetically characterized genera in the current study. In several instances, multiple taxa fell within the same morphological operational taxonomic unit (e.g., Homo and Mus within placental mammals); in those cases, both taxa were coded identically in terms of morphology. Even with the coarse nature of the morphological datasets, the morphology displayed at the generic level was consistent with that of the higher taxonomic group. The exceptions are D. gyrociliatus, which is morphologically distinct from its associated operational taxonomic unit (Scolecida) and whose coding was based on Martín-Durán et al. (2021), and the spider mite T. urticae, whose coding was altered from that of the tick I. scapularis based on the characters provided in OConnor (2009).
In total, we added five characters to the original dataset, either to capture autapomorphic characters within added species (i.e., D. gyrociliatus) or to differentiate distinctive taxa within broad operational taxonomic units (e.g., ticks and mites) (see Supp. Files 9 and 10). The resulting dataset of 32 taxa and 1,769 characters was analyzed following the methods of Deline et al. (2018) (Supp. File 11). The current analysis differed in methodology from the previous study only in the use of a principal-coordinate, eigenvector-based ordination rather than non-metric multidimensional scaling, so that the methods were consistent with those used for the genomic data. The resulting morphospace is consistent with that of the previous study, which found that morphospaces generated with these different ordination methods were highly correlated (Deline et al. 2018). The morphological analyses were conducted in R (4.3.2) using the cluster (Maechler 2018) and ape (Paradis and Schliep 2018) packages.

Estimation of Molecular Branch Lengths and Phylogenetic Independent Contrasts

Branch lengths were estimated from a data set of 19 concatenated protein sequences assembled for each of the 32 ingroup taxa as well as the outgroups Nematostella vectensis, Corticium candelabrum and Amphimedon queenslandica (Supp. File 12). These proteins were chosen because of their ability to capture the correct phylogeny of a small subset of canonical model system taxa (yeast (human (fly, C. elegans))) or to not strongly support an alternative arrangement (Mushegian et al. 1998; Wang et al. 1999). Further, subsequent work has shown that at least some of these proteins appear to evolve in a clock-like manner (Peterson et al. 2004, 2008). Branch lengths were calculated for each in-group taxon as well as the three outgroup taxa (see Supp. Fig. 3B) using maximum likelihood (PAUP 4.0a) with the JTT matrix, 4 rate categories, and a Gamma distribution with shape parameter = 0.5. The significance of the relationship between DNA-repair proteins and branch lengths is independent of the model of sequence evolution.

For the phylogenetically independent contrasts analyses, we used MacClade (v. 4.08a) to generate a topology that conforms to the currently accepted phylogeny of these 32 select species (Laumer et al. 2019) (Supp. File 13). Using this same alignment and topology, we used MEGA7 (Kumar et al. 2016) to estimate divergence times incorporating well-constrained calibration points (Erwin et al. 2011). Phylogenetic independent contrasts analyses and phylogenetic regressions were performed using the ape, phytools, and nlme packages of R (Revell and Harmon 2022). The use of alternative topologies (e.g., Xenacoelomorpha as ambulacrarian deuterostomes) had no significant effect on the resulting analysis.
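For readers unfamiliar with phylogenetic independent contrasts, the following is a minimal sketch of Felsenstein's contrast calculation on a fully bifurcating tree, followed by a regression through the origin. The study itself used the ape, phytools, and nlme packages of R; the tree, trait values, and function names here are hypothetical toy choices that show only the underlying arithmetic.

```python
import math


def independent_contrasts(node, trait):
    """Return (node value, adjusted branch length, standardized contrasts).

    A tip is (name, branch_length); an internal node is
    ((left, right), branch_length). Tip branch lengths must be positive.
    """
    label, length = node
    if isinstance(label, str):                   # tip: use the observed value
        return trait[label], length, []
    left, right = label
    x1, v1, c1 = independent_contrasts(left, trait)
    x2, v2, c2 = independent_contrasts(right, trait)
    contrast = (x1 - x2) / math.sqrt(v1 + v2)    # standardized contrast
    value = (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)  # weighted node estimate
    adjusted = length + v1 * v2 / (v1 + v2)      # lengthen branch for uncertainty
    return value, adjusted, c1 + c2 + [contrast]


def slope_through_origin(cx, cy):
    """Least-squares slope of cy on cx with the intercept fixed at zero."""
    return sum(a * b for a, b in zip(cx, cy)) / sum(a * a for a in cx)


# Hypothetical four-taxon tree ((A:1, B:1):1, (C:1, D:1):1).
tree = (
    (
        ((("A", 1.0), ("B", 1.0)), 1.0),
        ((("C", 1.0), ("D", 1.0)), 1.0),
    ),
    0.0,
)
x = {"A": 1.0, "B": 3.0, "C": 6.0, "D": 10.0}
y = {name: 2.0 * value + 1.0 for name, value in x.items()}  # toy linear trait

_, _, contrasts_x = independent_contrasts(tree, x)
_, _, contrasts_y = independent_contrasts(tree, y)
slope = slope_through_origin(contrasts_x, contrasts_y)
print(len(contrasts_x), slope)
```

A tree with n tips yields n - 1 contrasts, and because contrasts have an expected mean of zero the regression is forced through the origin.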

Created:

2025-11-03
