Dominant contribution of Asgard archaea to eukaryogenesis (2024) Tobiasson, V., Koonin, E. PROCESSED DATA AND METADATA

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14002644

Resource description:

Main data deposit for "Dominant contribution of Asgard archaea to eukaryogenesis".

Victor Tobiasson, Jacob Luo, Yuri I Wolf, Eugene V Koonin
Computational Biology Branch, Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

The origin of eukaryotes is one of the key problems in evolutionary biology. The demonstration that the Last Eukaryotic Common Ancestor (LECA) already contained the mitochondrion, an endosymbiotic organelle derived from an alphaproteobacterium, and the discovery of Asgard archaea, the closest archaeal relatives of eukaryotes, inform and constrain evolutionary scenarios of eukaryogenesis. We undertook a comprehensive analysis of the origins of the core eukaryotic genes tracing to the LECA within a rigorous statistical framework centered around evolutionary hypothesis testing using constrained phylogenetic trees. The results reveal dominant contributions of Asgard archaea to the origin of most of the conserved eukaryotic functional systems and pathways. A limited contribution from Alphaproteobacteria was identified, primarily relating to the energy transformation systems and Fe-S cluster biogenesis, whereas ancestry from other bacterial phyla was scattered across the eukaryotic functional landscape, without consistent trends. These findings suggest a model of eukaryogenesis in which key features of eukaryotic cell organization evolved in the Asgard ancestor, followed by the capture of the Alphaproteobacterial endosymbiont, and augmented by numerous but sporadic horizontal acquisitions of genes from other bacteria both before and after endosymbiosis.

Version 0.3, updated 180325

Main data repository for: Dominant contribution of Asgard archaea to eukaryogenesis (2024) Tobiasson, V., Koonin, E.

Contains all final parsed data from the main Eukaryogenesis project investigating the evolutionary ancestries of eukaryotic protein families.
Currently (non-static) available at:
https://www.biorxiv.org/content/10.1101/2024.10.14.618318v2
https://assets-eu.researchsquare.com/files/rs-5352492/v1/2f9c68ae-cf3e-420a-8d29-867b6fb1a878.pdf

All code used to generate the data present within this repository is available at: https://github.com/VictorTobiasson/eukgen

### General information

To identify associations between prokaryotic and eukaryotic protein families, separate hidden Markov model (HMM) databases for prokaryotes and eukaryotes were constructed using a custom, cascaded, sequence-to-profile clustering pipeline, implemented using mmseqs2, followed by a multistep data-reduction and multiple sequence alignment (MSA) procedure to generate HMM profiles using hhsuite.

A prokaryotic database of 37 million protein sequences was curated from prokaryotic genomes obtained from the NCBI GenBank in November 2023 and supplemented with proteins extracted from 146 Asgard genome assemblies. To avoid inclusion of genes present only within a narrow subset of species, possibly resulting from horizontal transfer from eukaryotes post-LECA, we reconstructed the “soft-core” pangenome for each of the 26 curated prokaryotic taxonomic classes. These pangenomes include only those genes that are present in at least 67% of the families within each class of Bacteria and Archaea.

The initial eukaryotic database consisted of 30 million protein sequences from 993 species taken from EukprotV3 and cleaned using mmseqs2 to remove likely prokaryotic contaminants.

Both databases were clustered, MSAs were constructed for all non-singleton clusters, and HMM profiles were created. The resulting eukaryotic HMM dataset was queried against the prokaryotic dataset using hhblits to identify sets of homologous protein sequences. Each eukaryotic cluster and all its significant prokaryotic hits constituted an individual sequence set, hereinafter referred to as a Eukaryotic/Prokaryotic Orthologous Cluster (EPOC).
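
The soft-core pangenome filter described above can be illustrated with a minimal sketch. This is not the code used in the project (see the eukgen repository for that); the function name, the three input dictionaries, and the toy data are assumptions made purely for illustration. The threshold is a parameter so that the same function covers the 10%, 25% and 67% variants mentioned for the pangenome files.

```python
from collections import defaultdict


def soft_core_genes(presence, family_of, class_of, min_fraction=0.67):
    """Return, per class, the gene clusters found in at least
    `min_fraction` of the families of that class.

    All three inputs are hypothetical structures for this sketch:
      presence:  genome id -> set of gene-cluster ids found in that genome
      family_of: genome id -> taxonomic family label
      class_of:  genome id -> taxonomic class label
    """
    families_in_class = defaultdict(set)
    families_with_gene = defaultdict(lambda: defaultdict(set))

    for genome, genes in presence.items():
        cls, fam = class_of[genome], family_of[genome]
        families_in_class[cls].add(fam)
        for gene in genes:
            families_with_gene[cls][gene].add(fam)

    core = {}
    for cls, families in families_in_class.items():
        n_families = len(families)
        core[cls] = {
            gene
            for gene, fams in families_with_gene[cls].items()
            if len(fams) / n_families >= min_fraction
        }
    return core


# Toy data: one class with three families, one genome per family.
toy_presence = {"a": {"g1", "g2"}, "b": {"g1", "g2"}, "c": {"g1", "g3"}}
toy_family = {"a": "f1", "b": "f2", "c": "f3"}
toy_class = {"a": "A", "b": "A", "c": "A"}
toy_core = soft_core_genes(toy_presence, toy_family, toy_class)
```

In the toy data only `g1` occurs in all three families; `g2` occurs in two of three (about 66.7%) and therefore falls just below the 67% cutoff.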
The EPOCs constitute groups of homologous proteins from eukaryotes and prokaryotes (each EPOC contains a unique set of eukaryotic proteins, but some clusters of prokaryotic proteins can be present in multiple EPOCs) that were used for phylogenetic tree construction, annotation, and evolutionary hypothesis testing.

To infer the most likely prokaryotic ancestry of the eukaryotic proteins in each EPOC, rather than relying on the tree topology directly, we employed a probabilistic approach for evolutionary hypothesis testing using constraint trees. We exhaustively sampled all arrangements of likely sister clades and obtained Expected Likelihood Weights (ELW) for the set of possible sister clade models. As the ELW metric is analogous to model selection confidence, here we take it to be proportional to the probability that a sampled prokaryotic clade is the true sister group of the given eukaryotic clade among a set of competing sister clades. For each EPOC, our analysis dynamically accounts for long branch outliers and is robust to phylogenetically non-homogeneous clades. This analysis is further capable of resolving eukaryotic paraphyly, treating each eukaryotic clade within an EPOC as a single datapoint for downstream analysis.

Our resulting data contain EPOCs annotated using profiles generated from KEGG Orthology Groups (KOGs), each with an MSA generated using muscle5, a maximum likelihood tree inferred using IQtree2 and associated ELW values for all candidate prokaryotic sister phyla. The analysis of prokaryotic ancestry was performed only for those eukaryotic clades that included more than 5 distinct taxonomic labels, with at least one coming from Amorphea and one from Diaphoretickes, the two expansive eukaryotic clades considered to represent either the first or the second bifurcation in the evolution of eukaryotes. Thus, these clades likely represent genes mapping back to the LECA.
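
For readers unfamiliar with the ELW metric, the following minimal sketch shows its general definition: within each bootstrap replicate, every competing tree receives a normalised likelihood weight, and the weights are then averaged over replicates. The values in this deposit were produced by IQtree2, not by this code; the input layout (one list of log-likelihoods per replicate) and the numbers in the example are assumptions for illustration only.

```python
import math


def expected_likelihood_weights(replicate_logls):
    """Average normalised likelihood weights over bootstrap replicates.

    replicate_logls: list of replicates; each replicate is a list holding
    one log-likelihood per competing constraint tree (hypothetical input).
    Returns one weight per tree; the weights sum to 1.
    """
    n_models = len(replicate_logls[0])
    totals = [0.0] * n_models
    for logls in replicate_logls:
        # Subtract the maximum before exponentiating to avoid underflow.
        best = max(logls)
        rel = [math.exp(value - best) for value in logls]
        norm = sum(rel)
        for i, r in enumerate(rel):
            totals[i] += r / norm
    return [total / len(replicate_logls) for total in totals]


# Two competing sister-clade models scored on two made-up replicates.
weights = expected_likelihood_weights([[-100.0, -102.0], [-101.0, -100.0]])
```

Because the weights are normalised within each replicate, they always sum to one across the competing models, which is what allows them to be read as relative support for each candidate sister clade.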
For further details please see the main publication or contact victor.tobiasson@nih.gov or eugene.koonin@nih.gov

### Included files

Unless otherwise stated, all files are tab-separated and UTF-8 encoded, with the first row containing header information. All data entries encoding lists are “|” (pipe) separated. Fields without data values are filled with string entries of “none”.

--- Databases ---
euk72_ep.tar.gz
prok2311_as.tar.gz
Prok2311As_final_clusters.tsv
Euk72Ep_final_clusters.tsv
prok2311_as.hmmDB.tar.gz
euk72_ep.hmmDB.tar.gz

--- Annotation and Curation ---
NCBI_taxonomy_species_addendum.tsv
NCBI_taxonomy_class_addendum.tsv
Euk72Ep_Prok2311As_final_classes.tsv
Euk72Ep_Prok2311As_final_classes.GTDB.tsv
KEGG_category_mapping.tsv
KEGG_metadata.tsv

--- EPOC data ---
EPOC_data.tar.gz
EPOC_annotation_KEGG.tsv
EPOC_data.tsv
EPOC_data.pangenomes_s10.tsv
EPOC_data.pangenomes_s25.tsv
EPOC_data.pangenomes_s67.tsv
EPOC_data.GTDB.tsv

# euk72_ep.tar.gz
Gzip-compressed .tar archive containing a single directory with 10 files constituting the initial eukaryotic mmseqs2 database with taxonomy annotation. Constructed from a pre-selected list of 72 eukaryotic proteomes downloaded from NCBI as well as a “clean” version of Eukprot, lacking highly prokaryotic-like contaminant sequences.

# prok2311_as.tar.gz
Gzip-compressed .tar archive containing a single directory with 10 files constituting the initial prokaryotic mmseqs2 database with taxonomy annotation. Constructed from 47545 complete genomes retrieved from NCBI in November 2023.

# prok2311_as.hmmDB.tar.gz
Gzip-compressed .tar archive containing 6 files. Comprises an HHSuite database formatted from prok2311_as non-singleton clusters; contains 26286 profiles.

# euk72_ep.hmmDB.tar.gz
Gzip-compressed .tar archive containing 6 files. Comprises an HHSuite database formatted from euk72_ep non-singleton clusters; contains 1631704 profiles.
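
The file conventions stated under "Included files" (tab-separated, header in the first row, “|”-separated lists, “none” for missing values) can be handled with the Python standard library alone. The sketch below is one possible reader, not part of the deposit; the function name is hypothetical, and the example row is made up, although its column names (tree_name, euk_scope, prok_taxa) are taken from the EPOC_data.tsv description.

```python
import csv
import io


def read_deposit_tsv(handle, list_columns=()):
    """Read a tab-separated file with a header row.

    Cells equal to the string "none" become None; cells in
    `list_columns` are split on the pipe character into lists.
    """
    rows = []
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        parsed = {}
        for key, value in row.items():
            if value == "none":
                parsed[key] = None
            elif key in list_columns:
                parsed[key] = value.split("|")
            else:
                parsed[key] = value
        rows.append(parsed)
    return rows


# Made-up example content; real files would be opened with
# open(path, encoding="utf-8", newline="").
example = io.StringIO(
    "tree_name\teuk_scope\tprok_taxa\n"
    "EPOC_1\tclassA|classB\tnone\n"
)
rows = read_deposit_tsv(example, list_columns={"euk_scope"})
```

All values are returned as strings (or lists of strings); numeric columns such as distances or likelihoods would still need to be converted by the caller.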

# NCBI_taxonomy_species_addendum.tsv
Taxonomy mapping file with manually curated ‘class’ level annotation for poorly annotated species.
taxid: NCBI taxid
proposed_class_id: Manually assigned NCBI taxid
proposed_class_label: NCBI class name
org_name: NCBI organism name

# NCBI_taxonomy_class_addendum.tsv
Class revision file mapping poorly populated class level entries to higher order manually curated labels. Also includes information for small classes with shallow taxonomy which are deleted from the EPOC analysis at the level of tree construction.
taxid: NCBI taxid
ncbi_class: NCBI taxid of rank corresponding to ‘class’ following manual amendment as per NCBI_taxonomy_species_addendum.tsv
revised_class_id: Manually assigned NCBI taxid of rank corresponding to ‘class’
revised_class_label: Proposed cleartext name of manually revised revised_class_id

# Euk72Ep_Prok2311As_final_classes.tsv
Final taxonomy at NCBI rank ‘class’ following revisions for all sequences in Euk72Ep or Prok2311As. These taxonomic labels are used for EPOC tree annotation.
acc: mmseqs database header in either prok2311_as or euk72_ep databases
taxid: NCBI taxid for organism
superkingdom: Top level NCBI taxonomy classification Bacteria, Archaea or Eukarya, used to define Eukaryotic outgroups in EPOC analysis
class: Cleartext name of manually revised NCBI rank ‘class’ identifier for annotation

# Euk72Ep_Prok2311As_final_classes.GTDB.tsv
Final taxonomy at GTDB rank ‘phylum’ transferred using marker genes from GTDB release 220
acc: mmseqs database header in either prok2311_as or euk72_ep databases
taxid: NCBI taxid for organism
superkingdom: Top level NCBI taxonomy classification Bacteria, Archaea or Eukarya, used to define Eukaryotic outgroups in EPOC analysis
class: Cleartext name of assigned GTDB phylum

# Prok2311As_final_clusters.tsv
Cluster mapping file for accessions within the initial Prok2311As database to the final clusters used for HMM creation
cluster_acc: cluster representative
acc: cluster member

# Euk72Ep_final_clusters.tsv
Cluster mapping file for accessions within the initial Euk72Ep database to the final clusters used for HMM creation
cluster_acc: cluster representative
acc: cluster member

# EPOC_data.tar.gz
Gzip-compressed archive of a directory containing 16035 EPOC folders.
Each folder is named after the eukaryotic cluster representative that generated its profile, which serves as its ID and matches the tree_name field in EPOC_data_prok2311As.tsv. Each folder contains the following files:
.merged.fasta: sequences for all members of the EPOC
.merged.fasta.leaf_mapping: tab-separated file containing taxonomy and tree reduction data
.merged.fasta.muscle: main cropped MSA for tree generation
.merged.fasta.muscle.iqtree: IQtree2 output from tree generation
.merged.fasta.muscle.treefile.annot: annotated newick tree file with final tree
.merged.tree_data.tsv: final parsed tree data with columns matching EPOC_data_prok2311As.tsv
EPOCs with more than one possible eukaryotic sister phylum also contain a folder "constraint_analysis" with constraint tree information used for ELW value calculation.

# EPOC_data.tsv
Main resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) based on pangenomes defined as including 10% of species per class. This is the main data to be used for generating the core dataset and for data visualisation. Contains information regarding tree breakdown, LCA membership and phylogenetic distances between all detected LCAs.
Equivalent to the stacked dataframes from all EPOC directories in EPOC_data.
tree_name: unique index for each EPOC
euk_clade_rep: unique index for each annotated eukaryotic clade within each tree_name
euk_clade_size: number of original sequences represented by euk_clade_rep
euk_clade_weight: metric for taxonomic purity for each euk_clade_rep
euk_leaf_clade: boolean indicating whether euk_clade_rep contains a single leaf
euk_LCA: lowest taxon spanning all members in euk_clade_rep
euk_scope: list of all taxonomic classes in euk_clade_rep
euk_scope_len: length of euk_scope list
prok_clade_rep: unique index for each annotated prokaryotic clade for each euk_clade_rep
prok_clade_size: number of original sequences represented by prok_clade_rep
prok_clade_weight: metric for taxonomic purity for each prok_clade_rep
prok_leaf_clade: boolean indicating whether prok_clade_rep contains a single leaf
prok_taxa: lowest taxon spanning all members in prok_clade_rep
dist: tree-distance from lowest tree node containing all members of prok_clade_rep to lowest tree node containing all members of euk_clade_rep
top_dist: graph-distance (node-distance) from lowest tree node containing all members of prok_clade_rep to lowest tree node containing all members of euk_clade_rep
raw_stem_length: tree-distance from lowest tree node containing the union of all members of prok_clade_rep and euk_clade_rep to the tree node containing all members of euk_clade_rep
median_euk_leaf_dist: median value for all tree distances from the tree node containing all members of euk_clade_rep to the individual leaves
stem_length: raw_stem_length/median_euk_leaf_dist
logL: log likelihood of best constraint tree constructed
deltaL: log likelihood difference between constraint tree for prok_clade_rep and best constraint tree constructed
bp-RELL: validation metric from IQtree -trees, see iqtree.org
bp-RELL_accept: as above
p-KH: as above
p-KH_accept: as above
p-SH: as above
p-SH_accept: as above
c-ELW: as above
c-ELW_accept: as above
p-AU: as above
p-AU_accept: as above

# EPOC_data.pangenomes_s10.tsv
Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated based on pangenomes defined as including 10% of species per class. Identical file structure to EPOC_data.tsv

# EPOC_data.pangenomes_s25.tsv
Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated based on pangenomes defined as including 25% of species per class. Identical file structure to EPOC_data.tsv

# EPOC_data.pangenomes_s67.tsv
Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated based on pangenomes defined as including 67% of species per class. Identical file structure to EPOC_data.tsv

# EPOC_data.GTDB.tsv
Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated under revised taxonomy from GTDB based on data from Euk72Ep_Prok2311As_final_classes.GTDB.tsv. Identical file structure to EPOC_data.tsv

# EPOC_data.alpha_replicates.tsv
Resulting data from 20 repetitions of Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated from a subset of Alphaproteobacterial-derived EPOCs.
Identical file structure to EPOC_data.tsv with the addition of:
rep: technical replicate number, 0-19

# EPOC_annotation_KEGG.tsv
Parsed HHblits output of HMM profiles generated from KEGG KOGs (KEGG Orthologous Groups) against eukaryotic profiles constituting each EPOC
Query: query name equal to tree_name from EPOC_data
Target: target name equal to kogid in KEGG_category_mapping and KEGG_metadata
Prob: data from HHblits, see https://github.com/soedinglab/hh-suite/wiki
E-value: as above
P-value: as above
Score: as above
SS: as above
Cols: as above
Identities: as above
Similarity: as above
Sum_probs: as above
Query-HMM-start: as above
Query-HMM-end: as above
Template-HMM-start: as above
Template-HMM-end: as above
Template_columns: as above
Template_Neff: as above
Pairwise_cov: calculated pairwise coverage from Query and Target start and end
Description: category_name from KEGG_category_mapping

# KEGG_category_mapping.tsv
Mapping of relevant KOG identifiers to their higher order categories as "Maps", "Modules" or "Reactions" as per KEGG; see https://www.kegg.jp/kegg/pathway.html
kogid: unique KOG identifier
category_id: KEGG map, module, or reaction number
category_name: cleartext name for KOG identifier

# KEGG_metadata.tsv
File mapping KOGs to BRITE classification and to additional databases of chemical properties.
kogid: unique KOG identifier
name: cleartext name for KOG identifier
brite_A: list of BRITE-A sets including KOG
brite_B: list of BRITE-B sets including KOG
brite_C: list of BRITE-C sets including KOG
EC: list of Enzyme Commission numbers associated with KOG, see https://enzyme.expasy.org/
TC: list of transporter classification numbers associated with KOG, see https://www.tcdb.org/
RN: list of KEGG reaction numbers associated with KOG
CA: list of CAZY numbers associated with KOG, see http://www.cazy.org/
GO: list of GO terms associated with KOG, see https://geneontology.org/
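
Since Target in EPOC_annotation_KEGG.tsv equals kogid in KEGG_category_mapping.tsv, the two tables can be joined to list the KEGG categories hit by each EPOC. The sketch below assumes the rows have already been read into dictionaries keyed by the column names given above; the function name and the example identifiers are placeholders, not values from the deposit.

```python
from collections import defaultdict


def categories_per_epoc(annotation_rows, mapping_rows):
    """Join annotation rows to category rows on Target == kogid.

    Returns a dict: Query (tree_name) -> list of
    (category_id, category_name) tuples, without duplicates.
    """
    by_kog = defaultdict(list)
    for row in mapping_rows:
        by_kog[row["kogid"]].append((row["category_id"], row["category_name"]))

    result = defaultdict(list)
    for row in annotation_rows:
        # A KOG can belong to several maps, modules or reactions.
        for category in by_kog.get(row["Target"], []):
            if category not in result[row["Query"]]:
                result[row["Query"]].append(category)
    return dict(result)


# Placeholder rows for illustration only.
annotation = [
    {"Query": "EPOC_1", "Target": "KOG_x"},
    {"Query": "EPOC_1", "Target": "KOG_y"},
    {"Query": "EPOC_2", "Target": "KOG_unmapped"},
]
mapping = [
    {"kogid": "KOG_x", "category_id": "map_1", "category_name": "Example pathway"},
    {"kogid": "KOG_y", "category_id": "map_1", "category_name": "Example pathway"},
    {"kogid": "KOG_y", "category_id": "mod_2", "category_name": "Example module"},
]
joined = categories_per_epoc(annotation, mapping)
```

EPOCs whose hits have no entry in the mapping table (here `EPOC_2`) are simply absent from the result; a caller who needs them could add empty lists explicitly.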

Created:

2025-03-22