Dominant contribution of Asgard archaea to eukaryogenesis (2024) Tobiasson, V., Koonin, E. PROCESSED DATA AND METADATA

Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/14002644
Resource description:
Main data deposit for "Dominant contribution of Asgard archaea to eukaryogenesis".

Victor Tobiasson, Jacob Luo, Yuri I Wolf, Eugene V Koonin
Computational Biology Branch, Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

The origin of eukaryotes is one of the key problems in evolutionary biology. The demonstration that the Last Eukaryotic Common Ancestor (LECA) already contained the mitochondrion, an endosymbiotic organelle derived from an alphaproteobacterium, and the discovery of Asgard archaea, the closest archaeal relatives of eukaryotes, inform and constrain evolutionary scenarios of eukaryogenesis. We undertook a comprehensive analysis of the origins of the core eukaryotic genes tracing to the LECA within a rigorous statistical framework centered around evolutionary hypothesis testing using constrained phylogenetic trees. The results reveal dominant contributions of Asgard archaea to the origin of most of the conserved eukaryotic functional systems and pathways. A limited contribution from Alphaproteobacteria was identified, primarily relating to the energy transformation systems and Fe-S cluster biogenesis, whereas ancestry from other bacterial phyla was scattered across the eukaryotic functional landscape, without consistent trends. These findings suggest a model of eukaryogenesis in which key features of eukaryotic cell organization evolved in the Asgard ancestor, followed by the capture of the alphaproteobacterial endosymbiont, and augmented by numerous but sporadic horizontal acquisitions of genes from other bacteria both before and after endosymbiosis.

Version 0.3, updated 180325

Main data repository for:
Dominant contribution of Asgard archaea to eukaryogenesis (2024)
Tobiasson, V., Koonin, E.

Contains all final parsed data from the main Eukaryogenesis project investigating the evolutionary ancestries of eukaryotic protein families.

Currently (non-static) available at:
https://www.biorxiv.org/content/10.1101/2024.10.14.618318v2
https://assets-eu.researchsquare.com/files/rs-5352492/v1/2f9c68ae-cf3e-420a-8d29-867b6fb1a878.pdf

All code used to generate the data present within this repository is available at:
https://github.com/VictorTobiasson/eukgen

### General information

To identify associations between prokaryotic and eukaryotic protein families, separate hidden Markov model (HMM) databases for prokaryotes and eukaryotes were constructed using a custom, cascaded, sequence-to-profile clustering pipeline, implemented using mmseqs2, followed by a multistep data-reduction and multiple sequence alignment (MSA) procedure to generate HMM profiles using hhsuite.

A prokaryotic database of 37 million protein sequences was curated from prokaryotic genomes obtained from the NCBI GenBank in November 2023 and supplemented with proteins extracted from 146 Asgard genome assemblies. To avoid inclusion of genes present only within a narrow subset of species, possibly resulting from horizontal transfer from eukaryotes post-LECA, we reconstructed the “soft-core” pangenome for each of the 26 curated prokaryotic taxonomic classes. These pangenomes include only those genes that are present in at least 67% of the families within each class of Bacteria and Archaea. The initial eukaryotic database consisted of 30 million protein sequences from 993 species taken from EukprotV3 and cleaned using mmseqs2 to remove likely prokaryotic contaminants.

Both databases were clustered, MSAs were constructed for all non-singleton clusters, and HMM profiles were created. The resulting eukaryotic HMM dataset was queried against the prokaryotic dataset using hhblits to identify sets of homologous protein sequences. Each eukaryotic cluster and all its significant prokaryotic hits constituted an individual sequence set, hereinafter referred to as a Eukaryotic/Prokaryotic Orthologous Cluster (EPOC).

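The soft-core criterion described above (genes present in at least 67% of the families within a class) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name `soft_core`, the input layout (a dict mapping each family to the set of gene clusters found in it) and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def soft_core(family_genes, min_fraction=0.67):
    """Return the gene clusters present in at least `min_fraction` of families.

    family_genes: dict mapping a family label to the set of gene-cluster IDs
    observed in that family (hypothetical input layout).
    """
    n_families = len(family_genes)
    if n_families == 0:
        return set()
    # Count, for every gene cluster, the number of families that contain it.
    counts = Counter(g for genes in family_genes.values() for g in set(genes))
    return {g for g, c in counts.items() if c / n_families >= min_fraction}

# Toy taxonomic class with three families (hypothetical labels).
toy_class = {
    "family_A": {"geneX", "geneY", "geneZ"},
    "family_B": {"geneX", "geneY"},
    "family_C": {"geneX", "geneZ"},
}
core = soft_core(toy_class)
print(sorted(core))
```

In this toy class only `geneX` passes, because two out of three families (about 66.7%) falls just below the 0.67 cut-off; lowering `min_fraction` to 0.5 would admit all three gene clusters.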
The EPOCs constitute groups of homologous proteins from eukaryotes and prokaryotes (each EPOC contains a unique set of eukaryotic proteins, but some clusters of prokaryotic proteins can be present in multiple EPOCs) that were used for phylogenetic tree construction, annotation, and evolutionary hypothesis testing.

To infer the most likely prokaryotic ancestry of the eukaryotic proteins in each EPOC, rather than relying on the tree topology directly, we employed a probabilistic approach for evolutionary hypothesis testing using constraint trees. We exhaustively sampled all arrangements of likely sister clades and obtained Expected Likelihood Weights (ELW) for the set of possible sister clade models. As the ELW metric is analogous to model selection confidence, here we take it to be proportional to the probability of a sampled prokaryotic clade to be the true sister group of the given eukaryotic clade among a set of competing sister clades. For each EPOC, our analysis dynamically accounts for long branch outliers and is robust to phylogenetically non-homogeneous clades. This analysis is further capable of resolving eukaryotic paraphyly, treating each eukaryotic clade within an EPOC as a single datapoint for downstream analysis.

Our resulting data contain EPOCs annotated using profiles generated from KEGG Orthology Groups (KOGs), each with an MSA generated using muscle5, a maximum likelihood tree inferred using IQtree2 and associated ELW values for all candidate prokaryotic sister phyla. The analysis of prokaryotic ancestry was performed only for those eukaryotic clades that included more than 5 distinct taxonomic labels, with at least one coming from Amorphea and one from Diaphoretickes, the two expansive eukaryotic clades considered to represent either the first or the second bifurcation in the evolution of eukaryotes. Thus, these clades likely represent genes mapping back to the LECA.

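For readers unfamiliar with the ELW metric, the sketch below illustrates its general definition: within each bootstrap replicate the likelihoods of the competing models are normalized to weights, and the weights are then averaged across replicates. In this deposit the values come from IQtree2 (the c-ELW column); the function name, input layout and numbers below are assumptions for illustration only.

```python
import math

def expected_likelihood_weights(replicate_logliks):
    """Average per-replicate likelihood weights over bootstrap replicates.

    replicate_logliks[r][m] is the log-likelihood of candidate model m
    (for example, one constraint tree) in bootstrap replicate r.
    """
    if not replicate_logliks:
        raise ValueError("need at least one bootstrap replicate")
    n_models = len(replicate_logliks[0])
    totals = [0.0] * n_models
    for row in replicate_logliks:
        if len(row) != n_models:
            raise ValueError("every replicate needs one value per model")
        best = max(row)
        # Subtract the maximum before exponentiating to avoid underflow.
        rel = [math.exp(loglik - best) for loglik in row]
        norm = sum(rel)
        for m, value in enumerate(rel):
            totals[m] += value / norm
    return [t / len(replicate_logliks) for t in totals]

# Hypothetical log-likelihoods: 2 replicates x 3 competing sister-clade models.
toy = [
    [-1000.0, -1002.0, -1010.0],
    [-1001.0, -1000.5, -1012.0],
]
weights = expected_likelihood_weights(toy)
print([round(w, 3) for w in weights])
```

The weights sum to one across the competing models, which is why they can be read as relative support for each candidate sister clade.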
For further details please see the main publication or contact victor.tobiasson@nih.gov or eugene.koonin@nih.gov

### Included files

Unless otherwise stated, all files are tab-separated and UTF-8 encoded, with the first row containing header information.
All data entries encoding lists are “|” (pipe) separated.
Fields without data values are filled with string entries of “none”.

--- Databases ---
euk72_ep.tar.gz
prok2311_as.tar.gz
Prok2311As_final_clusters.tsv
Euk72Ep_final_clusters.tsv
prok2311_as.hmmDB.tar.gz
euk72_ep.hmmDB.tar.gz

--- Annotation and Curation ---
NCBI_taxonomy_species_addendum.tsv
NCBI_taxonomy_class_addendum.tsv
Euk72Ep_Prok2311As_final_classes.tsv
Euk72Ep_Prok2311As_final_classes.GTDB.tsv
KEGG_category_mapping.tsv
KEGG_metadata.tsv

--- EPOC data ---
EPOC_data.tar.gz
EPOC_annotation_KEGG.tsv
EPOC_data.tsv
EPOC_data.pangenomes_s10.tsv
EPOC_data.pangenomes_s25.tsv
EPOC_data.pangenomes_s67.tsv
EPOC_data.GTDB.tsv

# euk72_ep.tar.gz
Gzipped .tar archive containing a single directory with 10 files constituting the initial eukaryotic mmseqs2 database with taxonomy annotation. Constructed from a pre-selected list of 72 eukaryotic proteomes downloaded from NCBI as well as a “clean” version of Eukprot, lacking highly prokaryotic-like contaminant sequences.

# prok2311_as.tar.gz
Gzipped .tar archive containing a single directory with 10 files constituting the initial prokaryotic mmseqs2 database with taxonomy annotation. Constructed from 47545 complete genomes retrieved from NCBI in November 2023.

# prok2311_as.hmmDB.tar.gz
Gzipped .tar archive containing 6 files. Comprises an HHSuite database formatted from prok2311_as non-singleton clusters; contains 26286 profiles.

# euk72_ep.hmmDB.tar.gz
Gzipped .tar archive containing 6 files. Comprises an HHSuite database formatted from euk72_ep non-singleton clusters; contains 1631704 profiles.

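The file conventions stated at the top of "Included files" (tab-separated, UTF-8, header row, pipe-separated lists, “none” for missing values) can be handled with the Python standard library alone. The following is a minimal sketch; the helper names `parse_value` and `read_deposit_tsv` and the sample row are hypothetical, and a list holding a single item cannot be told apart from a plain string under these conventions.

```python
import csv
import io

def parse_value(raw):
    """Apply the deposit's conventions to one cell."""
    if raw == "none":          # missing values are stored as the string "none"
        return None
    if "|" in raw:             # list-valued entries are pipe-separated
        return raw.split("|")
    return raw

def read_deposit_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [{key: parse_value(value) for key, value in row.items()}
            for row in reader]

# In-memory stand-in for a real file; the values are invented for illustration.
sample = io.StringIO(
    "tree_name\teuk_scope\teuk_LCA\n"
    "EPOC_demo_1\tClassA|ClassB\tnone\n"
)
rows = read_deposit_tsv(sample)
print(rows[0])
```

For a file on disk, the same function can be called on `open(path, encoding="utf-8", newline="")`.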
# NCBI_taxonomy_species_addendum.tsv
Taxonomy mapping file with manually curated ‘class’ level annotation for poorly annotated species.

taxid: NCBI taxid
proposed_class_id: Manually assigned NCBI taxid
proposed_class_label: NCBI class name
org_name: NCBI organism name

# NCBI_taxonomy_class_addendum.tsv
Class revision file mapping poorly populated class level entries to higher order manually curated labels. Also includes information for small classes with shallow taxonomy which are deleted from the EPOC analysis at the level of tree construction.

taxid: NCBI taxid
ncbi_class: NCBI taxid of rank corresponding to ‘class’ following manual amendment as per NCBI_taxonomy_species_addendum.tsv
revised_class_id: Manually assigned NCBI taxid of rank corresponding to ‘class’
revised_class_label: Proposed cleartext name of manually revised revised_class_id

# Euk72Ep_Prok2311As_final_classes.tsv
Final taxonomy at NCBI rank ‘class’ following revisions for all sequences in Euk72Ep or Prok2311As. These taxonomic labels are used for EPOC tree annotation.

acc: mmseqs database header in either prok2311_as or euk72_ep databases
taxid: NCBI taxid for organism
superkingdom: Top level NCBI taxonomy classification (Bacteria, Archaea or Eukarya), used to define eukaryotic outgroups in EPOC analysis
class: Cleartext name of manually revised NCBI rank ‘class’ identifier for annotation

# Euk72Ep_Prok2311As_final_classes.GTDB.tsv
Final taxonomy at GTDB rank ‘phylum’ transferred using marker genes from GTDB release 220.

acc: mmseqs database header in either prok2311_as or euk72_ep databases
taxid: NCBI taxid for organism
superkingdom: Top level NCBI taxonomy classification (Bacteria, Archaea or Eukarya), used to define eukaryotic outgroups in EPOC analysis
class: Cleartext name of assigned GTDB phylum

# Prok2311As_final_clusters.tsv
Cluster mapping file for accessions within the initial Prok2311As database to the final clusters used for HMM creation.

cluster_acc: cluster representative
acc: cluster member

# Euk72Ep_final_clusters.tsv
Cluster mapping file for accessions within the initial Euk72Ep database to the final clusters used for HMM creation.

cluster_acc: cluster representative
acc: cluster member

# EPOC_data.tar.gz
Gzipped archive of a directory containing 16035 EPOC folders.
Each folder is named after the eukaryotic cluster representative that generated its profile; this ID matches the tree_name field in EPOC_data_prok2311As.tsv. Each folder contains the following files:

.merged.fasta: sequences for all members of the EPOC
.merged.fasta.leaf_mapping: tab-separated file containing taxonomy and tree reduction data
.merged.fasta.muscle: main cropped MSA for tree generation
.merged.fasta.muscle.iqtree: IQtree2 output from tree generation
.merged.fasta.muscle.treefile.annot: annotated newick tree file with final tree
.merged.tree_data.tsv: final parsed tree data with columns matching EPOC_data_prok2311As.tsv

EPOCs with more than one possible eukaryotic sister phylum also contain a folder "constraint_analysis" with constraint tree information used for ELW value calculation.

# EPOC_data.tsv
Main resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) based on pangenomes defined as including 10% of species per class. This is the main data to be used for generating the core dataset and for data visualisation. Contains information regarding tree breakdown, LCA membership and phylogenetic distances between all detected LCAs.
Equivalent to the stacked dataframes from all EPOC directories in EPOC_data.

tree_name: unique index for each EPOC
euk_clade_rep: unique index for each annotated eukaryotic clade within each tree_name
euk_clade_size: number of original sequences represented by euk_clade_rep
euk_clade_weight: metric for taxonomic purity for each euk_clade_rep
euk_leaf_clade: boolean indicating whether euk_clade_rep contains a single leaf
euk_LCA: lowest taxon spanning all members in euk_clade_rep
euk_scope: list of all taxonomic classes in euk_clade_rep
euk_scope_len: length of euk_scope list
prok_clade_rep: unique index for each annotated prokaryotic clade for each euk_clade_rep
prok_clade_size: number of original sequences represented by prok_clade_rep
prok_clade_weight: metric for taxonomic purity for each prok_clade_rep
prok_leaf_clade: boolean indicating whether prok_clade_rep contains a single leaf
prok_taxa: lowest taxon spanning all members in prok_clade_rep
dist: tree-distance from lowest tree node containing all members of prok_clade_rep to lowest tree node containing all members of euk_clade_rep
top_dist: graph-distance (node-distance) from lowest tree node containing all members of prok_clade_rep to lowest tree node containing all members of euk_clade_rep
raw_stem_length: tree-distance from lowest tree node containing the union of all members of prok_clade_rep and euk_clade_rep to the tree node containing all members of euk_clade_rep
median_euk_leaf_dist: median value for all tree distances from the tree node containing all members of euk_clade_rep to the individual leaves
stem_length: raw_stem_length/median_euk_leaf_dist
logL: log likelihood of best constraint tree constructed
deltaL: log likelihood difference between constraint tree for prok_clade_rep and best constraint tree constructed
bp-RELL: validation metric from IQtree -trees, see iqtree.org
bp-RELL_accept: as above
p-KH: as above
p-KH_accept: as above
p-SH: as above
p-SH_accept: as above
c-ELW: as above
c-ELW_accept: as above
p-AU: as above
p-AU_accept: as above

# EPOC_data.pangenomes_s10.tsv
Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated based on pangenomes defined as including 10% of species per class.
Identical file structure to EPOC_data.tsv

# EPOC_data.pangenomes_s25.tsv
Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated based on pangenomes defined as including 25% of species per class.
Identical file structure to EPOC_data.tsv

# EPOC_data.pangenomes_s67.tsv
Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated based on pangenomes defined as including 67% of species per class.
Identical file structure to EPOC_data.tsv

# EPOC_data.GTDB.tsv
Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated under revised taxonomy from GTDB based on data from Euk72Ep_Prok2311As_final_classes.GTDB.tsv
Identical file structure to EPOC_data.tsv

# EPOC_data.alpha_replicates.tsv
Resulting data from 20 repetitions of Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated from a subset of Alphaproteobacterial-derived EPOCs.
Identical file structure to EPOC_data.tsv with the addition of:

rep: indicating technical replicate number, 0-19

# EPOC_annotation_KEGG.tsv
Parsed HHblits output of HMM profiles generated from KEGG KOGs (KEGG Orthologous Groups) against eukaryotic profiles constituting each EPOC.

Query: query name equal to tree_name from EPOC_data
Target: target name equal to kogid in KEGG_category_mapping and KEGG_metadata
Prob: data from HHblits, see https://github.com/soedinglab/hh-suite/wiki
E-value: as above
P-value: as above
Score: as above
SS: as above
Cols: as above
Identities: as above
Similarity: as above
Sum_probs: as above
Query-HMM-start: as above
Query-HMM-end: as above
Template-HMM-start: as above
Template-HMM-end: as above
Template_columns: as above
Template_Neff: as above
Pairwise_cov: calculated pairwise coverage from Query and Target start and end
Description: category_name from KEGG_category_mapping

# KEGG_category_mapping.tsv
Mapping of relevant KOG identifiers to their higher order categories as "Maps", "Modules" or "Reactions" as per KEGG, see https://www.kegg.jp/kegg/pathway.html

kogid: unique KOG identifier
category_id: KEGG map, module, or reaction number
category_name: cleartext name for KOG identifier

# KEGG_metadata.tsv
File mapping KOGs to BRITE classification and to additional databases of chemical properties.

kogid: unique KOG identifier
name: cleartext name for KOG identifier
brite_A: list of BRITE-A sets including KOG
brite_B: list of BRITE-B sets including KOG
brite_C: list of BRITE-C sets including KOG
EC: list of Enzyme Commission numbers associated with KOG, see https://enzyme.expasy.org/
TC: list of transporter classification numbers associated with KOG, see https://www.tcdb.org/
RN: list of KEGG reaction numbers associated with KOG
CA: list of CAZY numbers associated with KOG, see http://www.cazy.org/
GO: list of GO terms associated with KOG, see https://geneontology.org/
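
As a usage illustration for the EPOC_data.tsv columns, the sketch below keeps, for each eukaryotic clade, the candidate prokaryotic sister with the highest c-ELW and recomputes stem_length from its two components. Picking the single top-weighted candidate is a simplification chosen here for brevity, not necessarily how the authors aggregate ELW values; the function names and all row values are hypothetical.

```python
def best_sister_per_clade(rows):
    """Keep, per (tree_name, euk_clade_rep), the row with the highest c-ELW."""
    best = {}
    for row in rows:
        if row["c-ELW"] == "none":      # skip rows without an ELW value
            continue
        key = (row["tree_name"], row["euk_clade_rep"])
        if key not in best or float(row["c-ELW"]) > float(best[key]["c-ELW"]):
            best[key] = row
    return best

def stem_length(row):
    """stem_length = raw_stem_length / median_euk_leaf_dist, as documented."""
    return float(row["raw_stem_length"]) / float(row["median_euk_leaf_dist"])

# Hypothetical rows, kept as strings as they would be read from the TSV.
toy_rows = [
    {"tree_name": "t1", "euk_clade_rep": "e1", "prok_taxa": "Asgard_demo",
     "c-ELW": "0.71", "raw_stem_length": "0.4", "median_euk_leaf_dist": "0.8"},
    {"tree_name": "t1", "euk_clade_rep": "e1", "prok_taxa": "Alpha_demo",
     "c-ELW": "0.29", "raw_stem_length": "0.6", "median_euk_leaf_dist": "0.8"},
    {"tree_name": "t1", "euk_clade_rep": "e2", "prok_taxa": "Other_demo",
     "c-ELW": "none", "raw_stem_length": "0.2", "median_euk_leaf_dist": "0.5"},
]
best = best_sister_per_clade(toy_rows)
for (tree, clade), row in best.items():
    print(tree, clade, row["prok_taxa"], stem_length(row))
```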
Created:
2025-03-22