Tables_S1_to_S15_EukPhylo
Source: DataCite Commons | Updated: 2025-10-07 | Indexed: 2026-04-25
Download link:
https://figshare.com/articles/dataset/Datasets_S1_to_S15_PhyloToL/26540599/5
Description:
<b>Table S1 (separate file): </b>A record of every taxon and the corresponding sequence data used in the study. Each taxon is given a unique ten-digit code that specifies its major and minor clades (column 1; see manual). The “R2G file” refers to that taxon’s “ReadyToGo” file, the output of EukPhylo part 1 containing all initially curated coding sequences for a taxon with OGs assigned. “GC3” refers to the GC content at four-fold degenerate sites; the range, minimum, and maximum refer to the value of this silent-site GC content across all transcripts/CDS in the taxon’s ReadyToGo file. For taxa with multiple accessions, sequence data from all listed accessions were pooled for assembly. Putative genetic codes were determined by examining in-frame stop codon frequencies.<br><b>Table S2 (separate file): </b>A description of all “utility” scripts supplied in the GitHub repository (https://github.com/Katzlab/EukPhylo). See main text and methods in the SI Text for more information. The script name (column 2) corresponds to the name of each script on GitHub and is followed by a brief description of the script’s purpose and output.<br><b>Table S3 (separate file):</b> Descriptive statistics of the OGs in the Hook Database, used as a reference for OG assignment in EukPhylo part 1. The first four columns (under the heading “OrthoMCL”) refer to all sequences in the OrthoMCL version 6.13 “core” set of OGs (excluding all “peripheral” OGs). Major and minor clade designations are as in Tables S1 and S6. The next four columns (under the heading “Hook”) refer to all sequences in the Hook Database, described in the main text (File S1). The next four columns (under the heading “R2Gs”) refer to all sequences in the ReadyToGo files (see main text, Table S1) from the 1,000 species included in our analyses, after filtering for silent-site composition as described in the methods section. 
The remaining columns describe the five most frequent terms used to annotate sequences in the “deflines_OrthoMCL-6.13.txt.gz” file provided by OrthoMCL (https://orthomcl.org/orthomcl/app/downloads/release-6.13/).<br><b>Table S4 (separate file): </b>A summary of the GO terms identified for each OG using EggNOG. See methods.<br><b>Table S5 (separate file): </b>A summary of the performance of a set of exemplar runs of EukPhylo part 1 and part 2. See results.<br><b>Table S6 (separate file): </b>A summary of taxon code prefixes corresponding to “major” (first two characters) and “minor” (first five characters) clades, along with the number of species (out of 1,000 total) in the study falling in each minor clade.<br><b>Table S7 (separate file): </b>A summary of the number of species included in the study per “major” clade, and the number of whole-genome versus whole-transcriptome assemblies used for each major clade.<br><b>Table S8 (separate file): </b>The file that we input to the ‘contamination loop’ of EukPhylo part 2, which defines rules for removing putative contaminant sequences based on sister relationships. Each row represents a rule: a sequence from a taxon (first column) is removed if it is sister to a sequence from the taxon in the second column and sits on a branch shorter than X times the average branch length in the tree, where X is the number in the third column. If the third column is “NA”, the rule has no branch length restriction. See the methods for details.<br><b>Table S9 (separate file): </b>The file that we input to the ‘contamination loop’ of EukPhylo part 2, which defines rules for removing putative contaminant sequences based on ‘subsister’ relationships, where sequence A’s subsister is defined as the sister of A’s parent node. Sequences from a taxon in the first column were removed if their subsister belonged to the corresponding taxon in the second column. 
See the exemplar runs for details on setup.<br><b>Table S10 (separate file): </b>The rules for clade-based contamination removal of ciliate clades, primarily to mitigate contamination by parabasalids. Any ciliate sequence was removed if it was not in a clade with at least 12 ciliates (column 3) and no more than 1 non-ciliate (column 2). The taxa listed under ‘Exceptions’ were given in the ‘exceptions’ file (see EukPhylo manual) and were not removed during this stage under any conditions.<br><b>Table S11 (separate file): </b>The rules for general clade-based contamination removal. Sequences from a taxon (column 1) were removed if they did not fall into a clade with at least a certain number of other species of that taxon (column 3) and with a limit on the number of species not belonging to the taxon that are also in the monophyletic clade (column 2). If the ‘required taxa’ column is not NA, then the clade must also include at least X species from the set of required taxa given in column 4, where X is the value given in column 5. All txt files in this table refer to lists of specific taxa that cannot be summarized by five-digit codes alone. “LKH Ciliates” refers to all taxa (rows) in Table S1 with a taxon code beginning with Sr_ci and with an accession that begins with “LKH”. Similarly, “LKH Foraminifera” refers to all taxa in Table S1 with “Foraminifera” in their taxonomy and with an accession beginning with “LKH”. These specifications, along with the “Foraminifera” (all Foraminifera) and “Cercozoa” labels, are given as paths to text files with the corresponding list of ten-digit codes when inputting the rules file to EukPhylo part 2 (see manual, and test runs on figshare for details). 
The list of codes under “Exceptions” refers to taxa that are not removed by the contamination loop under any conditions at this stage (see manual).<br><b>Table S12 (separate file): </b>This file contains several tables. Some describe the results of the EGT removal analysis (the list of target taxa for CladeGrabing.py, the identification of putative EGT, and the final list of GFs included in this analysis). Others describe the topology of all concatenated and Asteroid trees in our analysis: the output of CladeSize.py, which identifies the number of clades per minor or major clade in each tree, and the details of supergroup topologies.<br><b>Table S13 (separate file): </b>A description of the taxa containing each of the 500 OGs used in this study at each stage of curation. The second column indicates whether or not the OG was removed in our “EGT removed” species tree analysis because it exhibited evidence of endosymbiotic gene transfer among photosynthetic taxa. “Prop. species” indicates the proportion of all species included in the study (N = 1,000) that contain the OG.<br><b>Table S14 (separate file): </b>A description of the ‘missing data’ at each stage in the contamination removal process for each taxon. “% gaps” indicates the proportion of the concatenated alignment (used for building the species trees given in Fig. 4) that consists of gaps, and “# OGs” the number of OGs found in each taxon. Note that in the “EGT removed” stage, the total number of OGs decreases from 500 to 331.<br><b>Table S15 (separate file): </b>A summary of all of the taxa included in the Hook Database, as seeded by data from OrthoMCL version 6.13. 
The “OrthoMCL species code” refers to the 4-digit code assigned by OrthoMCL to identify the taxon, and the “Core/peripheral” designation is also given by OrthoMCL. The “Clade code” prefix was assigned by the authors and corresponds to the same clade codes as in Table S1, although four-digit codes shared between these datasets do not necessarily represent the same data (see the “Accession” column in Table S1). Taxa used as “BLAST-able” taxa in development of the Hook Database (see methods in the SI appendix) are labeled in the first column.
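As a reading aid for the taxon codes (Tables S1 and S6) and the sister-based rules (Table S8), the following is a minimal Python sketch and not part of EukPhylo itself. The function names `split_taxon_code` and `sister_rule_applies`, the argument names, and the example code `Sr_ci_Xxxx` are hypothetical and chosen only for illustration; only the prefix lengths (two and five characters), the ten-character code length, and the "shorter than X times the average branch length" test come from the descriptions above. The sketch also assumes the sister relationship has already been established from the tree.

```python
from typing import NamedTuple, Optional


class TaxonCode(NamedTuple):
    major: str  # first two characters of the code, e.g. "Sr"
    minor: str  # first five characters of the code, e.g. "Sr_ci"
    full: str   # the complete ten-character taxon code


def split_taxon_code(code: str) -> TaxonCode:
    """Split a ten-character taxon code into its major and minor clade prefixes."""
    if len(code) != 10:
        raise ValueError(f"expected a ten-character taxon code, got {code!r}")
    return TaxonCode(major=code[:2], minor=code[:5], full=code)


def sister_rule_applies(
    branch_length: float,
    mean_branch_length: float,
    multiplier: Optional[float],
) -> bool:
    """Decide whether a Table S8-style rule would remove a sequence.

    Assumes the sequence is already known to be sister to a sequence from
    the taxon in the rule's second column. `multiplier` is X from the third
    column; None stands in for "NA" (no branch length restriction).
    """
    if multiplier is None:
        return True
    return branch_length < multiplier * mean_branch_length


if __name__ == "__main__":
    code = split_taxon_code("Sr_ci_Xxxx")  # hypothetical ciliate code
    print(code.major, code.minor)
    print(sister_rule_applies(0.1, 0.5, 0.5))
```

The actual rule files and their column layout are documented in the EukPhylo manual and the test runs on figshare, which should be treated as authoritative.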
Provider:
figshare
Created:
2025-10-07



