
Tables_S1_to_S15_EukPhylo

DataCite Commons | Updated 2025-01-17 | Indexed 2025-05-07
Download link:
https://figshare.com/articles/dataset/Datasets_S1_to_S15_PhyloToL/26540599/2
Description:
<b>Table S1 (separate file): </b>A record of every taxon and the corresponding sequence data used in the study. Each taxon is given a unique ten-digit code that specifies its major and minor clades (column 1; see manual). The “R2G file” refers to that taxon’s “ReadyToGo” file, the output of EukPhylo part 1 containing all initially curated coding sequences for a taxon with OGs assigned. “GC3” refers to the GC content at four-fold degenerate sites; the range, minimum, and maximum refer to the value of this silent-site GC content across all transcripts/CDS in the taxon’s ReadyToGo file. Taxa with multiple accessions are cases where sequence data from all listed accessions were pooled for assembly. Putative genetic codes were determined by examining in-frame stop codon frequencies.<br><b>Table S2 (separate file): </b>A description of all “utility” scripts supplied on GitHub (https://github.com/Katzlab/EukPhylo). See main text and methods in the SI Text for more information. The script name (column 2) corresponds to the name of each script on GitHub and is accompanied by a brief description of the purpose and output of each script.<br><b>Table S3 (separate file):</b> Descriptive statistics of the OGs in the Hook Database, used as a reference for OG assignment in EukPhylo part 1. The first four columns (under the heading “OrthoMCL”) refer to all sequences in the OrthoMCL version 6.13 “core” set of OGs (excluding all “peripheral” OGs). Major and minor clade designations are as in Tables S1 and S6. The next four columns (under the heading “Hook”) refer to all sequences in the Hook Database, described in the main text (File S1). The next four columns (under the heading “R2Gs”) refer to all sequences in the ReadyToGo files (see main text, Table S1) from the 1,000 species included in our analyses, after filtering for silent-site composition as described in the methods section. 
The remaining columns describe the five most frequent terms used to annotate sequences in the “deflines_OrthoMCL-6.13.txt.gz” file provided by OrthoMCL (https://orthomcl.org/orthomcl/app/downloads/release-6.13/).<br><b>Table S4 (separate file): </b>A summary of the GO terms identified for each OG using EggNOG. See methods.<br><b>Table S5 (separate file): </b>A summary of the performance of a set of exemplar runs of EukPhylo part 1. See results.<br><b>Table S6 (separate file): </b>A summary of taxon code prefixes corresponding to “major” (first two characters) and “minor” (first five characters) clades, along with the number of species (out of 1,000 total) in the study falling in each minor clade.<br><b>Table S7 (separate file): </b>A summary of the number of species included in the study per “major” clade, and the number of whole genome assemblies vs. whole transcriptome assemblies used for each major clade.<br><b>Table S8 (separate file): </b>The file that we input to the ‘contamination loop’ of EukPhylo part 2 that defines rules for removing putative contaminant sequences based on sister relationships. Each row represents a rule: a sequence from a taxon (first column) is removed if it is sister to a sequence from the taxon in the second column and on a branch that is shorter than X times the average branch length in the tree, where X is the number in the third column. If the third column is “NA”, then there was no branch length restriction for the rule. See the methods for details.<br><b>Table S9 (separate file): </b>The file that we input to the ‘contamination loop’ of EukPhylo part 2 that defines rules for removing putative contaminant sequences based on ‘subsister’ relationships, where sequence A’s subsister is defined as the sister of A’s parent node. Sequences from a taxon in the first column were removed if their subsister belonged to the corresponding taxon in the second column. 
See the exemplar runs for details on setup.<br><b>Table S10 (separate file): </b>The rules for clade-based contamination removal of ciliate clades, primarily to mitigate contamination by parabasalids. Under these rules, any ciliate sequence was removed if it was not in a clade with at least 12 ciliates (column 3) and no more than 1 non-ciliate (column 2). The taxa listed under ‘Exceptions’ were given in the ‘exceptions’ file (see EukPhylo manual) and were not removed during this stage under any conditions.<br><b>Table S11 (separate file): </b>The rules for general clade-based contamination removal. Sequences from a taxon (column 1) were removed if they did not fall into a clade with at least a certain number of other species of that taxon (column 3) and with a limit on the number of species not belonging to the taxon that are also in the monophyletic clade (column 2). If the ‘required taxa’ column is not NA, then the clade must also include at least X species from the set of required taxa given in column 4, where X is the value given in column 5. All txt files in this table refer to lists of specific taxa that cannot be summarized by five-digit codes alone. “LKH Ciliates” refers to all taxa (rows) in Table S1 with a taxon code beginning with Sr_ci and with an accession that begins with “LKH”. Similarly, “LKH Foraminifera” refers to all taxa in Table S1 with “Foraminifera” in their taxonomy and with an accession beginning with “LKH”. These specifications, along with the “Foraminifera” (all Foraminifera) and “Cercozoa” labels, are given as paths to text files with the corresponding list of ten-digit codes when inputting the rules file to EukPhylo part 2 (see manual, and test runs on figshare for details). 
The list of codes under “Exceptions” refers to taxa that are not removed by the contamination loop under any conditions at this stage (see manual).<br><b>Table S12 (separate file): </b>This file contains several tables. Some describe the results of the EGT removal analysis (the list of target taxa for CladeGrabing.py and the identification of putative EGT, and the final list of GFs included in this analysis). Others describe the topology of all concatenated and asteroid trees in our analysis: the output of CladeSize.py, used to identify the number of clades per minor or major clade in each tree, and the details of supergroup topologies.<br><b>Table S13 (separate file): </b>A description of the taxa containing each of the 500 OGs used in this study at each stage of curation. The second column indicates whether or not the OG was removed in our “EGT removed” species tree analysis because it exhibited evidence of endosymbiotic gene transfer among photosynthetic taxa. “Prop. species” indicates the proportion of all species included in the study (N = 1,000) that contain the OG.<br><b>Table S14 (separate file): </b>A description of the ‘missing data’ at each stage in the contamination removal process for each taxon. “% gaps” indicates the proportion of the concatenated alignment (used for building the species trees given in Fig. 4) that consists of gaps, and “# OGs” the number of OGs found in each taxon. Note that in the “EGT removed” stage, the total number of OGs decreases from 500 to 331.<br><b>Table S15 (separate file): </b>A summary of all of the taxa included in the Hook Database, as seeded by data from OrthoMCL version 6.13. 
The “OrthoMCL species code” refers to the four-digit code assigned by OrthoMCL to identify the taxon, and the “Core/peripheral” designation is also given by OrthoMCL; the “Clade code” prefix was assigned by the authors and corresponds to the same clade codes as in Table S1 (though four-digit codes that are the same between these datasets do not necessarily represent the same data; see the “Accession” column in Table S1). Taxa used as “BLAST-able” taxa in development of the Hook Database (see methods in the SI appendix) are labeled in the first column.
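
The taxon-code convention (Tables S1 and S6) and the sister-based removal rule (Table S8) can be illustrated with a minimal Python sketch. This is not EukPhylo's own code. The names `SisterRule`, `split_taxon_code`, and `should_remove` are hypothetical, as are the example prefixes and thresholds. The sketch also assumes that a rule's taxon columns are matched as prefixes of the ten-character code and that the relevant branch length has already been measured on the tree; the description above does not specify either detail.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class SisterRule(NamedTuple):
    """One row of a Table S8-style rules file (field names are hypothetical)."""
    target_prefix: str          # column 1: taxon whose sequences may be removed
    sister_prefix: str          # column 2: taxon of the sister sequence
    max_ratio: Optional[float]  # column 3: X, or None where the file says "NA"


def split_taxon_code(code: str) -> Tuple[str, str]:
    """Return the (major, minor) clade prefixes of a ten-character taxon code."""
    if len(code) != 10:
        raise ValueError(f"expected a ten-character taxon code, got {code!r}")
    # Major clade: first two characters; minor clade: first five characters.
    return code[:2], code[:5]


def should_remove(
    seq_taxon: str,
    sister_taxon: str,
    branch_length: float,
    mean_branch_length: float,
    rules: Iterable[SisterRule],
) -> bool:
    """Apply sister rules: remove if any rule matches both taxa and the
    branch is shorter than X times the tree's average branch length."""
    for rule in rules:
        # Assumption: rule taxa are matched as prefixes of the taxon code.
        if not seq_taxon.startswith(rule.target_prefix):
            continue
        if not sister_taxon.startswith(rule.sister_prefix):
            continue
        if rule.max_ratio is None:
            return True  # "NA": no branch-length restriction
        if branch_length < rule.max_ratio * mean_branch_length:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical rules and taxon codes, for illustration only.
    rules = [
        SisterRule("Sr_ci", "Ex_pa", 0.5),
        SisterRule("Sr_rh", "Ba", None),
    ]
    print(split_taxon_code("Sr_ci_Abcd"))
    print(should_remove("Sr_ci_Abcd", "Ex_pa_Wxyz", 0.1, 1.0, rules))
```

Representing “NA” as `None` keeps the no-restriction case explicit rather than encoding it as a very large threshold.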

Provider:
figshare
Created:
2025-01-17