Supplemental Data for Kogay and Zhaxybayeva (2022)
Source: DataCite Commons. Updated 2022-09-14; indexed 2024-07-29.
Download link:
https://figshare.com/articles/dataset/Supplemental_Data_for_Kogay_and_Zhaxybayeva_2022_/20082749/3
Resource description:
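## GenBank accession numbers
**208_genomes_accessions.xlsx**: List of the 208 selected alphaproteobacterial genomes with gene transfer agent (GTA) ‘head-tail’ clusters.
**g7_replacement_Sphingomonadales.pdf**: GenBank accession numbers of the putative g7 protein found in 11 *Sphingomonadales* genomes.
## Gene families in 208 genomes
**orthogroups.tsv.zip**: Gene families in 208 alphaproteobacterial genomes; the families were constructed using only genes that are at least 300 nucleotides in length. Each line in the file represents one gene family (an orthogroup). Within a line, each family member is identified by the RefSeq ID of its genome joined by an underscore to the RefSeq ID of the gene's protein sequence.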
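A minimal sketch of how one line of the orthogroup file might be split into (genome, protein) pairs. The delimiter between members and the exact accession shapes are assumptions, not taken from the dataset: the sketch assumes tab-, comma-, or space-separated members and protein accessions made of two capital letters, an underscore, digits, and a version suffix (for example `WP_000000001.1`). Because RefSeq accessions themselves contain underscores, the split is made at the underscore immediately before the protein accession rather than at the first underscore. The function name and the example IDs are hypothetical.

```python
import re

# Assumed protein accession shape, e.g. WP_000000001.1 (not confirmed by the dataset).
PROTEIN_RE = re.compile(r"_([A-Z]{2}_\d+\.\d+)$")

def parse_orthogroup_line(line):
    """Split one orthogroup line into (genome_id, protein_id) pairs."""
    members = []
    for token in re.split(r"[\t, ]+", line.strip()):
        if not token:
            continue
        match = PROTEIN_RE.search(token)
        if match is None:
            # Token does not look like genome_protein (e.g. a family label); skip it.
            continue
        genome_id = token[:match.start()]
        members.append((genome_id, match.group(1)))
    return members

# Hypothetical example line with made-up accessions.
example = "OG0000001\tGCF_000000001.1_WP_000000001.1, GCF_000000002.1_WP_000000002.1"
pairs = parse_orthogroup_line(example)
print(pairs)
```

Tokens that do not match the assumed pattern are skipped silently, so it is worth counting skipped tokens on the real file before trusting the output.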
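## GTA gene predictions
**gta_regions.xlsx**: Predicted GTA ‘head-tail’ clusters in the initial dataset of 212 genomes. The columns for individual GTA genes give their RefSeq accession numbers; an empty cell indicates that the gene was not detected in that genome. The 208 genomes that were retained for the selection analyses are highlighted in green.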
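## Effective Number of Codons (ENC) calculations
**codonW_enc_gc3s.zip**: Effective number of codons (ENC) and GC3s values for genes of at least 300 nucleotides in 208 alphaproteobacterial genomes. Each genome is represented by one file. Individual genes are identified by the RefSeq ID of their protein.
**enc_deviation_gta_genes.xlsx**: Deviation (in %) of the ENC values of the reference GTA genes in 208 genomes from the null model of no codon bias. Empty cells indicate either that a GTA gene is absent from a genome or that its observed ENC is higher than expected (and therefore unreliable, owing to the sampling of codons in a gene sequence of finite length).
**rel_enc.xlsx**: Deviation of the ENC values of the reference GTA genes in 208 genomes, normalized by the average ENC deviation of all genes in the same genome.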
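The normalization described for rel_enc.xlsx can be sketched as follows. This assumes that "normalized by" means dividing a GTA gene's ENC deviation by the mean ENC deviation of all genes in the same genome; the function name and the numbers are illustrative only and do not come from the dataset.

```python
def relative_enc_deviation(gene_deviation, genome_deviations):
    """Divide one gene's ENC deviation (%) by the genome-wide mean deviation (%)."""
    if gene_deviation is None:
        return None  # mirrors an empty cell: gene absent or ENC unreliable
    usable = [d for d in genome_deviations if d is not None]
    if not usable:
        return None
    mean_deviation = sum(usable) / len(usable)
    if mean_deviation == 0:
        return None  # ratio is undefined
    return gene_deviation / mean_deviation

# Illustrative values only.
rel = relative_enc_deviation(30.0, [10.0, 20.0, 30.0, None])
print(rel)
```

Under this reading, a value above 1 means the gene deviates from the no-bias expectation more than a typical gene in its genome does.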
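## tRNA Adaptation Index (tAI) calculations
**stAIcalc_wi.zip**: Codon adaptation indices (wi; i = 1-64) estimated by stAIcalc for genes of at least 300 nucleotides in 208 alphaproteobacterial genomes. Each genome is represented by one file. For each genome, codons from all annotated genes were combined to calculate the wi value of each codon.
**stAIcalc_tAI.zip**: tRNA adaptation index (tAI) values for genes of at least 300 nucleotides in 208 alphaproteobacterial genomes. Each genome is represented by one file, which lists the calculated tAI value for each gene in that genome. Individual genes are identified by the RefSeq ID of the genome joined by an underscore to the RefSeq ID of the gene's protein sequence.
**ptAI_gta_genes.xlsx**: Percentile tAI (ptAI) values of GTA genes that are at least 300 nucleotides long and have a broad taxonomic representation in 208 genomes. Empty cells indicate that a GTA gene is absent from a genome.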
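As a sketch of how these files relate: in the standard tAI definition, a gene's tAI is the geometric mean of the per-codon weights wi over its codons, and a percentile tAI places that value within the distribution of all genes in the genome. The code below assumes that weights are supplied as a codon-to-wi mapping, skips codons without a weight (such as stop codons), and defines the percentile as the share of genes whose tAI is less than or equal to the gene's value. The stAIcalc file layout and the exact ptAI definition used in the study are not reproduced here, and all names and numbers are hypothetical.

```python
import math

def gene_tai(cds, weights):
    """Geometric mean of codon weights over the codons of a coding sequence."""
    cds = cds.upper()
    logs = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        w = weights.get(cds[i:i + 3])
        if w is not None and w > 0:
            logs.append(math.log(w))
    if not logs:
        return None
    return math.exp(sum(logs) / len(logs))

def percentile_tai(value, genome_values):
    """Percentage of genes in the genome with tAI <= value (assumed definition)."""
    if not genome_values:
        return None
    at_or_below = sum(1 for v in genome_values if v <= value)
    return 100.0 * at_or_below / len(genome_values)

# Toy weights and sequence, not taken from stAIcalc output.
toy_weights = {"GCT": 0.25, "GCC": 1.0, "AAA": 0.5}
tai = gene_tai("GCTGCCTAA", toy_weights)  # TAA has no weight and is skipped
ptai = percentile_tai(tai, [0.2, 0.4, tai, 0.9])
print(tai, ptai)
```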
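## Phylogenetic Generalized Least Squares (PGLS) analysis
**orthogroups_PGLS.xlsx**: PGLS model fit (slope and p-value) between individual reference GTA genes and other gene families across 208 genomes. The fourteen gene families (listed in Table 1) that have a significant model fit across all reference GTA genes are highlighted in yellow.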
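## Phylogenetic analyses
**reference_aln_tree.zip**: Concatenated alignment of 29 phylogenetic marker genes found in 208 alphaproteobacterial genomes (in FASTA format; reference_alignment.fasta) and the reference phylogenomic tree reconstructed from the alignment (in Newick format; reference_tree.nwk).
**tonB_aln_tree.zip**: Alignment of the *tonB* gene homologs (gene family OG0002642) detected in alphaproteobacterial genomes (in FASTA format; tonB_alignment.fasta) and their phylogenetic relationships (in Newick format; tonB_tree.nwk).
**tonB_phylogeny.pdf**: The evolutionary history of the *tonB* gene family.
**gafA_aln_tree.zip**: Alignment of the *gafA* gene homologs (gene family OG0001218) detected in alphaproteobacterial genomes (in FASTA format; gafA_alignment.fasta) and their phylogenetic relationships (in Newick format; gafA_tree.nwk).
**gafA_tree_comparisons.pdf**: Phylogenies of *gafA*, the concatenated reference GTA genes, and the concatenated reference phylogenomic markers of GTA-containing genomes.
**ref_gta_aln_tree.zip**: Concatenated alignment of the reference GTA genes in 208 alphaproteobacterial genomes (in FASTA format; ref_gta_alignment.fasta) and the phylogenetic tree reconstructed from the alignment (in Newick format; ref_gta_tree.nwk).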
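A quick sanity check before using any of the alignments: every sequence in an aligned FASTA file should have the same length. The parser below is a minimal sketch that works on any iterable of lines (an open file handle would do). It is not tied to a particular alignment in the archives, and the toy records are made up.

```python
def read_fasta(lines):
    """Return {record_name: sequence} from an iterable of FASTA lines."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited word of the header as the name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

def is_aligned(records):
    """True if all sequences share one length (or the file is empty)."""
    return len({len(seq) for seq in records.values()}) <= 1

# Toy input; sequences may wrap over several lines.
toy = [">genomeA", "MK-LV", "AA", ">genomeB some description", "MKTLVAA"]
aln = read_fasta(toy)
print(is_aligned(aln))
```

For the real data, the same functions can be given an open handle to a file such as reference_alignment.fasta after unzipping the archive.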
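## Code
**exp_enc_deviation.py**: Python script that calculates the expected effective number of codons (ENC) from the GC3s content, and the deviation of the observed ENC from that expectation under the null model of no codon bias.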
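The script itself is not reproduced here. As a sketch of the underlying calculation, the expected ENC under no codon bias is commonly computed from GC3s with Wright's (1990) approximation, ENC_exp = 2 + s + 29 / (s² + (1 - s)²), where s is GC3s expressed as a fraction. The deviation below is the percentage by which the observed ENC falls below that expectation, and the function returns `None` when the observed value exceeds the expectation, matching the empty cells described for enc_deviation_gta_genes.xlsx. Whether exp_enc_deviation.py uses exactly this formula and sign convention is an assumption, and the function names are hypothetical.

```python
def expected_enc(gc3s):
    """Expected ENC under no codon bias, given GC3s as a fraction in [0, 1]."""
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)

def enc_deviation_percent(observed_enc, gc3s):
    """Percent by which the observed ENC falls below the expected ENC.

    Returns None when the observed ENC is higher than expected, since such
    values are treated as unreliable.
    """
    expected = expected_enc(gc3s)
    if observed_enc > expected:
        return None
    return 100.0 * (expected - observed_enc) / expected

# Illustrative numbers only.
exp_half = expected_enc(0.5)
dev = enc_deviation_percent(48.4, 0.5)
print(exp_half, dev)
```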
Provider:
figshare
Created:
2022-09-14



