Supplementary material for "Evolution of sequence-diverse disordered regions in a protein family: order within the chaos"
Source: DataCite Commons. Updated 2024-12-02; indexed 2025-04-16.
Download link:
https://opal.latrobe.edu.au/articles/dataset/Supplementary_material_for_Evolution_of_sequence-diverse_disordered_regions_in_a_protein_family_order_within_the_chaos_/11775024/1
Description:
**A set of supplementary data files**
*Accompanying publication:* Evolution of sequence-diverse disordered regions in a protein family: order within the chaos
**Supp data file 1**
Excel file with names and annotation information for the 2644 fasciclin domains. To keep names short for phylogenies, FLAs are given arbitrary identifier numbers, and the fasciclin domains within them are indicated in the name (e.g. “>X1234_FLA.2.3” -> Fasciclin domain cluster 1, arbitrary FLA identifier number 1234, FLA fasciclin domain 2 out of 3). Numbers and colours are given for fasciclin, AG, non-AG and inter-proline clusters.
*Fields:*
- `name` = sequence name (constructed as: G[fas.clust] X[number] fas [fas.count] of [fas.max]; also used in alignments and phylogenies).
- `number` = arbitrary ID number for the FLA sequence.
- `Accession` = Phytosome gene sequence ID for the FLA sequence.
- `fas.count` = which fasciclin domain this is within the FLA sequence.
- `fas.max` = total number of fasciclin domains in the FLA sequence.
- `fas.clust(PCA)` = initial cluster based on PCA+MClust of the fasciclin domain sequence.
- `fas.clust` = cluster based on UMAP+HDBSCAN of the fasciclin domain sequence (0=no cluster assigned, 1=type A, 2=type B, etc.).
- `agreg.clust` = cluster based on UMAP+HDBSCAN of arabinogalactan regions (0=no cluster assigned, 1=type a, 2=type b, etc.).
- `nagreg.clust` = cluster based on UMAP+HDBSCAN of non-arabinogalactan, non-fasciclin regions (0=no cluster assigned, 1=type a, 2=type b, etc.).
- `interP.clust` = cluster based on UMAP+HDBSCAN of inter-proline distance (0=no cluster assigned, 1=type a, 2=type b, etc.).
- `genus & species` = taxonomy of the organism containing the sequence.
- `tax.name` = broad taxonomic group of the organism containing the sequence (not necessarily a monophyletic group).
- `[x].col` = colour used in diagrams for sequences in that cluster.
- `ngly.site.[x]` = Boolean (true/false) indicating whether the sequence contains an N-glycosylation motif at that position in the sequence. Includes a number to indicate the domain within a FLA with 2 fasciclin domains (see figs 2 & S8 for positions).
**Supp data file 2**
Multiple sequence alignments as fasta files for all 2644 fasciclin domains, as well as separately for each cluster A-R.
*Naming:*
- Sequence names = sequence name (constructed as: G[fas.clust] X[number] fas [fas.count] of [fas.max]; see supp data file 1 fields).
- File names = cluster based on UMAP+HDBSCAN of the fasciclin domain sequence (0=no cluster assigned, 1=type A, 2=type B, etc.).
**Supp data file 3**
Phylogenies as newick files for all 2644 fasciclin domains, as well as separately for each cluster A-R.
*Naming:*
- Sequence names = sequence name (constructed as: G[fas.clust] X[number] fas [fas.count] of [fas.max]; see supp data file 1 fields).
- File names = cluster based on UMAP+HDBSCAN of the fasciclin domain sequence (0=no cluster assigned, 1=type A, 2=type B, etc.).
**Supp data file 4**
An R script to perform the analyses shown in the publication. See also the GitHub repo TS404/FLAnnotator.
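The cluster codes and the naming scheme described above can be decoded with a few lines of Python. This is only an illustrative sketch and is not taken from the authors' R script. The helper names `cluster_label` and `parse_domain_name` are hypothetical. The name pattern is an assumption based solely on the single example name quoted in the description (“>X1234_FLA.2.3”), so the actual files may use different separators or include the cluster prefix.

```python
import re
import string
from typing import NamedTuple, Optional


def cluster_label(code: int, lowercase: bool = False) -> Optional[str]:
    """Map a numeric cluster code to its letter: 0 -> None, 1 -> 'A', 2 -> 'B', ..."""
    if code == 0:
        return None  # 0 means no cluster was assigned
    if not 1 <= code <= 26:
        raise ValueError(f"cluster code out of range: {code}")
    letters = string.ascii_lowercase if lowercase else string.ascii_uppercase
    return letters[code - 1]


class DomainName(NamedTuple):
    number: int     # arbitrary FLA identifier number
    fas_count: int  # which fasciclin domain this is within the FLA
    fas_max: int    # total number of fasciclin domains in the FLA


# Assumed pattern, inferred only from the example name in the description.
_NAME_RE = re.compile(r"^>?X(\d+)_FLA\.(\d+)\.(\d+)$")


def parse_domain_name(name: str) -> DomainName:
    """Split an example-style name into FLA number, domain index and domain total."""
    match = _NAME_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognised name: {name!r}")
    number, fas_count, fas_max = (int(group) for group in match.groups())
    return DomainName(number, fas_count, fas_max)
```

Under these assumptions, `cluster_label(18)` gives the letter for the last fasciclin cluster in the A-R range, and `cluster_label(2, lowercase=True)` gives the lower-case form used for the AG, non-AG and inter-proline clusters.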
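Because the alignments in Supp data file 2 are distributed as FASTA files, the sequence names can be read without any bioinformatics library. The following is a minimal sketch, assuming standard FASTA layout (a `>` header line followed by one or more sequence lines). The function name `read_fasta` is hypothetical, and the toy input is invented for illustration and does not come from the dataset.

```python
import io
from typing import Dict, Optional, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Return {sequence name: aligned sequence} from a FASTA-formatted text stream."""
    records: Dict[str, str] = {}
    name: Optional[str] = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name] += line  # a sequence may wrap over several lines
    return records


# Toy input for illustration only; these are not sequences from the dataset.
example = io.StringIO(">seq_one\nAC-GT\nTT\n>seq_two\nACCGTTT\n")
alignment = read_fasta(example)
```

In a multiple sequence alignment every record should have the same length, so comparing the lengths of the returned strings is a quick sanity check after loading a file.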
Provider:
La Trobe
Created:
2024-12-02



