Metaclusters by DPCfam clustering of UniRef50 v 2017_07
Source: Mendeley Data. Updated 2024-05-10; indexed 2024-06-27.
Download link:
https://zenodo.org/records/6900559
Description:
Metaclusters obtained from the DPCfam clustering of UniRef50, v. 2017_07. Metaclusters (MCs) represent putative protein families automatically derived using the DPCfam method, as described in "Unsupervised protein family classification by Density Peak clustering", Russo ET, 2020, PhD thesis, http://hdl.handle.net/20.500.11767/116345. Supervisors: Alessandro Laio, Marco Punta. See also https://dpcfam.areasciencepark.it/ to navigate the data easily.

Version 1.1 changes:
- Added the DPCfamB database, which includes all small metaclusters with 25 <= N < 50 seed sequences. DPCfamB files are named with the prefix B_.
- Added an AlphaFold representative, based on AlphaFoldDB, for each MC.

### File description
1. Standard DPCfam database
- metaclusters_xml.tar.gz: Metaclusters' seeds, unaligned, in an XML table. Only MCs whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported. Each entry also includes statistical information about the MC (such as size, average length, and low-complexity fraction) and a Pfam comparison (Dominant Architecture). A README file describing the data, a parser that transforms the XML data into space-separated tables, and the XML schema are included.
- metaclusters_msas.tar.gz: Metaclusters' multiple sequence alignments, in FASTA format. Same inclusion criteria as metaclusters_xml.tar.gz.
- metaclusters_hmms.tar.gz: Metaclusters' profile HMMs, one ".hmm" file per metacluster. Same inclusion criteria as above.
- all_metaclusters_hmm.tar.gz: Collective profile HMM file: a single .hmm file collecting the profile HMMs of all MCs. Same inclusion criteria as above.
- uniref50_annotated.xml.gz: The UniRef50 v. 2017_07 database annotated with Pfam families and DPCfam metaclusters. A README file describing the data, a parser that transforms the XML data into space-separated tables, and the XML schema are included. The XML schema is derived from UniProt's UniRef50 XML schema.
2. DPCfamB database
- B_metaclusters_xml.tar.gz: Metaclusters' seeds, unaligned, in an XML table. All metaclusters are listed. Each entry also includes statistical information about the MC (such as size, average length, and low-complexity fraction) and a Pfam comparison (Dominant Architecture). A README file describing the data, a parser that transforms the XML data into space-separated tables, and the XML schema are included.
- B_metaclusters_msas.tar.gz: Metaclusters' multiple sequence alignments, in FASTA format. Only MCs whose seeds have 1) 25 <= N < 50 elements and 2) an average length larger than 50 amino acids are reported.
- B_metaclusters_hmms.tar.gz: Metaclusters' profile HMMs, one ".hmm" file per metacluster. Same inclusion criteria as B_metaclusters_msas.tar.gz.
- B_all_metaclusters_hmm.tar.gz: Collective profile HMM file: a single .hmm file collecting the profile HMMs of all MCs. Same inclusion criteria as above.
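The inclusion criteria quoted in the file descriptions can be written out as a small check. The sketch below is only an illustration of the stated thresholds, not code from the DPCfam release: the `Metacluster` class, its field names, and the `assign_database` function are hypothetical. It follows the wording literally, so a metacluster with exactly 50 seeds falls in neither set ("more than 50" for the standard database, 25 <= N < 50 for DPCfamB); whether the release treats N = 50 that way is not stated in the description.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold quoted in the description: average seed length must exceed 50 amino acids.
MIN_AVG_LENGTH = 50.0


@dataclass
class Metacluster:
    """Hypothetical container for the two statistics used by the criteria."""
    name: str          # illustrative identifier, not a real DPCfam accession
    n_seeds: int       # number of seed sequences (N)
    avg_length: float  # average seed length in amino acids


def assign_database(mc: Metacluster) -> Optional[str]:
    """Return 'DPCfam', 'DPCfamB', or None, following the description literally."""
    if mc.avg_length <= MIN_AVG_LENGTH:
        return None        # too short for either set of MSA/HMM files
    if mc.n_seeds > 50:
        return "DPCfam"    # standard database: more than 50 seeds
    if 25 <= mc.n_seeds < 50:
        return "DPCfamB"   # small metaclusters added in version 1.1
    return None            # N < 25, or N == 50 under the literal reading
```

For example, `assign_database(Metacluster("example", 30, 72.5))` returns `"DPCfamB"` under these assumptions.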
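The alignment archives are described as gzip-compressed tar files containing FASTA alignments. A minimal sketch of how they might be read without unpacking them to disk is shown below. It uses only the Python standard library; the function names are illustrative, and the internal layout of the archives (member names, extensions, whether a README sits alongside the alignments) is an assumption that should be checked against the README shipped with the data.

```python
import tarfile
from typing import IO, Iterable, Iterator, List, Tuple

FastaRecords = List[Tuple[str, str]]  # (header without '>', sequence incl. gap characters)


def parse_fasta(lines: Iterable[str]) -> FastaRecords:
    """Parse FASTA text; sequences wrapped over several lines are joined."""
    records: FastaRecords = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def iter_alignments(archive: IO[bytes]) -> Iterator[Tuple[str, FastaRecords]]:
    """Yield (member name, records) for each archive member that parses as FASTA."""
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and links
            handle = tar.extractfile(member)
            if handle is None:
                continue
            text = handle.read().decode("ascii", errors="replace")
            records = parse_fasta(text.splitlines())
            if records:  # members without any '>' header are skipped
                yield member.name, records
```

A typical call would open the downloaded file in binary mode, for instance `with open("metaclusters_msas.tar.gz", "rb") as fh:` followed by a loop over `iter_alignments(fh)`. Records are kept as a list rather than a dictionary so that repeated headers, if any occur, are not silently overwritten.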
Created:
2023-06-28



