
Unified Human Gastrointestinal Proteome clustering results by DPCfam

Source: Mendeley Data. Updated: 2024-06-27. Indexed: 2024-06-27.
Download link:
https://zenodo.org/record/7335147
Resource description:
This dataset contains the results of clustering the Unified Human Gastrointestinal Proteome (UHGP) using the DPCfam algorithm. More details on the DPCfam clustering algorithm can be found in the original publication:

Russo, Elena Tea, et al. "DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets." PLOS Computational Biology 18.10 (2022): e1010610. https://doi.org/10.1371/journal.pcbi.1010610

All of the putative protein families obtained through DPCfam (including previous results) can be browsed online at our dedicated webserver: https://dpcfam.areasciencepark.it/uhgp

The original protein dataset is version 1.0 of the UHGP-50 dataset, available for download from MGnify at https://www.ebi.ac.uk/metagenomics/.

FILES DESCRIPTION:

Only MCs (metaclusters) whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported.

metaclusters_xml.tar.gz:
- dpcfam_uhgp_metaclusters.xml: Metaclusters' seeds. Metacluster entries also include statistical information about each MC (such as size, average length, low-complexity fraction, etc.) and a comparison with Pfam (Dominant Architecture).
- dpcfam_metaclusters.xsd: XML schema file for the data.
- MCxml_to_tables.awk: Awk script to convert from XML to tabular text files. Use through the parse.sh script.
- parse.sh: XML parser.
- README.md

uhgp_xml.tar.gz:
- uhgp_proteins.xml: XML file containing all UHGP-50 proteins and their corresponding sequences, annotated with Pfam and DPCfam metacluster data. Annotations record whether a protein is a seed member or a match found through the profile HMMs of the DPCfam-UHGP and DPCfam-Uniref clusterings.
- uhgp_matches.xsd: XML schema file for the data.
- xml_to_list.awk: Awk script to convert from XML to tabular text files. Use through the parse.sh script.
- xml_to_list_mcfiles.awk: Awk script to convert from XML to tabular text files (including individual files for metaclusters' seeds). Use through the parse.sh script.
- parse.sh: XML parser.
- README.md

Metacluster Files:
- seeds.zip: Metaclusters' seed sequences. One FASTA file per metacluster, before filtering.
- filtered_seeds.zip: Metaclusters' seed sequences after clustering at 60 percent identity.
- metaclusters_hmms.tar.gz: Metaclusters' profile HMMs. One ".hmm" file per metacluster.
- metaclusters_msas.tar.gz: Metaclusters' multiple sequence alignments, in FASTA format.
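The two reporting criteria and the per-metacluster FASTA layout can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names (`parse_fasta`, `passes_reporting_thresholds`, `summarize_seed_archive`) and the member file names in the demo are hypothetical, the internal layout of seeds.zip is not documented here, and the description does not say whether the thresholds are evaluated before or after the 60 percent identity filtering. The official conversion route remains the parse.sh and Awk scripts shipped with the archives.

```python
import io
import zipfile
from statistics import mean


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def passes_reporting_thresholds(records, min_size=50, min_avg_len=50):
    """Check the two stated criteria: more than 50 seed elements and an
    average length larger than 50 amino acids (both strict inequalities)."""
    if len(records) <= min_size:
        return False
    return mean(len(seq) for _, seq in records) > min_avg_len


def summarize_seed_archive(zip_source):
    """Map each file in a zip archive to (n_sequences, avg_length, passes).

    Assumption: every non-directory member is a FASTA file for one metacluster.
    """
    summary = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            text = archive.read(name).decode("utf-8", errors="replace")
            records = parse_fasta(text)
            avg = mean(len(seq) for _, seq in records) if records else 0.0
            summary[name] = (len(records), avg, passes_reporting_thresholds(records))
    return summary


if __name__ == "__main__":
    # Tiny in-memory archive standing in for seeds.zip; member names are made up.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        large = "".join(f">seq{i}\n{'A' * 60}\n" for i in range(51))
        archive.writestr("MC_example_large.fasta", large)
        archive.writestr("MC_example_small.fasta", ">only\nMKV\n")
    buffer.seek(0)
    for name, (count, avg, ok) in summarize_seed_archive(buffer).items():
        print(name, count, round(avg, 1), ok)
```

To try it on the real data, one would pass the path of the downloaded seeds.zip to `summarize_seed_archive` instead of the in-memory buffer. Since only metaclusters meeting both criteria are reported, every file would be expected to pass if the thresholds were computed on the unfiltered seeds, which is itself an assumption.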

Created:
2023-06-28