yarongef/human_proteome_singlets
Source: Hugging Face. Updated 2022-09-21; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/yarongef/human_proteome_singlets
Summary:
This dataset is derived from the UniProt human proteome, which contains 20,577 human proteins. Removing sequences shorter than 20 or longer than 512 amino acids left 12,703 proteins. These sequences were then randomly shuffled with the uShuffle algorithm while preserving their singlet (single amino acid) distribution. Finally, the h-CD-HIT algorithm was used to filter the shuffled sequences at three pairwise identity thresholds (0.9, 0.5, and 0.1), giving a final set of 11,698 sequences.
Provided by:
yarongef
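Summary of original information
Dataset overview
Data source and processing
- The raw data consists of 20,577 human protein sequences from the UniProt human proteome.
- Proteins shorter than 20 or longer than 512 amino acids were removed, leaving 12,703 proteins.
- These protein sequences were randomly shuffled with the uShuffle algorithm while preserving their singlet distribution.
- The sequences were then filtered with the h-CD-HIT algorithm at three pairwise identity thresholds (0.9, 0.5, 0.1), giving a final set of 11,698 sequences.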
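The first two processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names and toy sequences are hypothetical, and a plain random permutation of residues stands in for uShuffle (permuting the residues keeps every amino acid count unchanged, which is what preserving the singlet distribution requires). The h-CD-HIT filtering step is not reproduced here.

```python
import random
from collections import Counter

MIN_LEN = 20   # shortest sequence kept, per the dataset description
MAX_LEN = 512  # longest sequence kept, per the dataset description


def length_filter(sequences):
    """Keep sequences whose length lies in [MIN_LEN, MAX_LEN]."""
    return [s for s in sequences if MIN_LEN <= len(s) <= MAX_LEN]


def shuffle_singlets(sequence, rng):
    """Return a random permutation of the residues in `sequence`.

    Assumption: this stands in for uShuffle at the singlet level.
    A permutation leaves the count of each amino acid unchanged.
    """
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


# Toy input (hypothetical sequences, not taken from the dataset):
# one too short, one within range, one too long.
toy = ["MKT", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "A" * 600]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
kept = length_filter(toy)
shuffled = [shuffle_singlets(s, rng) for s in kept]

# The shuffled sequence has the same amino acid counts as the original.
same_composition = Counter(shuffled[0]) == Counter(kept[0])
```

Comparing `Counter` objects before and after shuffling is a quick way to confirm that a shuffled sequence kept the amino acid composition of its source.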
Citation
If you use this dataset, please cite the following article:
@article {
  author = {Geffen, Yaron and Ofran, Yanay and Unger, Ron},
  title = {DistilProtBert: A distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts},
  year = {2022},
  doi = {10.1093/bioinformatics/btac474},
  URL = {https://doi.org/10.1093/bioinformatics/btac474},
  journal = {Bioinformatics}
}



