yarongef/human_proteome_singlets

Name: yarongef/human_proteome_singlets
Creator: yarongef
Published: 2022-09-21 08:45:02
License: No description available

Hugging Face | Updated 2022-09-21 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/yarongef/human_proteome_singlets

Resource overview:

This dataset is derived from the UniProt human proteome, which contains 20,577 human proteins. Sequences shorter than 20 amino acids or longer than 512 amino acids were removed, leaving 12,703 proteins. These sequences were then randomly shuffled with the uShuffle algorithm while preserving their singlet (single-residue) distribution. Finally, the h-CD-HIT algorithm was applied at three pairwise identity thresholds (0.9, 0.5, and 0.1), yielding a final set of 11,698 sequences.
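The length filter described above can be illustrated with a minimal sketch. It assumes the bounds are inclusive (sequences of exactly 20 or 512 residues are kept), which is how the description reads but is not confirmed by the source. The function name `filter_by_length` and the toy records are hypothetical and are not part of the dataset or its tooling.

```python
from typing import Dict

MIN_LEN = 20   # shortest sequence kept, per the description (assumed inclusive)
MAX_LEN = 512  # longest sequence kept, per the description (assumed inclusive)


def filter_by_length(proteins: Dict[str, str]) -> Dict[str, str]:
    """Keep only sequences whose length lies in [MIN_LEN, MAX_LEN]."""
    return {
        name: seq
        for name, seq in proteins.items()
        if MIN_LEN <= len(seq) <= MAX_LEN
    }


# Hypothetical toy records, not taken from the dataset.
toy = {
    "too_short": "MKT",          # 3 residues, dropped
    "kept": "M" + "A" * 29,      # 30 residues, retained
    "too_long": "M" * 600,       # 600 residues, dropped
}
filtered = filter_by_length(toy)
print(sorted(filtered))
```

Applied to the real proteome, a filter of this kind is the step that reduces 20,577 proteins to 12,703.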

Provider:

yarongef

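Summary of original information

Dataset overview

Dataset source and processing

The raw data comprises 20,577 human protein sequences from the UniProt human proteome.
Proteins with sequences shorter than 20 or longer than 512 amino acids were removed, leaving 12,703 proteins.
The remaining sequences were randomly shuffled with the uShuffle algorithm while preserving their singlet distribution.
They were then filtered with the h-CD-HIT algorithm at three pairwise identity thresholds (0.9, 0.5, 0.1), yielding 11,698 sequences.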
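
Preserving the singlet distribution means that each shuffled sequence keeps exactly the same amino acid counts as its source sequence. For single residues, any random permutation has that property. The sketch below is not uShuffle itself; it is a stand-in built on Python's standard `random` module, and the function name `shuffle_preserving_singlets` and the toy sequence are hypothetical.

```python
import random
from collections import Counter


def shuffle_preserving_singlets(seq: str, rng: random.Random) -> str:
    """Return a random permutation of seq; residue counts are unchanged."""
    residues = list(seq)
    rng.shuffle(residues)  # in-place uniform shuffle of the residue list
    return "".join(residues)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical toy sequence
shuffled = shuffle_preserving_singlets(original, rng)

# The singlet (k=1) composition is identical before and after shuffling.
assert Counter(shuffled) == Counter(original)
print(original)
print(shuffled)
```

A permutation like this keeps single-residue counts but makes no attempt to keep longer k-let counts, which is the case a dedicated tool such as uShuffle is designed to handle.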

Citation

If you use this dataset, please cite the following reference:

@article {
  author = {Geffen, Yaron and Ofran, Yanay and Unger, Ron},
  title = {DistilProtBert: A distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts},
  year = {2022},
  doi = {10.1093/bioinformatics/btac474},
  URL = {https://doi.org/10.1093/bioinformatics/btac474},
  journal = {Bioinformatics}
}
