tattabio/OMG_prot50
Source: Hugging Face. Updated 2024-08-19. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/tattabio/OMG_prot50
Description:
The `OMG_prot50` dataset is a protein-only dataset created by clustering the Open MetaGenomic dataset ([`OMG`](https://huggingface.co/datasets/tattabio/OMG)) at 50% sequence identity. MMseqs2 linclust (Steinegger and Söding 2018) was used to cluster all 4.2B protein sequences from the OMG dataset, resulting in 207M protein sequences. Sequences were clustered at 50% sequence identity and 90% sequence coverage, and singleton clusters were removed.
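
The singleton-removal step described above can be illustrated with a minimal sketch. It assumes the cluster assignments are available as a two-column, tab-separated table (representative, then member), which is the layout MMseqs2 can export; the function name `drop_singleton_clusters` and the protein IDs are hypothetical, and this is not necessarily how the dataset authors implemented the filter.

```python
from collections import defaultdict


def drop_singleton_clusters(tsv_lines):
    """Group members by cluster representative and keep clusters with 2+ members.

    Assumes each non-empty line is "<representative>\t<member>" (hypothetical input).
    """
    clusters = defaultdict(list)
    for line in tsv_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        representative, member = line.split("\t")
        clusters[representative].append(member)
    # A singleton cluster contains only its own representative.
    return {rep: members for rep, members in clusters.items() if len(members) > 1}


# Toy input: protC forms a singleton cluster and is dropped.
example = [
    "protA\tprotA",
    "protA\tprotB",
    "protC\tprotC",
    "protD\tprotD",
    "protD\tprotE",
    "protD\tprotF",
]
kept = drop_singleton_clusters(example)
print(sorted(kept))  # ['protA', 'protD']
```

At the scale of billions of sequences, an in-memory dictionary like this would likely be impractical; the sketch only shows the logic of the filter.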
Provider:
tattabio



