tattabio/OG_prot90
Source: Hugging Face. Updated 2024-11-18; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/tattabio/OG_prot90
Description:
The `OG_prot90` dataset is a protein-only dataset created by clustering the Open Genomic dataset ([`OG`](https://huggingface.co/datasets/tattabio/OG)) at 90% sequence identity and 90% sequence coverage. MMseqs2 linclust (Steinegger and Söding 2018) was used to cluster all 400M protein sequences in the OG dataset, yielding 85M protein sequences.
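The clustering step described above can be approximated with the MMseqs2 command-line tool. The sketch below only builds the argument list for an `mmseqs easy-linclust` call with the two thresholds named in the description (`--min-seq-id 0.9` and `-c 0.9`). The helper name `build_linclust_command` and the file names are hypothetical, and the dataset card does not state the exact invocation or any other parameters (such as the coverage mode), so treat this as an illustration rather than the authors' actual pipeline.

```python
from typing import List


def build_linclust_command(
    fasta_path: str,
    output_prefix: str,
    tmp_dir: str,
    min_seq_id: float = 0.9,
    coverage: float = 0.9,
) -> List[str]:
    """Build an `mmseqs easy-linclust` argument list (hypothetical helper).

    Only the identity and coverage thresholds come from the dataset
    description; all other MMseqs2 settings are left at their defaults,
    which is an assumption.
    """
    for name, value in (("min_seq_id", min_seq_id), ("coverage", coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return [
        "mmseqs",
        "easy-linclust",
        fasta_path,      # input protein FASTA (placeholder name)
        output_prefix,   # prefix for the cluster result files
        tmp_dir,         # scratch directory used by MMseqs2
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity
        "-c", str(coverage),              # minimum alignment coverage
    ]


if __name__ == "__main__":
    # Print the command instead of running it, since MMseqs2 and the
    # input file may not be present on this machine.
    cmd = build_linclust_command("og_proteins.fasta", "og_prot90", "tmp")
    print(" ".join(cmd))
```

To actually run the clustering, the returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where MMseqs2 is installed and the input FASTA exists.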
Provider:
tattabio



