microsoft/Dayhoff
Hugging Face | Updated 2026-04-02 | Indexed 2025-08-09
Download link:
https://hf-mirror.com/datasets/microsoft/Dayhoff
Overview:
The Dayhoff dataset is a resource that combines protein sequence data with generative language models. It contains 334 million protein sequences in 170 million clusters, drawn from metagenomic sequences, natural protein sequences, structure-derived synthetic sequences, and homologs. The dataset is used to train protein language models that can predict the effects of mutations on fitness, scaffold structural motifs conditioned on evolutionary or structural context, and guide the generation of novel proteins within specified families. The Dayhoff model architecture combines state-space Mamba layers with Transformer self-attention and uses Mixture-of-Experts modules to maximize capacity while preserving efficiency.
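With hundreds of millions of sequences, it may be more practical to stream a few records than to download everything. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed; the helper names `list_configs` and `preview` are hypothetical, the `train` split is an assumption, and the configuration names are not given in this listing, so they are queried rather than hard-coded. Routing requests through the mirror with the `HF_ENDPOINT` environment variable is likewise an assumption about the local setup.

```python
import os

REPO_ID = "microsoft/Dayhoff"              # dataset id from this listing
MIRROR_ENDPOINT = "https://hf-mirror.com"  # host of the download link above


def _maybe_use_mirror(use_mirror: bool) -> None:
    # Assumption: huggingface_hub reads HF_ENDPOINT when it is first imported,
    # so this only takes effect if `datasets` has not been imported yet.
    if use_mirror:
        os.environ.setdefault("HF_ENDPOINT", MIRROR_ENDPOINT)


def list_configs(use_mirror: bool = False) -> list:
    """Return the configuration (subset) names published for the dataset."""
    _maybe_use_mirror(use_mirror)
    # Imported lazily so this module loads even without `datasets` installed.
    from datasets import get_dataset_config_names
    return get_dataset_config_names(REPO_ID)


def preview(config_name: str, split: str = "train", n: int = 3,
            use_mirror: bool = False) -> list:
    """Stream the first n records of one configuration without a full download.

    `split="train"` is an assumption; check the dataset card for actual splits.
    """
    _maybe_use_mirror(use_mirror)
    from datasets import load_dataset
    stream = load_dataset(REPO_ID, name=config_name, split=split, streaming=True)
    return list(stream.take(n))
```

A typical session would call `list_configs()` first, pick one of the returned names, and pass it to `preview(...)` to inspect the record fields before committing to a full download.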
Provided by:
microsoft



