Dayhoff
ModelScope community. Updated 2025-12-05; indexed 2025-08-02.
Download link: https://modelscope.cn/datasets/microsoft/Dayhoff
# Dataset Card for Dayhoff
Dayhoff is an Atlas of both protein sequence data and generative language models — a centralized resource that brings together 3.34 billion protein sequences across 1.7 billion clusters of metagenomic and natural protein sequences (GigaRef), 46 million structure-derived synthetic sequences (BackboneRef), and 16 million multiple sequence alignments (OpenProteinSet). These models can natively predict zero-shot mutation effects on fitness, scaffold structural motifs by conditioning on evolutionary or structural context, and perform guided generation of novel proteins within specified families. Learning from metagenomic and structure-based synthetic data from the Dayhoff Atlas increased the cellular expression rates of generated proteins, highlighting the real-world value of expanding the scale, diversity, and novelty of protein sequence data.
The Dayhoff model architecture combines state-space Mamba layers with Transformer self-attention, interleaved with Mixture-of-Experts modules to maximize capacity while preserving efficiency. It natively handles long contexts, allowing both single sequences and unrolled MSAs to be modeled. Trained with an autoregressive objective in both N→C and C→N directions, Dayhoff supports order-agnostic infilling and scales to billions of parameters.
## Dataset Structure
The Dayhoff models were trained on the Dayhoff Atlas, with varying data mixes which include:
* **[UniRef50](https://www.uniprot.org/)** (**UR50**) - dataset from UniProt, clustered at 50% sequence identity; contains only cluster representatives.
    * _Splits: train (25 GB), test (26 MB), valid (26 MB)_
* **[UniRef90](https://www.uniprot.org/)** (**UR90**) - dataset from UniProt, clustered at 90% sequence identity; contains cluster representatives and members.
    * _Splits: train (83 GB), test (90 MB), valid (87 MB)_
* **GigaRef** (**GR**) - 3.34B protein sequences across 1.7B clusters of metagenomic and natural protein sequences. There are two subsets of GigaRef:
    * **GigaRef-clusters** (**GR**) - includes only cluster representatives and members, no singletons.
        * _Splits: train (433 GB), test (22 MB)_
    * **GigaRef-singletons** (**GR-s**) - includes only singletons.
        * _Splits: train (282 GB)_
* **BackboneRef** (**BR**) - 46M structure-derived synthetic sequences from ca. 240,000 de novo backbones, with three subsets containing 10M sequences each:
    * **BackboneRef unfiltered** (**BRu**) - 10M sequences randomly sampled from all 46M designs.
        * _Splits: train (3 GB)_
    * **BackboneRef quality** (**BRq**) - 10M sequences sampled from 127,633 backbones whose average self-consistency RMSD is ≤ 2 Å.
        * _Splits: train (3 GB)_
    * **BackboneRef novelty** (**BRn**) - 10M sequences from 138,044 backbones with a maximum TM-score < 0.5 to any natural structure.
        * _Splits: train (3 GB)_
    * **BackboneRef structures** (**backboneref-structures**) - the structures used to generate the BackboneRef datasets are also available.
* **[OpenProteinSet](https://arxiv.org/abs/2308.05326)** (**HM**) - 16 million precomputed MSAs from 16M sequences in UniClust30 and 140,000 PDB chains. (_Not currently available on Hugging Face._)
### DayhoffRef
Given the potential for generative models to expand the space of proteins and their functions, we used the Dayhoff models to generate DayhoffRef, a PLM-generated database of synthetic protein sequences.
* **DayhoffRef** - dataset of 16 million synthetic protein sequences generated by the Dayhoff models: Dayhoff-3b-UR90, Dayhoff-3b-GR-HM, Dayhoff-3b-GR-HM-c, and Dayhoff-170m-UR50-BRn.
    * _Splits: train (5 GB)_
## Uses
### Direct Use
The dataset is intended for training protein language models.
### Sample code
To read any of the datasets, use Hugging Face's `load_dataset`, which uses the Arrow format for maximum efficiency. Below are some examples of how to load the datasets:
```py
from datasets import load_dataset

# `name` selects a dataset configuration; `split` selects a split within it.
gigaref_singletons_train = load_dataset("microsoft/Dayhoff",
                                        name="gigaref_only_singletons",
                                        split="train")
gigaref_clustered_train = load_dataset("microsoft/Dayhoff",
                                       name="gigaref_no_singletons",
                                       split="train")
uniref50_train = load_dataset("microsoft/Dayhoff",
                              name="uniref50",
                              split="train")
uniref90_train = load_dataset("microsoft/Dayhoff",
                              name="uniref90",
                              split="train")
# The BRn subset is a split of the "backboneref" configuration.
backboneref_novelty = load_dataset("microsoft/Dayhoff",
                                   name="backboneref",
                                   split="BRn")
dayhoffref = load_dataset("microsoft/Dayhoff",
                          name="dayhoffref",
                          split="train")
```
For the largest datasets, consider passing `streaming=True` to `load_dataset`.
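As a minimal sketch of the streaming option, the helper below opens a configuration lazily and reads only the first few records. The helper names `head` and `stream_split` are hypothetical and introduced only for illustration. The column names of each record depend on the configuration and are not assumed here.

```python
from itertools import islice
from typing import Iterable, Iterator


def head(records: Iterable[dict], n: int) -> Iterator[dict]:
    """Yield at most the first `n` records without consuming the rest."""
    return islice(records, n)


def stream_split(config: str, split: str = "train"):
    """Open one configuration in streaming mode (hypothetical helper).

    Records are fetched during iteration instead of being downloaded up front.
    """
    # Imported here so `head` stays usable without the `datasets` package.
    from datasets import load_dataset

    return load_dataset("microsoft/Dayhoff", name=config, split=split,
                        streaming=True)


# Usage (requires network access and the `datasets` package):
#   for record in head(stream_split("gigaref_no_singletons"), 3):
#       print(sorted(record))  # column names depend on the configuration
```

Streaming suits the multi-hundred-gigabyte GigaRef splits, where inspecting a handful of records should not require a full download.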
### Out-of-Scope Use Cases
This model should not be used to generate anything that is not a protein sequence or a set of homologous protein sequences. It is not meant for natural language or other biological sequences, such as DNA sequences.
## Dataset Creation
### Curation Rationale
The motivation for creating the Dayhoff Atlas was to systematically combine genomic-derived protein sequences, metagenomics, structure-based synthetic sequences, and homologs to enhance protein language models (PLMs). With this dataset we aim to expand the diversity and scale of natural protein sequences available for model training, infuse structural information into sequence space, and unify different sources of protein data into a centralized resource.
### GigaRef
#### Data Collection and Processing
A multi-step clustering process was performed to combine, deduplicate, and recluster the source datasets into **GigaRef**. MERC and SRC were combined and preclustered at 70% identity, and MGnify was separately preclustered at 70% identity. The representatives from these two preclustering steps were then combined with sequences from SMAG, MetaEuk, MGV, GPD, TOPAZ, and UniRef100, yielding a total of 3.94 billion sequences. These were clustered at 90% sequence identity, and the resulting representatives were then clustered at 50% identity to produce the GigaRef dataset.
#### Who are the source data producers?
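The two-stage idea (cluster everything at 90% identity, then cluster only the representatives at 50%) can be illustrated with a toy sketch. This is not the tool or algorithm used to build GigaRef: the `identity` function below is a naive position-by-position comparison without alignment, the clustering is a simple greedy pass, and all names are hypothetical.

```python
from typing import Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Toy identity: matching positions over the longer length (no alignment)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: List[str], threshold: float) -> Dict[str, List[str]]:
    """Assign each sequence to the first representative it matches."""
    clusters: Dict[str, List[str]] = {}
    # Longest sequences are considered first, so they become representatives.
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(rep, seq) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no match: seq starts a new cluster
    return clusters


def cascade(seqs: List[str], first: float = 0.9,
            second: float = 0.5) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Cluster at `first`, then cluster the stage-one representatives at `second`."""
    stage1 = greedy_cluster(seqs, first)
    stage2 = greedy_cluster(list(stage1), second)
    return stage1, stage2


toy = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYGGGGG", "WWWWWWWWWW"]
stage1, stage2 = cascade(toy)
print(len(stage1), len(stage2))  # 3 2
```

Clustering only the representatives in the second stage keeps the expensive low-identity comparison off the full set of sequences.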
The source data producers include various metagenomic databases such as MGnify, Soil Metagenome-Assembled Genomes (SMAG), MetaEuk, Metagenomic Gut Virus catalog (MGV), Gut Phage Database (GPD), Soil Reference Catalog (SRC), Marine Eukaryotic Reference Catalog (MERC), and Tara Oceans Particle-Associated MAGs (TOPAZ).
### BackboneRef
#### Data Collection and Processing
**BackboneRef** was generated by sampling backbone structures from RFDiffusion and designing synthetic sequences using ProteinMPNN. Structures were sampled according to the length distribution of UniRef50, to recapitulate the lengths of natural proteins but with a minimum length of 40 and maximum length of 512 for computational efficiency. An initial set of 300 sequences per backbone at temperature 0.2 was generated, producing a parent dataset with 72,243,300 sequences.
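A minimal sketch of length sampling under the stated bounds follows. It assumes that out-of-range lengths are excluded before sampling rather than clamped to the bounds; the card does not say which was done. The function name and the toy length list are hypothetical.

```python
import random
from collections import Counter
from typing import Iterable, List

MIN_LEN, MAX_LEN = 40, 512  # bounds stated in the card


def sample_lengths(natural_lengths: Iterable[int], n: int, seed: int = 0) -> List[int]:
    """Draw `n` lengths from the empirical distribution restricted to the bounds."""
    # Assumption: lengths outside [MIN_LEN, MAX_LEN] are dropped, not clamped.
    counts = Counter(length for length in natural_lengths
                     if MIN_LEN <= length <= MAX_LEN)
    if not counts:
        raise ValueError("no lengths fall inside the allowed range")
    rng = random.Random(seed)
    lengths = list(counts)
    weights = [counts[length] for length in lengths]
    return rng.choices(lengths, weights=weights, k=n)


# Toy stand-in for the UniRef50 length distribution.
toy_lengths = [25, 60, 60, 120, 300, 300, 300, 511, 900]
draws = sample_lengths(toy_lengths, n=1000)
print(min(draws) >= MIN_LEN and max(draws) <= MAX_LEN)  # True
```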
To create the **BRu** dataset, 42 sequences were selected per backbone from this parent dataset, yielding 10,114,860 sequences. Exact duplicates were removed, then the remaining sequences were randomly subsampled to produce a dataset of 10M sequences.
To create the **BRq** dataset, backbones with an average scRMSD greater than 2 Å were removed, leaving 127,633 backbones. 80 sequences per backbone were randomly selected from the parent dataset, and again exact duplicates were removed, followed by random subsampling to produce a dataset of 10M sequences.
To create the **BRn** dataset, any backbones with a maximum TM-score larger than 0.5 to any structure in the `AFDB/UniProt` database were removed, leaving 138,044 backbones. 74 sequences per backbone were randomly sampled from the parent dataset, and again exact duplicates were removed, followed by random subsampling to produce a dataset of 10M sequences.
#### Who are the source data producers?
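The three subsets share one recipe: pick a fixed number of sequences per backbone, drop exact duplicates, then randomly subsample to the target size. The sketch below illustrates that recipe on toy data. The function name and data layout are hypothetical, and random per-backbone selection is assumed throughout (the card states it for BRq and BRn only).

```python
import random
from typing import Dict, List


def build_subset(parent: Dict[str, List[str]], per_backbone: int,
                 target: int, seed: int = 0) -> List[str]:
    """Pick sequences per backbone, drop exact duplicates, then subsample."""
    rng = random.Random(seed)
    picked: List[str] = []
    for backbone_id in sorted(parent):
        seqs = parent[backbone_id]
        picked.extend(rng.sample(seqs, min(per_backbone, len(seqs))))
    unique = list(dict.fromkeys(picked))  # order-preserving exact deduplication
    return rng.sample(unique, min(target, len(unique)))


# Pool sizes before deduplication, from the counts stated above.
brq_pool = 127_633 * 80  # 10,210,640 sequences
brn_pool = 138_044 * 74  # 10,215,256 sequences

toy_parent = {
    "bb1": ["AAAA", "AAAC", "AAAD"],
    "bb2": ["AAAA", "CCCC", "DDDD"],
}
subset = build_subset(toy_parent, per_backbone=2, target=3)
print(len(subset))  # 3
```

Both pools start slightly above 10M, which leaves room for duplicate removal before the final subsampling step.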
This is a synthetically generated dataset.
### DayhoffRef
#### Data Collection and Processing
DayhoffRef sequences were sampled in both the N-to-C and C-to-N directions from Dayhoff-3b-GR-HM, Dayhoff-3b-GR-HM-c, and Dayhoff-170m-UR50-BRn at temperatures 0.9 and 1.0 on 8x AMD MI300X GPUs for each combination of model, direction, and temperature. Clustering to 90% identity was performed with MMseqs2 using default parameters, and then the cluster representatives were clustered to 50% identity using default parameters.
#### Who are the source data producers?
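The sampling grid described above can be enumerated as a small sketch. The dictionary keys and direction labels are illustrative only, and the model names are spelled as in the DayhoffRef summary earlier in this card.

```python
from itertools import product

models = ["Dayhoff-3b-GR-HM", "Dayhoff-3b-GR-HM-c", "Dayhoff-170m-UR50-BRn"]
directions = ["N-to-C", "C-to-N"]
temperatures = [0.9, 1.0]

# One sampling run per (model, direction, temperature) combination.
grid = [
    {"model": m, "direction": d, "temperature": t}
    for m, d, t in product(models, directions, temperatures)
]
print(len(grid))  # 12
```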
This is a synthetically generated dataset.
## Bias, Risks, and Limitations
This model should not be used to generate anything that is not a protein sequence or a set of homologous protein sequences. It is not meant for natural language or other biological sequences, such as DNA sequences. Not all sequences are guaranteed to be realistic. It remains difficult to generate high-quality sequences with no sequence homology to any natural sequence.
## Responsible AI Considerations
The intended use of this model is to generate high-quality, realistic, protein sequences or sets of homologous protein sequences. Generations can be designed from scratch or conditioned on partial sequences in both N→C and C→N directions.
The code and datasets released in this repository are provided for research and development use only. They are not intended for use in clinical decision-making or for any other clinical use, and the performance of these models for clinical use has not been established. You bear sole responsibility for any use of these models, data and software, including incorporation into any product intended for clinical use.
Provider: maas
Created: 2025-07-26



