A curriculum learning approach to training antibody language models
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14661301
Description:
Motivation. There is growing interest in pre-training antibody language models (AbLMs) on a mixture of unpaired and natively paired sequencing data. Pre-training with natively paired sequences has proven benefits, but paired data are scarce relative to unpaired data. Existing models trained on both unpaired and paired data typically use a finetuning approach to pre-training, but this can result in catastrophic forgetting of the unpaired sequences. We explore whether a curriculum learning approach to pre-training could help address these issues.
Results. We introduce a modified version of curriculum learning for training AbLMs that changes how training data are sampled over the course of training, producing a gradual transition from unpaired to paired data. We optimize this approach and observe that our 650M-parameter curriculum model, CurrAb, outperforms existing AbLMs in downstream classification tasks.
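The gradual transition from unpaired to paired data can be illustrated with a minimal sketch. This is not the schedule used for CurrAb: the linear ramp and the names `paired_probability` and `sample_batch` are assumptions made purely for illustration. The idea shown is that each training example is drawn from the paired pool with a probability that rises over training.

```python
import random


def paired_probability(step, total_steps):
    """Probability of drawing a paired sequence at a given step.

    Assumption: a linear ramp from 0 (all unpaired) to 1 (all paired).
    The actual schedule is not specified in this record.
    """
    if total_steps <= 1:
        return 1.0
    frac = step / (total_steps - 1)
    return min(1.0, max(0.0, frac))


def sample_batch(unpaired, paired, step, total_steps, batch_size, rng):
    """Draw a batch whose unpaired/paired mix depends on the training step."""
    p = paired_probability(step, total_steps)
    batch = []
    for _ in range(batch_size):
        # Choose the source pool per example, then pick one sequence from it.
        source = paired if rng.random() < p else unpaired
        batch.append(rng.choice(source))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    unpaired = ["U1", "U2", "U3"]   # placeholder records, not real sequences
    paired = ["P1", "P2", "P3"]
    for step in (0, 50, 99):
        print(step, sample_batch(unpaired, paired, step, 100, 8, rng))
```

Sampling per example rather than switching datasets at a fixed step avoids the abrupt change in data distribution that a finetuning stage introduces.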
Files. The following files are included in this repository:
CurrAb.tar.gz: Model weights for the CurrAb model. The model can also be downloaded from HuggingFace.
TTE_paired-downsampled.tar.gz: Downsampled paired datasets used to train, test, and evaluate the 55M-parameter models.
TTE_unpaired-downsampled.tar.gz: Downsampled unpaired datasets used to train, test, and evaluate the 55M-parameter models.
TTE_paired-full.tar.gz: Full paired datasets used to train, test, and evaluate the 650M-parameter models, including CurrAb.
TTE_unpaired-full.tar.gz: Full unpaired datasets used to train, test, and evaluate the 650M-parameter models, including CurrAb.
classification-datasets.tar.gz: Three classification datasets used to train classification models in Figure 5. The datasets are: flu-0_cov-1.csv, hd-0_cov-1.csv, and hd-0_flu-1_cov-2.csv. CoV antibody sequences were obtained from CoV-AbDab, Flu antibody sequences were obtained from Wang et al., and healthy donor antibody sequences were obtained from Jaffe et al. and Phad et al.
Code: The code for model training and evaluation is available under the MIT license on GitHub.
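The classification dataset filenames listed above appear to encode a class-to-label mapping (for example, `flu-0_cov-1.csv` suggests Flu = 0 and CoV = 1). Assuming that reading of the naming convention is correct, a small helper can recover the mapping from a filename. The function name `label_map_from_filename` is hypothetical, and the column layout inside the CSV files is not described here, so the sketch touches only the filename.

```python
from pathlib import Path


def label_map_from_filename(filename):
    """Parse a name like 'hd-0_flu-1_cov-2.csv' into {'hd': 0, 'flu': 1, 'cov': 2}.

    Assumption: each underscore-separated part has the form '<class>-<label>'.
    """
    stem = Path(filename).stem  # drops the '.csv' extension
    labels = {}
    for part in stem.split("_"):
        # Split on the last hyphen so class names may themselves contain hyphens.
        name, _, index = part.rpartition("-")
        labels[name] = int(index)
    return labels


if __name__ == "__main__":
    for fname in ("flu-0_cov-1.csv", "hd-0_cov-1.csv", "hd-0_flu-1_cov-2.csv"):
        print(fname, label_map_from_filename(fname))
```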
Created:
2025-02-27



