A curriculum learning approach to training antibody language models

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14661301

Description:

Motivation. There is growing interest in pre-training antibody language models (AbLMs) on a mixture of unpaired and natively paired sequencing data. Pre-training with natively paired sequences has proven benefits, but paired data are sparse relative to unpaired data. Existing models trained with both unpaired and paired data typically use a finetuning approach to pre-training, but this can result in catastrophic forgetting of the unpaired sequences. We explore whether a curriculum learning approach to pre-training could help address these issues.

Results. We introduce a modified version of curriculum learning for training AbLMs that adjusts the sampling of training data over the course of training, producing a gradual transition from unpaired to paired data. We optimize this approach and observe that our 650M-parameter curriculum model, CurrAb, outperforms existing AbLMs in downstream classification tasks.

Files. The following files are included in this repository:

CurrAb.tar.gz: Model weights for the CurrAb model. The model can also be downloaded from HuggingFace.
TTE_paired-downsampled.tar.gz: Downsampled paired datasets used to train, test, and evaluate the 55M-parameter models.
TTE_unpaired-downsampled.tar.gz: Downsampled unpaired datasets used to train, test, and evaluate the 55M-parameter models.
TTE_paired-full.tar.gz: Full paired datasets used to train, test, and evaluate the 650M-parameter models, including CurrAb.
TTE_unpaired-full.tar.gz: Full unpaired datasets used to train, test, and evaluate the 650M-parameter models, including CurrAb.
classification-datasets.tar.gz: Three classification datasets used to train the classification models in Figure 5: flu-0_cov-1.csv, hd-0_cov-1.csv, and hd-0_flu-1_cov-2.csv. CoV antibody sequences were obtained from CoV-AbDab, Flu antibody sequences from Wang et al., and healthy donor antibody sequences from Jaffe et al. and Phad et al.

Code: The code for model training and evaluation is available under the MIT license on GitHub.
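
The gradual transition from unpaired to paired data described under Results can be pictured with a minimal sketch. This is not the authors' implementation (that is in the GitHub repository mentioned above); the linear schedule, the function names `paired_probability` and `curriculum_stream`, and the placeholder sequence strings are all assumptions made for illustration.

```python
import random


def paired_probability(step, total_steps, start=0.0, end=1.0):
    """Probability of drawing a paired example at a given training step.

    Assumption: a linear ramp from `start` to `end`. The description only
    says the transition is gradual; the real schedule may differ.
    """
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + frac * (end - start)


def curriculum_stream(unpaired, paired, total_steps, seed=0):
    """Yield (step, source_name, example), shifting from unpaired to paired."""
    rng = random.Random(seed)
    for step in range(total_steps):
        p = paired_probability(step, total_steps)
        if rng.random() < p:
            yield step, "paired", rng.choice(paired)
        else:
            yield step, "unpaired", rng.choice(unpaired)


if __name__ == "__main__":
    # Hypothetical placeholder records, not real antibody sequences.
    unpaired = ["UNPAIRED_SEQ_A", "UNPAIRED_SEQ_B"]
    paired = ["PAIRED_HEAVY_LIGHT_A", "PAIRED_HEAVY_LIGHT_B"]
    for step, source, example in curriculum_stream(unpaired, paired, 11):
        print(step, source, example)
```

Sampling per step, rather than switching datasets at a fixed point, keeps some unpaired sequences in the mix until late in training, which is the property the description credits with reducing catastrophic forgetting.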

Created:

2025-02-27
