
A curriculum learning approach to training antibody language models

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14661301
Description:
Motivation. There is growing interest in pre-training antibody language models (AbLMs) on a mixture of unpaired and natively paired sequencing data. This interest stems from the proven benefits of pre-training with natively paired sequences, combined with their relative scarcity compared to unpaired data. Existing models trained on both unpaired and paired data typically use a fine-tuning approach to pre-training, but this can result in catastrophic forgetting of the unpaired sequences. We explore whether a curriculum learning approach to pre-training could help address these issues.

Results. We introduce a modified version of curriculum learning for training AbLMs that changes how training data are sampled over the course of training, resulting in a gradual transition from unpaired to paired data. We optimize this approach and observe that our 650M-parameter curriculum model, CurrAb, outperforms existing AbLMs in downstream classification tasks.

Files. The following files are included in this repository:
CurrAb.tar.gz: Model weights for the CurrAb model. The model can also be downloaded from HuggingFace.
TTE_paired-downsampled.tar.gz: Downsampled paired datasets used to train, test, and evaluate the 55M-parameter models.
TTE_unpaired-downsampled.tar.gz: Downsampled unpaired datasets used to train, test, and evaluate the 55M-parameter models.
TTE_paired-full.tar.gz: Full paired datasets used to train, test, and evaluate the 650M-parameter models, including CurrAb.
TTE_unpaired-full.tar.gz: Full unpaired datasets used to train, test, and evaluate the 650M-parameter models, including CurrAb.
classification-datasets.tar.gz: Three classification datasets used to train the classification models in Figure 5: flu-0_cov-1.csv, hd-0_cov-1.csv, and hd-0_flu-1_cov-2.csv. CoV antibody sequences were obtained from CoV-AbDab, Flu antibody sequences from Wang et al., and healthy donor antibody sequences from Jaffe et al. and Phad et al.

Code: The code for model training and evaluation is available under the MIT license on GitHub.
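
The gradual unpaired-to-paired transition described under Results can be illustrated with a minimal sketch. The description does not state the schedule CurrAb actually uses, so the linear ramp below is an assumption, and the names paired_probability and sample_batch are hypothetical helpers, not part of the released code.

```python
import random


def paired_probability(step: int, total_steps: int) -> float:
    """Probability of drawing a paired example at a given training step.

    ASSUMPTION: a linear ramp from 0.0 (all unpaired) at the first step
    to 1.0 (all paired) at the last step. The real schedule may differ.
    """
    if total_steps <= 1:
        return 1.0
    frac = step / (total_steps - 1)
    return min(1.0, max(0.0, frac))


def sample_batch(unpaired, paired, step, total_steps, batch_size, rng):
    """Draw one batch, choosing the source pool per example by the schedule."""
    p = paired_probability(step, total_steps)
    batch = []
    for _ in range(batch_size):
        # rng.random() is in [0, 1), so p == 0 never picks paired
        # and p == 1 always picks paired.
        source = paired if rng.random() < p else unpaired
        batch.append(rng.choice(source))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    unpaired_pool = ["U1", "U2", "U3"]  # placeholder strings, not real sequences
    paired_pool = ["P1", "P2", "P3"]    # placeholder strings, not real sequences
    for step in (0, 50, 99):
        print(step, sample_batch(unpaired_pool, paired_pool, step, 100, 8, rng))
```

Sampling the source per example, instead of switching datasets at a fixed point, keeps some unpaired data in the mix until late in training, which is the property the description associates with avoiding catastrophic forgetting.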
Created:
2025-02-27