Focused learning by antibody language models using preferential masking of non-templated regions

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/13973759

Description:

Motivation. While existing antibody language models (AbLMs) excel at predicting germline residues, they often struggle with mutated and non-templated residues, which concentrate in the complementarity-determining regions (CDRs) and are crucial for determining antigen-binding specificity. Many of these models are trained using a masked language modeling (MLM) objective with uniform masking probabilities; however, antibody recombination is modular in nature, creating relatively distinct regions of high and low complexity (non-templated and templated, respectively). We sought to determine whether and to what extent AbLMs can improve when trained using an alternative masking strategy based on this observation.

Results. We developed a variation on MLM called Preferential Masking, which alters masking probabilities to amplify training signals from the CDR3. We pre-trained two AbLMs using either uniform or preferential masking and observed that the latter improves pre-training efficiency and residue prediction accuracy in the highly variable CDR3. Preferential masking also improves antibody classification by native chain pairing and binding specificity, suggesting improved CDR3 understanding and indicating that non-random, learnable patterns help govern antibody chain pairing. We further show that specificity classification is largely informed by residues in the CDRs, demonstrating that AbLMs learn meaningful patterns that align with immunological understanding.

Files. The following files are included in this repository:

uniform_250k.tar.gz: Model weights for the Uniform-250k model.
uniform_350k.tar.gz: Model weights for the Uniform-350k model.
preferential_250k.tar.gz: Model weights for the Preferential-250k model.
train-eval-test_cdr-mask.tar.gz: Datasets used to train all three models above. Compressed folder containing three files: A_train.csv, A_eval.csv, and B_test.csv. Each row contains a natively paired sequence with its corresponding label-encoded CDR mask, designed to align with the tokenized amino acid sequence. Sequences were obtained from Jaffe et al. and Hurtado et al. These are referenced in the paper as Dataset A (A_train.csv, A_eval.csv) and Dataset B (B_test.csv).
test-set_annotations.tar.gz: Unpaired annotations for all test set (Dataset B) sequences: B_test-set_annotations.csv. Used for Fig. 3 and Fig. 4D. Annotations can be mapped back to the paired sequences using their `sequence_id` and `locus` information.
pair_classification.tar.gz: Two classification datasets used to train the classifier models in Figure 4: C_native-0_shuffled-1.csv (Dataset C) and D_native-0_shuffled-1.csv (Dataset D). Dataset C sequences were obtained from Jaffe et al. and Hurtado et al. (Dataset B), and Dataset D sequences were obtained from Phad et al. and data generated as part of this study.
CoV_classification.tar.gz: Classification dataset used to train the classifier models in Figure 5: E_hd-0_cov-1.csv (Dataset E). CoV antibody sequences were obtained from CoV-AbDAb, and healthy donor sequences were obtained from Phad et al.

Code. All code used for model training, testing, and figure generation is available under the MIT license on GitHub.
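The idea behind preferential masking, as described under Results, can be illustrated with a minimal sketch: each token receives a masking probability that depends on its region label, so CDR3 positions are masked more often than the rest of the sequence. This is not the authors' implementation. The label value `CDR3_LABEL`, the mask token string, the function names, and the two probabilities are all hypothetical placeholders chosen for illustration, and the sketch omits details that a real MLM pipeline typically adds (special tokens, random or unchanged token replacement).

```python
import random

# Hypothetical placeholders: the actual label encoding and mask token
# used in the released datasets and models may differ.
MASK_TOKEN = "<mask>"
CDR3_LABEL = 3


def masking_probabilities(region_labels, p_default=0.15, p_cdr3=0.25):
    """Return one masking probability per position.

    Positions labeled as CDR3 get p_cdr3; all others get p_default.
    Both default values are illustrative assumptions.
    """
    return [p_cdr3 if label == CDR3_LABEL else p_default
            for label in region_labels]


def preferential_mask(tokens, region_labels, p_default=0.15, p_cdr3=0.25,
                      rng=None):
    """Mask tokens independently, with a higher rate inside the CDR3.

    Returns (masked_tokens, targets), where targets holds the original
    token at masked positions and None elsewhere.
    """
    if len(tokens) != len(region_labels):
        raise ValueError("tokens and region_labels must have equal length")
    rng = rng or random.Random()
    probs = masking_probabilities(region_labels, p_default, p_cdr3)

    masked_tokens, targets = [], []
    for token, prob in zip(tokens, probs):
        if rng.random() < prob:
            masked_tokens.append(MASK_TOKEN)
            targets.append(token)      # the model must predict this residue
        else:
            masked_tokens.append(token)
            targets.append(None)       # position excluded from the loss
    return masked_tokens, targets


if __name__ == "__main__":
    # Toy sequence with a three-residue stretch labeled as CDR3.
    toy_tokens = list("ACDEFGHIK")
    toy_labels = [0, 0, 0, 3, 3, 3, 0, 0, 0]
    print(preferential_mask(toy_tokens, toy_labels, rng=random.Random(0)))
```

Setting `p_cdr3` equal to `p_default` recovers uniform masking, which makes the two strategies easy to compare with the same code path.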

Created:

2024-11-06
