Improving antibody language models with native pairing

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/8237395

Description:

Motivation. Existing large language models designed to predict antibody structure and function have been trained exclusively with unpaired antibody sequences. This is a substantial drawback, as each antibody represents a unique pairing of heavy and light chains that both contribute to antigen recognition. The cost of generating large datasets of natively paired antibody sequences is orders of magnitude higher than the cost of unpaired sequences, and the paucity of available paired antibody sequence datasets precludes training a state-of-the-art language model using only paired training data. Here, we sought to determine whether and to what extent natively paired training data improves model performance.

Results. Using a unique and recently reported dataset of approximately 1.6 × 10^6 natively paired human antibody sequences, we trained two baseline antibody language model (BALM) variants: BALM-paired and BALM-unpaired. We quantify the superiority of BALM-paired over BALM-unpaired, and we show that BALM-paired's improved performance can be attributed at least in part to its ability to learn cross-chain features that span natively paired heavy and light chains. Additionally, we fine-tuned the general protein language model ESM-2 using these paired antibody sequences and report that the fine-tuned model, but not base ESM-2, demonstrates a similar understanding of cross-chain features.

Files. The following files are included in this repository:

BALM-paired.tar.gz: Model weights for the BALM-paired model.

BALM-shuffled.tar.gz: Model weights for the BALM-shuffled model.

BALM-unpaired.tar.gz: Model weights for the BALM-unpaired model.

ESM2-650M_paired-fine-tuned.tar.gz: Model weights for the 650M-parameter ESM-2 model after fine-tuning with natively paired antibody sequences.

jaffe-paired-dataset_airr-annotation.tar.gz: All natively paired antibody sequences from the Jaffe dataset, annotated with abstar and subsequently filtered to remove duplicates and unproductive sequences. The annotated sequences are provided in an AIRR-compliant format.

test-dataset_annotated.tar.gz: Two csv files, both with sequences annotated in an AIRR-compliant format. lc-coherence_test-unique_annotated.csv contains all sequences from the test dataset, and fig3-20kembeddings_annotated.csv contains the 20k sequences from the test dataset used for the Figure 2 UMAP embeddings. In both files, heavy and light chain sequences can be paired using their pair_id.

train-test-eval_paired.tar.gz: Datasets used to train, test, and evaluate the BALM-paired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line. This dataset was also used to fine-tune the 650M-parameter ESM-2 variant.

train-test-eval_shuffled.tar.gz: Datasets used to train, test, and evaluate the BALM-shuffled model. Compressed folder containing three csv files, each with two columns, one for the heavy chain and one for the light chain.

train-test-eval_unpaired.tar.gz: Datasets used to train, test, and evaluate the BALM-unpaired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line.

classification-datasets.tar.gz: Three classification datasets used to train the classification models in Figure 5. The datasets are flu-0_cov-1.csv, hd-0_cov-1.csv, and hd-0_flu-1_cov-2.csv. CoV antibody sequences were obtained from CoV-AbDab, Flu antibody sequences were obtained from Wang et al., and healthy donor antibody sequences were obtained from Hurtado et al.

Code: All code used for model training, testing, and figure generation is available under the MIT license on GitHub. An archived version of the GitHub repository (from the time of manuscript publication) is included here as code-archive.zip.
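The description says the annotated test csv files can be paired by pair_id. The following is a minimal Python sketch of how that pairing might be done, not the authors' code. Only the pair_id column is named in the description. The locus and sequence_aa column names are assumptions borrowed from the AIRR rearrangement schema, the helper pair_by_id is hypothetical, and the demo rows are made up, so check the actual csv headers before using it.

```python
import csv
import io
from collections import defaultdict


def pair_by_id(csv_file, id_col="pair_id", locus_col="locus", seq_col="sequence_aa"):
    """Group annotated rows by pair ID and return {pair_id: (heavy, light)}.

    locus_col and seq_col are assumed AIRR-style column names; adjust them
    to match the real file headers.
    """
    groups = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        # Assumption: heavy chains carry locus "IGH"; anything else is light.
        chain = "heavy" if row[locus_col] == "IGH" else "light"
        groups[row[id_col]][chain] = row[seq_col]
    # Keep only IDs for which both chains were found.
    return {
        pid: (chains["heavy"], chains["light"])
        for pid, chains in groups.items()
        if "heavy" in chains and "light" in chains
    }


# Made-up demo rows standing in for an annotated csv file.
demo = io.StringIO(
    "pair_id,locus,sequence_aa\n"
    "p1,IGH,EVQLVESGG\n"
    "p1,IGK,DIQMTQSPS\n"
    "p2,IGH,QVQLQESGP\n"
)
pairs = pair_by_id(demo)
print(pairs)
```

For a real file, pass an open file handle instead of the in-memory demo, for example open("lc-coherence_test-unique_annotated.csv", newline=""). IDs that lack either chain are dropped from the result.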

Created:

2024-07-16
