Training and test data for antibody humanness evaluation

NIAID Data Ecosystem (indexed 2026-05-01)

Download link:

https://zenodo.org/record/10530384

Resource description:

### Training and test data for humanness evaluation

This data was collected in conjunction with, and used for training and testing in, Parkinson / Wang et al. 2024. The data is organized as follows:

- Heavy chain training and multispecies test data (under the heavy chain folder)
  - The consolidated cAb rep file contains the human training sequences
  - The test sample sequences folder contains fasta files with test sequences for each species
- Light chain training and multispecies test data (under the light chain folder)
  - The consolidated cAb rep file contains the human training sequences
  - The test sample sequences folder contains fasta files with test sequences for each species
- Abybank data (under the abybank compiled data folder)
  - This folder contains separate folders for heavy and light chain
  - Each subfolder contains test data for a more diverse species set, with one fasta file per species
- Humanization test data (under the humanization test data folder)
  - The sequences in the parental.fa file were originally humanized as part of drug discovery programs
  - The experimental.fa file contains the humanization results
- IMGT and ADA data (under the imgt test data folder)
  - The imgt mab db fa and tsv files contain sequences and species assignments for IMGT mAb DB
  - The thera ada fa file contains sequences evaluated in the clinic
  - The Therapeutic ADA txt file contains anti-drug antibody results for those antibodies
- VDJ statistics (under the vdj_statistics_eval folder)

The data was retrieved from the following sources.

1. All heavy and light chain training data is from the cAb-Rep database from [Guo et al.](https://pubmed.ncbi.nlm.nih.gov/31649674/)
2. All testing data is from the Observed Antibody Space [(OAS) database](https://opig.stats.ox.ac.uk/webapps/oas/)

The training and test data shown here has already been filtered for quality. The testing data was additionally randomly sampled to yield a set of 50,000 sequences for each species, then filtered to remove duplicates. The human test data was checked to ensure no overlap with the human training set.

The IMGT, ADA and humanization test data was retrieved from Prihoda et al. and the associated [GitHub repo](https://github.com/Merck/BioPhi-2021-publication). See Parkinson et al. 2024 and the associated GitHub repos for more details on how models other than SAM / AntPack were evaluated on this data.
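The sampling, de-duplication, and overlap check described above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' actual pipeline. The function names `parse_fasta` and `sample_and_deduplicate`, the fixed seed, and the choice to treat two records as duplicates when their sequence strings match exactly are all assumptions made for the example.

```python
import random
from typing import Iterable, Iterator


def parse_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_and_deduplicate(records, n, seed=0, exclude=frozenset()):
    """Randomly sample up to n records, then drop duplicate sequences.

    `exclude` is a hypothetical hook for removing sequences already
    present in another set (for example, a training set).
    """
    records = list(records)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sampled = rng.sample(records, min(n, len(records)))
    seen = set(exclude)
    kept = []
    for header, seq in sampled:
        # Exact string match is the assumed duplicate criterion.
        if seq not in seen:
            seen.add(seq)
            kept.append((header, seq))
    return kept


# Tiny in-memory example; real inputs would be the per-species fasta files.
example_lines = [">a", "QVQL", "VQSG", ">b", "EVQL", ">c", "QVQLVQSG", ">d", "DIQM"]
test_records = sample_and_deduplicate(parse_fasta(example_lines), n=50_000)
```

Because duplicates are removed after sampling, the final set for a species can be smaller than the sample size, which is consistent with the order of operations stated in the description.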

Created:

2024-03-23
