
Training and test data for antibody humanness evaluation

Indexed by NIAID Data Ecosystem on 2026-05-01
Download link: https://zenodo.org/record/10530384
Description:
### Training and test data for humanness evaluation

This data was collected in conjunction with, and used for training and testing in, Parkinson / Wang et al. 2024. The data is organized as follows:

- Heavy chain training and multispecies test data (under the heavy chain folder)
    - The consolidated cAb rep file contains the human training sequences
    - The test sample sequences folder contains FASTA files with test sequences for each species
- Light chain training and multispecies test data (under the light chain folder)
    - The consolidated cAb rep file contains the human training sequences
    - The test sample sequences folder contains FASTA files with test sequences for each species
- Abybank data (under the abybank compiled data folder)
    - This folder contains separate folders for heavy and light chain
    - Each subfolder contains test data for a more diverse species set, with one FASTA file per species
- Humanization test data (under the humanization test data folder)
    - The sequences in the parental.fa file were originally humanized as part of drug discovery programs
    - The experimental.fa file contains the humanization results
- IMGT and ADA data (under the imgt test data folder)
    - The imgt mab db fa and tsv files contain sequences and species assignments for IMGT mAb DB
    - The thera ada fa file contains sequences evaluated in the clinic
    - The Therapeutic ADA txt file contains anti-drug antibody results for those antibodies
- VDJ statistics (under the vdj_statistics_eval folder)

The data was retrieved from the following sources:

1. All heavy and light chain training data is from the cAb-Rep database from [Guo et al.](https://pubmed.ncbi.nlm.nih.gov/31649674/)
2. All testing data is from the Observed Antibody Space [(OAS) database](https://opig.stats.ox.ac.uk/webapps/oas/)

The training and test data shown here have already been filtered for quality. The testing data was additionally randomly sampled to yield a set of 50,000 sequences for each species, then filtered to remove duplicates. The human test data was checked to ensure no overlap with the human training set.

The IMGT, ADA and humanization test data was retrieved from Prihoda et al. and the associated [GitHub repo](https://github.com/Merck/BioPhi-2021-publication). See Parkinson et al. 2024 and the associated GitHub repos for more details on how models other than SAM / AntPack were evaluated on this data.
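The sampling, deduplication, and overlap check described above can be illustrated with a short Python sketch. This is not the authors' pipeline: the function names `read_fasta` and `build_test_set`, the fixed seed, and the choice to drop overlapping sequences (rather than only report them) are assumptions made for illustration.

```python
import random
from typing import Iterable, Iterator, Optional


def read_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may wrap across lines
    if header is not None:
        yield header, "".join(chunks)


def build_test_set(
    candidates: Iterable[str],
    training: Optional[Iterable[str]] = None,
    n: int = 50_000,
    seed: int = 0,
) -> list[str]:
    """Sample up to n sequences, drop duplicates, then drop training overlap.

    Hypothetical helper: the order of steps follows the description
    (sample first, then deduplicate), so the result may hold fewer than n.
    """
    rng = random.Random(seed)
    pool = list(candidates)
    sampled = rng.sample(pool, min(n, len(pool)))
    unique = list(dict.fromkeys(sampled))  # order-preserving deduplication
    if training is None:
        return unique
    training_set = set(training)
    return [seq for seq in unique if seq not in training_set]
```

In practice the lines would come from an open file handle, for example `read_fasta(open(path))` for one of the per-species FASTA files, with `training` supplied only for the human test set.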
Created: 2024-03-23