Datasets for AGIMA-Score modeling - upd

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/13352485


Description:

This data repository includes the following datasets.

(1) 'training.zip' -- A secondary dataset derived from the Refined Set of the PDBbind database (version V2020). When training models such as AGIMA-Score, the complexes that appear in the validation/test sets need to be removed. 5007 complexes are included.

(2) 'validation.zip' -- A secondary dataset derived from the Core Set (CASF-2016) of the PDBbind database (version V2020). It was used as the validation set (for parameter tuning) when building the AGIMA-Score models. Complexes similar to those in the training set (protein sequence similarity > 0.3 and ligand similarity > 0.7) were removed. 195 complexes are included.

(3) 'test1.zip' -- A secondary dataset derived from CSAR-HiQ1. It was used as the Test1 set for evaluating the AGIMA-Score models. Complexes similar to those in the training/validation sets (protein sequence similarity > 0.3 and ligand similarity > 0.7) were removed. 116 complexes are included.

(4) 'test2.zip' -- A secondary dataset derived from CSAR-HiQ2. It was used as the Test2 set for evaluating the AGIMA-Score models. Complexes similar to those in the training/validation/Test1 sets (protein sequence similarity > 0.3 and ligand similarity > 0.7) were removed. 102 complexes are included.

(5) 'indexes.zip' -- The labels (binding affinity data) for the complexes in sets (1)~(4).

Each file 'xxxx_atm_prop.txt' represents one protein-ligand complex in the sets above, where 'xxxx' is the original complex ID in PDBbind. Each row of such a file describes one atom of the complex, with the following data fields:

--------------------------------------------------------------------------------
id - atom id, with protein atoms starting from 1 and ligand atoms also starting from 1 (integer)
atmnum - atomic number (integer)
x,y,z - the X, Y, Z coordinates of the atom (float)
atmB,atmC,atmN,atmO,atmP,atmS,atmSe - whether the atom is of a specific element type: B, C, N, O, P, S or Se (binary)
atmHalogen - whether the atom is a halogen atom (binary)
atmMetal,atmMetallic - whether the atom is a metal (binary)
hybridization - hybridization type of the atom (integer)
heavyneighbors - number of heavy-atom neighbors (integer)
heteroneighbors - number of hetero-atom neighbors (integer)
hydrophobic,aromatic,acceptor,donor,ring - pharmacophoric properties of the atom (binary)
partialCH - partial charge of the atom (float)
posionizable,negionizable - whether the atom is positively or negatively ionizable (binary)
exlvolume - excluded volume of the atom (float)
vdwrad - van der Waals radius of the atom (float)
moltype - molecule the atom belongs to (0 for protein, 1 for ligand)
"neighbors(nbr:idx--anum--(sbond,dbond,tbond,arombond,ringbond))" - information on the covalently bonded neighbors of the atom, in the form nbr:index--atomic number--(single bond, double bond, triple bond, aromatic bond, ring bond)
--------------------------------------------------------------------------------

(6) 'trained_AGIMAscore18.zip' -- The AGIMA-Score18 model trained with the training and validation (parameter tuning) data above. The model was saved in Keras format and compressed into a ZIP file.

(7) 'predictions_byAGIMAscore18.zip' -- The predictions generated by the AGIMA-Score18 model for the validation and test sets. Three files are included: 'predictions_validation.csv', 'predictions_test1.csv' and 'predictions_test2.csv'.
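The redundancy rule stated for the validation, Test1 and Test2 sets (a complex is dropped when protein sequence similarity > 0.3 and ligand similarity > 0.7 against a complex in an earlier set) can be sketched in Python as below. This is only an illustration of the stated thresholds, not the authors' pipeline: the description does not say how either similarity was computed, so `protein_similarity` and `ligand_similarity` are hypothetical callables, the function names are made up for this example, and the complex IDs and similarity values in the usage part are invented toy data.

```python
from typing import Callable, Iterable, List

# Thresholds quoted in the dataset description.
PROTEIN_SIM_CUTOFF = 0.3
LIGAND_SIM_CUTOFF = 0.7


def is_redundant(protein_sim: float, ligand_sim: float) -> bool:
    """A pair counts as similar only if BOTH thresholds are exceeded."""
    return protein_sim > PROTEIN_SIM_CUTOFF and ligand_sim > LIGAND_SIM_CUTOFF


def filter_candidates(
    candidates: Iterable[str],
    references: Iterable[str],
    protein_similarity: Callable[[str, str], float],  # hypothetical scorer
    ligand_similarity: Callable[[str, str], float],   # hypothetical scorer
) -> List[str]:
    """Keep candidates that are not similar to any reference complex."""
    refs = list(references)
    kept = []
    for cand in candidates:
        similar_to_any = any(
            is_redundant(protein_similarity(cand, ref), ligand_similarity(cand, ref))
            for ref in refs
        )
        if not similar_to_any:
            kept.append(cand)
    return kept


# Toy usage with invented IDs and similarity values (not real data).
protein_sims = {("c1", "r1"): 0.9, ("c2", "r1"): 0.9, ("c3", "r1"): 0.1}
ligand_sims = {("c1", "r1"): 0.8, ("c2", "r1"): 0.5, ("c3", "r1"): 0.95}

kept = filter_candidates(
    ["c1", "c2", "c3"],
    ["r1"],
    lambda a, b: protein_sims[(a, b)],
    lambda a, b: ligand_sims[(a, b)],
)
print(kept)
```

In the toy data only "c1" exceeds both thresholds, so it is the only complex dropped. A complex with a highly similar protein but a dissimilar ligand ("c2"), or the reverse ("c3"), is kept under this reading of the rule.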


Created:

2025-04-12
