
Random forest algorithm reveals novel sites in HA protein that shift receptor binding preference of the H9N2 avian influenza virus

Source: DataCite Commons. Updated: 2025-04-27. Indexed: 2025-04-16.
Download link:
https://www.scidb.cn/detail?dataSetId=2455ffceffa04afca6c72c51f435fd1b
Description:
After removing redundant and incomplete sequences, sequences from environmental sources, and sequences with unclear information, a total of 5,656 H9N2 HA gene sequences were obtained from the National Center for Biotechnology Information (NCBI), Global Initiative on Sharing All Influenza Data (GISAID), and Influenza Research Database (IRD) databases. The HA sequences were divided by host into two labeled datasets: avian-derived sequences (5,588) and mammal-derived sequences (68). Each amino acid type was replaced with a numerical value to transform the amino acid sequences into machine-readable vectors (Supplementary Table S1). The random forest classifier was trained with the RandomForestClassifier class in scikit-learn (Sklearn, 1.0.22) (Abraham et al., 2014). Because the numbers of avian and non-avian sequences differ greatly, a balanced number of samples was selected for training by random under-sampling of the avian-derived data. The training parameters were: random_state=0, n_estimators=1500, oob_score=True, n_jobs=-1, class_weight='balanced'. Model performance was evaluated by five-fold cross-validation, with the area under the ROC curve (AUC) as the evaluation metric. For feature selection, the random forest classifier was trained on all data, and after training the weight of each site was extracted to represent its importance.
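The workflow described above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the integer encoding is a hypothetical stand-in for the mapping in Supplementary Table S1, the helper names (encode, undersample, make_forest, toy_seqs) are invented here, the sequences are synthetic toy data with one artificially discriminative site, and under-sampling is done once before cross-validation as a simplifying assumption. Only the classifier parameters are taken from the description; the demo lowers n_estimators so it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical encoding; the real mapping is in Supplementary Table S1.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 = gap/unknown


def encode(seqs):
    """Turn equal-length aligned sequences into an integer matrix."""
    return np.array([[AA_TO_INT.get(aa, 0) for aa in s] for s in seqs])


def undersample(X, y, rng):
    """Randomly drop majority-class rows until all classes have equal size."""
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    keep = np.concatenate(
        [rng.choice(np.flatnonzero(y == c), size=n, replace=False) for c in classes]
    )
    return X[keep], y[keep]


def make_forest(n_estimators=1500):
    """Classifier with the parameters listed in the description."""
    return RandomForestClassifier(
        random_state=0, n_estimators=n_estimators, oob_score=True,
        n_jobs=-1, class_weight="balanced",
    )


def toy_seqs(n, marker, rng, length=20):
    """Synthetic sequences whose site 5 is fixed to `marker` (toy data only)."""
    seqs = []
    for _ in range(n):
        s = list(rng.choice(list(AMINO_ACIDS), size=length))
        s[5] = marker
        seqs.append("".join(s))
    return seqs


rng = np.random.default_rng(0)
avian = toy_seqs(300, "Q", rng)   # majority class, label 0
mammal = toy_seqs(30, "L", rng)   # minority class, label 1
X = encode(avian + mammal)
y = np.array([0] * len(avian) + [1] * len(mammal))

# Performance evaluation: balanced subset, five-fold CV, AUC as the metric.
X_bal, y_bal = undersample(X, y, rng)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(make_forest(50), X_bal, y_bal, cv=cv, scoring="roc_auc").mean()

# Feature selection: train on all data, read per-site importance.
forest = make_forest(50).fit(X, y)
importance = forest.feature_importances_
top_site = int(np.argmax(importance))
print(f"mean AUC = {auc:.3f}, most important site index = {top_site}")
```

In a real run, encode would receive the aligned HA sequences and make_forest would be called with its default of 1500 trees. The feature_importances_ attribute used here is scikit-learn's impurity-based importance; whether this is the exact "weight information" the authors extracted is an assumption.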
Provider:
Science Data Bank
Created:
2024-12-25