Trained checkpoints and preprocessed data for "Closing the gap on a $0 budget: ensembling public molecular foundation models for HIV bioactivity prediction"
Source: DataCite Commons. Updated: 2026-05-03. Indexed: 2026-05-07.
Download link:
https://zenodo.org/doi/10.5281/zenodo.19946459
Description:
Trained model checkpoints, normalization statistics, a fitted ensemble stacker, and a preprocessed graph cache supporting the HIV bioactivity prediction preprint by Agarwal (2026).
Contents:
- best_molformer_fold{0..4}.pth: Five MolFormer-XL checkpoints fine-tuned on MoleculeNet HIV scaffold-CV folds. Each ~170 MB.
- best_gnn_fold{0..4}_v5_desc.pth: Five GATv2-based GNN ("v5b") checkpoints trained from scratch on the same folds.
- global_feature_stats_v5_desc_fold{0..4}.pt: Per-fold means/stds for the RDKit global descriptors (z-score normalization).
- ensemble_stacker.pt: Logistic stacker coefficients, three principled decision thresholds (Youden's J / F1-max / base-rate), and raw out-of-fold prediction arrays for n=24,391 molecules.
- hiv_preprocessed_cache_v5_desc.pt: 41,119 RDKit-parsed molecules as PyTorch Geometric Data objects with atom features (23-dim), bond features (8-dim), global descriptors, and Bemis-Murcko scaffolds. Reproduces the exact deterministic 5-fold scaffold split used in training.
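The Youden's J and F1-max thresholds stored in ensemble_stacker.pt are standard criteria, and the sketch below shows how such thresholds can be derived from labels and out-of-fold scores. It is a minimal pure-Python illustration, not the author's code: the function names (`confusion_at`, `youden_j`, `f1_score`, `best_threshold`) and the toy data are hypothetical, and the base-rate threshold is left out because this record does not define it.

```python
from typing import Callable, Sequence, Tuple

Counts = Tuple[int, int, int, int]  # (tp, fp, tn, fn)


def confusion_at(y_true: Sequence[int], y_prob: Sequence[float], thr: float) -> Counts:
    """Count outcomes when every score >= thr is predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_prob):
        predicted_positive = score >= thr
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def youden_j(counts: Counts) -> float:
    """Youden's J = true positive rate - false positive rate."""
    tp, fp, tn, fn = counts
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr - fpr


def f1_score(counts: Counts) -> float:
    """F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, _tn, fn = counts
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    criterion: Callable[[Counts], float],
) -> Tuple[float, float]:
    """Scan each distinct score as a candidate threshold; keep the first maximizer."""
    best_thr, best_val = 0.0, float("-inf")
    for thr in sorted(set(y_prob)):
        val = criterion(confusion_at(y_true, y_prob, thr))
        if val > best_val:
            best_thr, best_val = thr, val
    return best_thr, best_val


# Toy data invented for illustration only (not taken from the archive).
toy_labels = [0, 0, 0, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.7]
j_thr, j_val = best_threshold(toy_labels, toy_scores, youden_j)
f1_thr, f1_val = best_threshold(toy_labels, toy_scores, f1_score)
```

On an imbalanced benchmark the two criteria can select different thresholds, which may be why the archive stores more than one.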
These artifacts reproduce the headline test AUC of 0.806 ± 0.018 on the MoleculeNet HIV scaffold-split benchmark. Source code is at https://github.com/v659/HIV-drug-discovery.
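Before attempting to reproduce the result, it may help to confirm that every listed file has been downloaded. The sketch below expands the `fold{0..4}` patterns from the contents list into concrete file names and reports any that are missing. It assumes the pattern denotes fold indices 0 through 4 and that all files sit flat in one directory; the helper names (`expected_artifacts`, `missing_artifacts`) are hypothetical and not part of the source repository.

```python
from pathlib import Path
from typing import List

FOLDS = range(5)  # assumption: "{0..4}" means fold indices 0, 1, 2, 3, 4


def expected_artifacts() -> List[str]:
    """Expand the file-name patterns from the record into concrete names."""
    names: List[str] = []
    for k in FOLDS:
        names.append(f"best_molformer_fold{k}.pth")
        names.append(f"best_gnn_fold{k}_v5_desc.pth")
        names.append(f"global_feature_stats_v5_desc_fold{k}.pt")
    names.append("ensemble_stacker.pt")
    names.append("hiv_preprocessed_cache_v5_desc.pt")
    return names


def missing_artifacts(directory: str) -> List[str]:
    """Return the expected file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in expected_artifacts() if not (root / name).is_file()]
```

Calling `missing_artifacts("path/to/download")` returns an empty list when all 17 files are in place.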
License: MIT (matches the source repository).
Provider:
Zenodo
Created:
2026-05-01



