Dual loop active learning of hydrophobicity of patterned SAMs

Indexed by NIAID Data Ecosystem on 2026-03-13
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.41ns1rnft
Description:
Hydrophobic interactions drive numerous biological and synthetic processes. The materials used in these processes often possess chemically heterogeneous surfaces that are characterized by diverse chemical groups positioned in close proximity at the nanoscale; examples include functionalized nanomaterials and biomolecules like proteins and peptides. Nonadditive contributions to the hydrophobicity of such surfaces depend on the chemical identities and spatial patterns of polar and nonpolar groups in ways that remain poorly understood. Here, we develop a dual-loop active learning framework that combines a fast, reduced-accuracy method (a convolutional neural network) with a slow, higher-accuracy method (molecular dynamics simulations with enhanced sampling) to efficiently predict the hydration free energy (HFE), a thermodynamic descriptor of hydrophobicity, for nearly 200,000 chemically heterogeneous self-assembled monolayers (SAMs). Analysis of this data set reveals that SAMs with distinct polar groups exhibit substantial variations in hydrophobicity as a function of their composition and patterning, but the clustering of nonpolar groups is a common signature of highly hydrophobic patterns. Further MD analysis relates such clustering to the perturbation of interfacial water structure. These results provide new insight into the influence of chemical heterogeneity on hydrophobicity via quantitative analysis of a large set of surfaces, enabled by the active learning approach.

Paper title: Identifying Nonadditive Contributions to the Hydrophobicity of Chemically Heterogeneous Surfaces via Dual-Loop Active Learning
Authors: Atharva Kelkar, Bradley Dallin, Reid Van Lehn
DOI: doi.org/10.1063/5.0072385

Methods

This folder contains files to reproduce and analyze molecular dynamics (MD) trajectories of a large set of patterned self-assembled monolayers (SAMs). Each pattern combines a nonpolar group with a polar group (amine, amide, or hydroxyl).

The dataset is split into 2 major parts:

1. trajectories - Tar files containing equilibrium and short production trajectories (GROMACS xtc files) for INDUS-labelled patterns from 3 different parts of the dual-loop active learning algorithm. This folder also contains the initial configurations, topology files, mdp files, and CHARMM inputs needed to reproduce or extend the trajectories. The 3 parts of the dual-loop active learning process are as follows:
    a. Seed runs - Randomly chosen patterns used to initiate the dual-loop active learning process.
    b. GPR runs - Patterns identified during the slow loop of the active learning process.
    c. Max-dev runs - Patterns predicted to have the highest and lowest HFEs for a given polar area fraction, identified after training of the active learning loop was complete.

   Each trajectory folder contains a file titled "hfe_label.txt", which holds the calculated value of the HFE in units of kBT (with T = 300 K). All simulations were performed at constant volume and temperature (NVT) with the TIP4P/2005 water model, using the force field files supplied in the top-level charmm36-jul2017.ff directory. The name of the polar end group for each SAM is specified in the folder name.

2. 'collated_histograms.pickle' - Pickle containing a pre-processed dataset with 20x20 oxygen and hydrogen number density histograms corresponding to the trajectories in the 'trajectories/' folder. The pickle file contains the following data:
    a. 'histograms' - Hydrogen and water density histograms (numpy arrays) of size [n_frames, 2, 400]
    b. 'labels' - INDUS-calculated HFE labels for each of the histograms
    c. 'ligand' - Ligand associated with each histogram
    d. 'run_type' - Category of each run (from point 1 above: Seed, GPR, or Max-dev)
    e. 'folder_name' - Folder name of the trajectory associated with each histogram

The objective of the collated histograms is to enable scientists to load a curated dataset of labels and histograms and apply data-centric tools to study the hydrophobicity of a large set of chemically heterogeneous surfaces with diverse end group chemistries.

All the data needed to train the 3D CNN referenced in the paper, i.e., idealized SAMs with amine, amide, and hydroxyl end groups, have already been shared publicly with our previous publication (Kelkar, Dallin, and Van Lehn, J Phys Chem B 124 (41), 2020) at the following link: https://zenodo.org/record/4485912.

All the code required to generate results and analyze data using the dual-loop active learning algorithm, along with the trained GPR models, is uploaded to a git repo: https://gitlab.com/atharva-kelkar/dual-loop-active-learning
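
As a starting point for working with the collated histograms, the sketch below shows one way the fields described above might be handled in Python. It is not taken from the authors' repository, and several details are assumptions: that the pickle unpickles to a dict keyed by the field names listed above, that each 400-element histogram is a row-major flattening of the 20x20 grid, and that run types are stored as the strings "Seed", "GPR", and "Max-dev". The helper names (load_collated, histograms_to_grids, summarize_by_run_type, hfe_kbt_to_kj_per_mol) are hypothetical. The demo runs on a synthetic stand-in, not on real data.

```python
import pickle

import numpy as np

# Molar gas constant in kJ/(mol K); multiplying by T converts kBT to kJ/mol.
GAS_CONSTANT_KJ_PER_MOL_K = 8.314462618e-3


def load_collated(path):
    """Unpickle the dataset. ASSUMPTION: the top-level object is a dict.

    Only unpickle files obtained from a source you trust.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


def hfe_kbt_to_kj_per_mol(hfe_kbt, temperature_k=300.0):
    """Convert HFE values from units of kBT (T = 300 K by default) to kJ/mol."""
    return np.asarray(hfe_kbt, dtype=float) * GAS_CONSTANT_KJ_PER_MOL_K * temperature_k


def histograms_to_grids(histograms, grid=20):
    """Reshape [n, 2, grid*grid] arrays to [n, 2, grid, grid].

    ASSUMPTION: row-major flattening; the dataset description does not say
    which grid axis varies fastest or which channel is which species.
    """
    arr = np.asarray(histograms)
    if arr.ndim != 3 or arr.shape[1] != 2 or arr.shape[2] != grid * grid:
        raise ValueError(f"expected shape [n, 2, {grid * grid}], got {arr.shape}")
    return arr.reshape(arr.shape[0], 2, grid, grid)


def summarize_by_run_type(labels, run_types):
    """Return {run_type: (count, mean HFE in kBT)}."""
    labels = np.asarray(labels, dtype=float)
    run_types = np.asarray(run_types)
    summary = {}
    for name in np.unique(run_types):
        mask = run_types == name
        summary[str(name)] = (int(mask.sum()), float(labels[mask].mean()))
    return summary


# Synthetic stand-in with the documented field names; the values are made up
# for illustration and are NOT taken from the dataset.
demo = {
    "histograms": np.zeros((4, 2, 400)),
    "labels": [100.0, 110.0, 90.0, 120.0],
    "run_type": ["Seed", "Seed", "GPR", "Max-dev"],
}

grids = histograms_to_grids(demo["histograms"])
print(grids.shape)
print(summarize_by_run_type(demo["labels"], demo["run_type"]))
print(hfe_kbt_to_kj_per_mol(demo["labels"]))
```

With the real file, replacing the stand-in by load_collated("collated_histograms.pickle") should be the only change needed if the assumptions hold; checking the type and keys of the loaded object first is a sensible precaution.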
Created:
2022-02-03