Project files provided as supporting information to the manuscript "Information-theoretical measures identify accurate low-resolution representations of protein configurational space"
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/6554497
Description:
The dataset contains the following compressed folder:
-Notebooks.zip:
This folder contains:
-python_script:
-RESREL.py: script performing the clustering and computing the Relevance-Resolution curves
-random_curves.py: script generating the random values and computing the corresponding Relevance-Resolution curves (curves_s)
-Cluster_distance_matrix.py: script returning the distances between clusters for a given partition.
-python_notebook:
-Exploratory_analysis.ipynb: Analysis performed on the 12-protein_dataset
-DMAPS_ANTI.ipynb: Diffusion Map for the Antibody
-DMAPS_COV_1ake.ipynb: Diffusion Map + Inter-Intra state decomposition of covariance for 1ake
Packages required for the usage of these python scripts/notebooks:
-numpy
-pandas
-matplotlib
-seaborn
-multiprocessing
-scipy
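As an illustration of what a clustering plus Relevance-Resolution calculation can look like, here is a minimal sketch. It is not the RESREL.py script itself. It assumes the standard definitions H[s] = -sum_s (k_s/M) log(k_s/M) for the Resolution and H[k] = -sum_k (k m_k/M) log(k m_k/M) for the Relevance, where k_s is the number of frames in cluster s and m_k is the number of clusters containing exactly k frames. The function names, the use of natural logarithms, and the choice of cutting the dendrogram with criterion="maxclust" are assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def resolution_relevance(labels):
    """Return (H[s], H[k]) in nats for one vector of cluster labels."""
    labels = np.asarray(labels).ravel()
    M = labels.size
    # k_s: number of frames assigned to each cluster s
    _, k_s = np.unique(labels, return_counts=True)
    p_s = k_s / M
    H_s = -np.sum(p_s * np.log(p_s))
    # m_k: number of clusters that contain exactly k frames
    k, m_k = np.unique(k_s, return_counts=True)
    p_k = k * m_k / M
    H_k = -np.sum(p_k * np.log(p_k))
    return float(H_s), float(H_k)


def curve_from_distance_matrix(dist, method="average", n_clusters_list=None):
    """Cut a hierarchical clustering at several sizes; return rows of (H[s], H[k])."""
    M = dist.shape[0]
    # linkage expects the condensed (upper-triangular) form of the matrix
    Z = linkage(squareform(dist, checks=False), method=method)
    if n_clusters_list is None:
        n_clusters_list = range(1, M + 1)
    points = []
    for n in n_clusters_list:
        labels = fcluster(Z, t=n, criterion="maxclust")
        points.append(resolution_relevance(labels))
    return np.array(points)


# Synthetic stand-in for an RMSD matrix: pairwise distances of random 3-D points
rng = np.random.default_rng(0)
coords = rng.normal(size=(50, 3))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
curve = curve_from_distance_matrix(dist, n_clusters_list=[1, 5, 10, 25, 50])
```

With a single cluster both entropies are zero, which gives the origin of the curve. Replacing the synthetic matrix with one of the RMSD_{sel}.npy files would be the natural next step, under the assumption that those files store a square, symmetric matrix.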
========
RAW DATA
========
The raw data produced and employed in this study are available on a Google Drive folder at the following address:
https://drive.google.com/drive/folders/1PasAUCgpR5-gdzUVEdyusgZIayQN0Le9
In this folder, together with the compressed Notebooks.zip folder, one can find the compressed folder Data.zip, which contains the following data:
-12-protein_dataset:
-md.mdp: the .mdp file used in the MD simulations
-PROTEIN_PDB_CODE:
-Hk_{sel}.npy & Hs_{sel}.npy: the Relevance (Hk) and Resolution (Hs) curves, sel in [all, CA, CB]
-RMSD_{sel}.npy: the RMSD matrix, sel=[all, CA, CB]
-npt.gro: protein+water+ions structure at the end of the equilibration (NVT+NPT)
-MSR_df.csv: a table containing the following columns:
'area': area under the Relevance-Resolution curve;
'selection': the atomic selection (['all', 'CA', 'CB']) used to compute the RMSD matrix on which the clustering (and consequently the Relevance-Resolution curve) is based;
'method': the linkage measure used in the clustering procedure, an integer in [0,6];
'method_name': the linkage measure used in the clustering procedure, a string in ['average','ward','complete','single','centroid','median','weighted'];
'rmsd_mean': the mean of the RMSD along the trajectory, computed with respect to the first frame;
'rmsd_var': the variance of the RMSD along the trajectory, computed with respect to the first frame;
'rgy_mean': the mean of the radius of gyration along the trajectory;
'rgy_var': the variance of the radius of gyration along the trajectory;
'rmsf_mean': the mean of the RMSF;
'rmsf_var': the variance of the RMSF;
'RMSD_M_mean': the mean of the RMSD matrix;
'RMSD_M_var': the variance of the RMSD matrix.
-Random:
-curves.npy: 100K random Relevance-Resolution curves for M=40001
-curves_s.npy: 100K random Relevance-Resolution curves for M=15000
-validation_dataset:
-antibody:
-Hk_CB.npy & Hs_CB.npy: the Relevance (Hk) and Resolution (Hs) curves
-RMSD_CB.npy: the RMSD matrix
-DIFF_{M}.npy: the eigenvalues/eigenvectors of the 10-D diffusion space
-Label_{method}.npy: the label vector for n_clusters
-1ake:
-Hk_{sel}.npy & Hs_{sel}.npy: the Relevance (Hk) and Resolution (Hs) curves
-RMSD_{sel}.npy: the RMSD matrix
-DIFF_{M}.npy: the eigenvalues/eigenvectors of the 10-D diffusion space
-Label_{method}.npy: the label vector for n_clusters
-intra_{m}.npy: the intra-cluster covariance matrix
-inter_cov_{m}.npy: the inter-cluster covariance matrix
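The files listed above can be read with numpy. The following sketch loads one pair of Hs/Hk files and integrates the curve with the trapezoidal rule, which is one way to obtain an area under the Relevance-Resolution curve. It assumes that each file stores a 1-D array and that the two arrays have the same length; the helper names are hypothetical, and whether the 'area' column of MSR_df.csv was computed in exactly this way (for example, with or without normalised entropies) is not stated here. Small synthetic arrays written to a temporary folder stand in for the real data.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_curve(folder, sel):
    """Load the Resolution (Hs) and Relevance (Hk) arrays for one atom selection."""
    folder = Path(folder)
    hs = np.load(folder / f"Hs_{sel}.npy")
    hk = np.load(folder / f"Hk_{sel}.npy")
    return hs, hk


def area_under_curve(hs, hk):
    """Trapezoidal area of Hk as a function of Hs (points sorted by Hs first)."""
    hs = np.asarray(hs, dtype=float)
    hk = np.asarray(hk, dtype=float)
    order = np.argsort(hs)
    x, y = hs[order], hk[order]
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


# Synthetic example: a triangular curve with base 1 and height 1
with tempfile.TemporaryDirectory() as tmp:
    np.save(Path(tmp) / "Hs_CA.npy", np.array([0.0, 0.5, 1.0]))
    np.save(Path(tmp) / "Hk_CA.npy", np.array([0.0, 1.0, 0.0]))
    hs, hk = load_curve(tmp, "CA")
    area = area_under_curve(hs, hk)
```

To use it on the real data, pass the path of one PROTEIN_PDB_CODE folder and one of the selections all, CA or CB to load_curve.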
NOTE
=====
The cluster distance matrices for adenylate kinase (1ake) and the antibody were computed with the script Cluster_distance_matrix.py.
These matrices are not included in the dataset because of their large size; they are, however, available upon request.
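Since the cluster distance matrices are not distributed, the sketch below shows one way such a matrix could be rebuilt from an RMSD matrix and a label vector. It is not the Cluster_distance_matrix.py script: the function name is hypothetical, and defining the distance between two clusters as the average pairwise RMSD between their members is an assumption made for this example.

```python
import numpy as np


def cluster_distance_matrix(dist, labels):
    """Average pairwise distance between the members of every pair of clusters.

    dist   : (M, M) symmetric matrix of frame-to-frame distances
    labels : length-M vector of cluster labels
    Returns the sorted unique labels and an (n, n) matrix of mean distances.
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels).ravel()
    clusters, inverse = np.unique(labels, return_inverse=True)
    # One-hot membership matrix: A[i, c] = 1 if frame i belongs to cluster c
    A = np.zeros((labels.size, clusters.size))
    A[np.arange(labels.size), inverse] = 1.0
    sums = A.T @ dist @ A            # summed distances for each cluster pair
    sizes = A.sum(axis=0)            # number of frames in each cluster
    pairs = np.outer(sizes, sizes)   # number of frame pairs for each cluster pair
    # Diagonal entries average over all ordered pairs, self-pairs included
    return clusters, sums / pairs


# Four frames on a line, forming two well-separated clusters
positions = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(positions[:, None] - positions[None, :])
clusters, D = cluster_distance_matrix(dist, [0, 0, 1, 1])
```

In this toy case the distance between the two clusters is 10, the mean of the four cross-cluster distances 10, 11, 9 and 10.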
Created:
2022-05-17



