Large-scale discovery, analysis, and design of protein energy landscapes

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14983480

Resource description:

*** IMPORTANT! Please register your use of these data so that we can continue to release new useful datasets! This will take 10 seconds! ***

This repository contains datasets generated for our study on protein energy landscapes using our multiplex hydrogen-deuterium exchange (mHDX) analysis. The datasets include raw and processed HDX data, NMR results, curated subsets, and machine learning splits with interpretable and deep learning-derived features. These resources support various analyses, including protein stability assessment, EX1 kinetics evaluation, and predictive modeling.

Available datasets:

Dataset_0_InitialOrder: Initial DNA sequences from all libraries (15,715 unique sequences).
Dataset_1_UnfilteredData: Minimally filtered HDX data based on confident identifications and PO score < 50 (8,293 unique sequences).
Dataset_2_SuccessfulHDX: Proteins passing quality control metrics, including EX1 kinetics (5,778 unique sequences).
Dataset_3_MeasurablyStable: Proteins reaching full deuteration with ΔGunfold > 2 kcal/mol and passing the EX1 kinetics filter (3,590 unique sequences).
Dataset_4_HDXNMR: HDX-NMR results per condition, including average ΔGopen per position (16 unique sequences).
Dataset_5_MesophilicThermophilic: Subset of proteins from natural domains classified as mesophilic or thermophilic based on optimal growth temperature (>40°C) (1,637 unique sequences).
Dataset_6_splits_interpretable: Machine learning splits with interpretable features (3,193 unique sequences).
Dataset_6_splits_esm2: Machine learning splits with ESM2-derived features (3,465 unique sequences).
Dataset_6_splits_unirep: Machine learning splits with Unirep-derived features (3,465 unique sequences).
Dataset_6_splits_saprot: Machine learning splits with SaProt-derived features (3,465 unique sequences).
Dataset_7_mHDX_cDNA: Subset of Dataset_2 (best PO-scored candidate, EX1 kinetics excluded) overlapping with cDNA proteolysis assay data from Tsuboyama et al. (2023) (4,464 unique sequences).
Dataset_8_PDFs: Comprehensive plots generated using the mhdx_pipeline and hdxrate_pipeline, visualizing time-dependent mass distributions and fits to exchange rates. A Jupyter notebook is included to facilitate navigation. (Note: This dataset is split into eight parts for uploading purposes, .zip_part_aa through .zip_part_ah. Please concatenate the parts before unzipping.)
Dataset_9_AlphaFoldModels: Rosetta-relaxed AlphaFold 2 models for the sequences in Dataset_2_SuccessfulHDX (5,778 unique sequences).
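
The note on Dataset_8_PDFs asks for the eight parts to be concatenated before unzipping. A minimal Python sketch of that step follows. The file-name prefix `Dataset_8_PDFs.zip_part_` and the output name `Dataset_8_PDFs.zip` are assumptions inferred from the suffixes quoted above, not confirmed names from the repository; adjust them to match the downloaded files. The helper `concatenate_parts` is illustrative and not part of any pipeline mentioned here.

```python
import shutil
from pathlib import Path


def concatenate_parts(directory, prefix, output_name):
    """Join split-archive parts into one file, in lexicographic suffix order.

    `prefix` and `output_name` are hypothetical; use the actual downloaded names.
    """
    directory = Path(directory)
    # Suffixes aa, ab, ..., ah sort correctly as plain strings.
    parts = sorted(directory.glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {directory}")
    output = directory / output_name
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream each part so large archives never sit fully in memory.
                shutil.copyfileobj(src, out)
    return output


# Example usage (assumed names; run from the download folder):
#   import zipfile
#   archive = concatenate_parts(".", "Dataset_8_PDFs.zip_part_", "Dataset_8_PDFs.zip")
#   with zipfile.ZipFile(archive) as zf:
#       zf.extractall("Dataset_8_PDFs")
```

On Unix-like systems the same result can usually be obtained by concatenating the parts with `cat` in suffix order; the Python version is shown only because it behaves the same on every platform.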

Created:

2025-03-24