Dataset of Implicit Solvation Protein Energies and Forces

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/13755809
Description:
Most graph neural networks (GNNs) are designed and benchmarked on small-molecule datasets, and few datasets of large, biologically relevant proteins have been constructed. Our custom Dataset of Implicit Solvation Protein Energies and Forces (DISPEF) contains over 200,000 proteins ranging in size from 16 to 1,022 amino acids, along with their implicit solvation free energies (a many-body energy term) and the corresponding forces. DISPEF enables the evaluation and future design of GNNs for large, biologically relevant proteins.

Four main subsets of DISPEF have been constructed. Where a subset is split, the training and testing sets are denoted "tr" and "te", respectively.

DISPEF-S: 81,341 proteins with fewer than 2,000 atoms, split into training and testing sets of 65,072 and 16,269 proteins, respectively.
DISPEF-M: 24,000 proteins with fewer than 400 amino acids (∼6,800 atoms), split into training and testing sets of 19,200 and 4,800 proteins, respectively.
DISPEF-L: 109,108 proteins with more than 6,800 and fewer than 12,500 atoms.
DISPEF-c: 560 randomly selected proteins, spaced roughly evenly in size from 192 to 12,346 atoms.

DISPEF-S assesses the ability of GNNs to produce accurate predictions across a large number of relatively small proteins, while DISPEF-M assesses the scalability of GNNs across a smaller set of proteins of varying size. DISPEF-L assesses the transferability of GNNs to structures significantly larger than those in the training set. Although DISPEF-L could also be used to train GNNs, doing so would likely require a GPU with more than 32 GB of memory. Lastly, the smaller DISPEF-c was constructed to evaluate the computational cost of GNNs.
Overall, the construction of DISPEF enables comprehensive investigation into the accuracy, scalability, and transferability of GNN architectures to large, biologically-relevant proteins.
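
As a quick sanity check on the subset sizes quoted in the description above, the following minimal Python sketch recomputes the train/test fractions and the combined size. The dictionary layout and the `split_summary` helper are illustrative assumptions, not part of the dataset or its tooling. Only the counts come from the description.

```python
# Subset sizes as quoted in the description; the dict layout itself is
# a hypothetical convenience, not a format shipped with the dataset.
SUBSETS = {
    "DISPEF-S": {"total": 81_341, "tr": 65_072, "te": 16_269},
    "DISPEF-M": {"total": 24_000, "tr": 19_200, "te": 4_800},
    "DISPEF-L": {"total": 109_108},
    "DISPEF-c": {"total": 560},
}


def split_summary(subsets):
    """Return {name: (train_fraction, test_fraction)} for subsets with a split."""
    summary = {}
    for name, counts in subsets.items():
        if "tr" not in counts:
            continue  # DISPEF-L and DISPEF-c have no stated train/test split
        # The stated training and testing counts should add up to the total.
        assert counts["tr"] + counts["te"] == counts["total"]
        summary[name] = (
            counts["tr"] / counts["total"],
            counts["te"] / counts["total"],
        )
    return summary


# Sum of subset sizes. A protein that appears in more than one subset
# would be counted once per subset here.
total_proteins = sum(c["total"] for c in SUBSETS.values())

for name, (tr, te) in split_summary(SUBSETS).items():
    print(f"{name}: train {tr:.1%}, test {te:.1%}")
print(f"total entries across subsets: {total_proteins:,}")
```

Both split subsets work out to roughly an 80/20 train/test division, and the subset sizes sum to 215,009 entries, which is consistent with the "over 200,000 proteins" figure.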
Created:
2025-03-04