Revised MD17 dataset (rMD17)
Source: DataCite Commons. Updated 2025-05-01. Indexed 2024-07-28.
Download link:
https://figshare.com/articles/Revised_MD17_dataset_rMD17_/12672038/2
Description:
THE REVISED MD17 dataset:
=========================

Citation:
=========

Anders S. Christensen and O. Anatole von Lilienfeld (2020) "On the role of gradients for machine learning of molecular energies and forces" arXiv:?????

The molecules are taken from the original MD17 dataset by Chmiela et al. For each molecule, 100,000 structures are taken, and the energies and forces are recalculated at the PBE/def2-SVP level of theory using very tight SCF convergence and a very dense DFT integration grid. As such, the dataset is practically free from numerical noise.

One warning: As the structures are taken from a molecular dynamics simulation (i.e. time series data), they are not guaranteed to be independent samples. This is readily evident from the autocorrelation function for the original MD17 dataset.

In short: DO NOT train a model on more than 1000 samples from this dataset. Results already published using 50K samples from the original MD17 dataset should be considered meaningless due to this fact and due to the noise in the original data.

The data:
=========

The ten molecules are saved in NumPy .npz format.

The keys correspond to:

'nuclear_charges' : The nuclear charges for the molecule
'coords' : The coordinates for each conformation (in units of ångstrom)
'energies' : The total energy of each conformation (in units of kcal/mol)
'forces' : The cartesian forces of each conformation (in units of kcal/mol/ångstrom)
'old_indices' : The index of each conformation in the original MD17 dataset
'old_energies' : The energy of each conformation taken from the original MD17 dataset (in units of kcal/mol)
'old_forces' : The forces of each conformation taken from the original MD17 dataset (in units of kcal/mol/ångstrom)

*Note that for Azobenzene, only 99988 samples are available, because 11 DFT calculations failed and the original dataset only contained 99999 structures.

Data splits:
============

Five training and test splits are saved in CSV format containing the corresponding indices.
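The key layout and the index-based splits described above can be sketched in Python as follows. This is a minimal illustration, not the authors' own loader: the helper names (`load_rmd17`, `load_split_indices`, `select`) are hypothetical, any file path passed in is a placeholder, and the code assumes that every per-conformation array is indexed along its first axis and that each split CSV holds plain integer indices.

```python
import numpy as np

# Keys listed in the dataset description.
EXPECTED_KEYS = (
    "nuclear_charges", "coords", "energies", "forces",
    "old_indices", "old_energies", "old_forces",
)


def load_rmd17(path):
    """Load one molecule's .npz archive into a dict of arrays.

    `path` is a placeholder for wherever the archive was downloaded to.
    """
    with np.load(path) as archive:
        missing = [k for k in EXPECTED_KEYS if k not in archive.files]
        if missing:
            raise KeyError(f"archive is missing keys: {missing}")
        return {k: archive[k] for k in EXPECTED_KEYS}


def load_split_indices(csv_path):
    """Read a split file of integer indices (assumed: integers only)."""
    return np.loadtxt(csv_path, dtype=int, delimiter=",", ndmin=1).ravel()


def select(data, indices):
    """Restrict the per-conformation arrays to the given indices.

    Assumption: the first axis of every array except 'nuclear_charges'
    runs over conformations.
    """
    indices = np.asarray(indices, dtype=int)
    subset = {"nuclear_charges": data["nuclear_charges"]}
    for key in EXPECTED_KEYS:
        if key != "nuclear_charges":
            subset[key] = data[key][indices]
    return subset
```

A typical use would be `train = select(load_rmd17(npz_path), load_split_indices(train_csv_path))`, where both paths point at locally downloaded files. Energies then come out in kcal/mol and forces in kcal/mol/ångstrom, as stated in the key list.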
Provider:
figshare
Created:
2020-07-18
Summary
Dataset introduction

Background and challenges
Background overview
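The Revised MD17 dataset (rMD17) is a revised molecular dynamics dataset containing about 100,000 structures for each of 10 molecules, with energies and forces recalculated at the PBE/def2-SVP level of theory for high numerical precision. The data are stored in NumPy .npz format and include nuclear charges, coordinates, energies, and forces. Because the samples are not independent of one another, training on no more than 1000 samples is recommended.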
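The sample-size recommendation and the dependence between samples can be illustrated with a short Python sketch. Both helpers are hypothetical and are not part of the dataset: `draw_training_indices` refuses requests above the 1000-sample limit, and `autocorrelation` computes a normalized autocorrelation for a series that is assumed to be in time order already (for example, energies sorted by their position in the original trajectory).

```python
import numpy as np


def draw_training_indices(n_available, n_train=1000, max_train=1000, seed=0):
    """Draw distinct training indices, enforcing the 1000-sample limit."""
    if n_train > max_train:
        raise ValueError(f"n_train={n_train} exceeds the limit of {max_train}")
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_available, size=n_train, replace=False))


def autocorrelation(series, max_lag):
    """Normalized autocorrelation of a time-ordered 1-D series.

    Returns an array of length max_lag + 1 whose first entry (lag 0) is 1.
    Values that decay slowly with lag indicate dependent samples.
    """
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    n = len(x)
    return np.array(
        [np.dot(x[: n - lag], x[lag:]) / denom for lag in range(max_lag + 1)]
    )
```

Random selection does not remove the underlying correlation between neighbouring trajectory frames. It only keeps the training set within the recommended size.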