Revised MD17 dataset (rMD17)
Source: DataCite Commons. Updated 2026-02-28; indexed 2024-08-17.
Download link:
https://figshare.com/articles/dataset/Revised_MD17_dataset_rMD17_/12672038/3
Resource description:
The revised MD17 dataset:
=========================

Citation:
=========

Anders S. Christensen and O. Anatole von Lilienfeld (2020) "On the role of gradients for machine learning of molecular energies and forces" https://arxiv.org/abs/2007.09593

The molecules are taken from the original MD17 dataset by Chmiela et al. For each molecule, 100,000 structures are taken, and the energies and forces are recalculated at the PBE/def2-SVP level of theory using very tight SCF convergence and a very dense DFT integration grid. As such, the dataset is practically free from numerical noise.

One warning: as the structures are taken from a molecular dynamics simulation (i.e. time series data), they are not guaranteed to be independent samples. This is readily seen in the autocorrelation function of the original MD17 dataset.

In short: DO NOT train a model on more than 1000 samples from this dataset. Results already published with 50K samples on the original MD17 dataset should be considered meaningless due to this fact and due to the noise in the original data.

The data:
=========

The ten molecules are saved in NumPy .npz format.

The keys correspond to:

'nuclear_charges' : The nuclear charges for the molecule
'coords' : The coordinates for each conformation (in units of ångström)
'energies' : The total energy of each conformation (in units of kcal/mol)
'forces' : The Cartesian forces of each conformation (in units of kcal/mol/ångström)
'old_indices' : The index of each conformation in the original MD17 dataset
'old_energies' : The energy of each conformation taken from the original MD17 dataset (in units of kcal/mol)
'old_forces' : The forces of each conformation taken from the original MD17 dataset (in units of kcal/mol/ångström)

*Note that for azobenzene, only 99,988 samples are available, because 11 DFT calculations failed and the original dataset contained only 99,999 structures.

Data splits:
============

Five training and test splits are saved in CSV format containing the corresponding indices.
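As a rough illustration of how the archives and split files described above might be read, the following Python sketch loads one molecule and selects the conformations named by a split. The seven keys come from the description. Everything else is an assumption made for illustration: the helper names `load_molecule`, `load_indices` and `select`, the file names in the usage comment, the idea that each CSV holds one integer index per line, and the idea that every array except 'nuclear_charges' has the conformation as its first axis.

```python
import numpy as np

# Keys listed in the dataset description.
EXPECTED_KEYS = (
    "nuclear_charges", "coords", "energies", "forces",
    "old_indices", "old_energies", "old_forces",
)


def load_molecule(npz_path):
    """Load one molecule archive into a plain dict of arrays."""
    with np.load(npz_path) as archive:
        missing = [k for k in EXPECTED_KEYS if k not in archive.files]
        if missing:
            raise KeyError(f"missing keys in {npz_path}: {missing}")
        return {k: archive[k] for k in EXPECTED_KEYS}


def load_indices(csv_path):
    """Read a split file (assumed layout: one integer index per line)."""
    return np.loadtxt(csv_path, dtype=int, delimiter=",", ndmin=1).ravel()


def select(data, indices):
    """Pick per-conformation arrays by index.

    Assumes every array except 'nuclear_charges' is indexed by
    conformation along its first axis.
    """
    picked = {k: data[k][indices] for k in EXPECTED_KEYS if k != "nuclear_charges"}
    picked["nuclear_charges"] = data["nuclear_charges"]
    return picked


# Usage sketch (hypothetical file names, adjust to the downloaded archive):
#   data = load_molecule("rmd17_aspirin.npz")
#   train = select(data, load_indices("index_train_01.csv"))
```

Returning a plain dict rather than the open archive keeps the arrays usable after the file is closed. Whether the indices in the CSV files are zero-based should be checked against the downloaded data before relying on this sketch.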
Provider:
figshare
Created:
2020-07-21
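Collected summary
Dataset introduction

Background and challenges
Background overview
The Revised MD17 dataset (rMD17) is a revised molecular dynamics dataset based on the original MD17 dataset. It covers ten molecules with 100,000 structures each (99,988 for azobenzene), with energies and forces recalculated at the PBE/def2-SVP level of theory to reduce numerical noise. Because the structures come from molecular dynamics simulations, the samples are not independent, so the authors advise training on no more than 1000 samples. The data are provided in NumPy .npz format, together with five training and test splits stored as CSV index files.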
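The non-independence of the samples and the 1000-sample cap can be made concrete with a small Python sketch. It estimates a normalized autocorrelation function for a one-dimensional series and draws a training subset that refuses to exceed the cap. The function names `autocorrelation` and `capped_subset`, the constant `MAX_TRAIN`, and the choice of uniform random sampling are assumptions for illustration and are not taken from the dataset authors. Applying the estimator to the energies also assumes they have first been put in trajectory order, for example by sorting on 'old_indices'.

```python
import numpy as np

MAX_TRAIN = 1000  # cap stated in the dataset description


def autocorrelation(series, max_lag):
    """Normalized autocorrelation of a 1-D series for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    if not 0 <= max_lag < x.size:
        raise ValueError("max_lag must be in [0, len(series) - 1]")
    x = x - x.mean()
    var = np.dot(x, x)
    if var == 0.0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return np.array(
        [np.dot(x[: x.size - lag], x[lag:]) / var for lag in range(max_lag + 1)]
    )


def capped_subset(n_available, n_train, seed=0):
    """Draw n_train distinct indices at random (an assumed sampling scheme),
    refusing any request above MAX_TRAIN."""
    if n_train > MAX_TRAIN:
        raise ValueError(f"n_train={n_train} exceeds the {MAX_TRAIN}-sample cap")
    rng = np.random.default_rng(seed)
    return rng.choice(n_available, size=n_train, replace=False)
```

An autocorrelation that stays close to 1 over many lags indicates that neighbouring frames carry nearly the same information. The split files shipped with the dataset remain the reference choice for benchmarking; the random draw here is only a stand-in.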
The content above was collected and summarized by 遇见数据集.



