Materials Project Trajectory (MPtrj) Dataset
Source: DataCite Commons. Updated 2025-06-01. Indexed 2024-08-18.
Download link:
https://figshare.com/articles/dataset/Materials_Project_Trjectory_MPtrj_Dataset/23713842/2
Description:
This data file is the MPtrj dataset. The JSON file contains 1,580,395 structures, 1,580,395 energies, 7,944,833 magnetic moments, 49,295,660 forces, and 14,223,555 stresses that were used to train the pretrained CHGNet.

The structures and labels are parsed from all the GGA/GGA+U static/relaxation trajectories in the 2022.9 version of the Materials Project, using a selection method that avoids incompatible calculations and duplicated structures.

The format of the JSON file looks like this:
MPtrj
-'mp-id-0'
  -'frame-id-0'
    -'structure': dictionary of pymatgen.core.Structure
    -'uncorrected_total_energy': [eV] raw energy from VASP output
    -'corrected_total_energy': [eV] VASP total energy after MP2020 compatibility
    -'energy_per_atom': [eV/atom] corrected energy per atom; this is the energy label used to train CHGNet
    -'ef_per_atom': [eV/atom] formation energy per atom
    -'e_per_atom_relaxed': [eV/atom] corrected energy per atom of the relaxed structure; this is the energy you can find for the mp-id on the Materials Project website
    -'ef_per_atom_relaxed': [eV/atom] formation energy per atom of the relaxed structure
    -'force': [eV/Å] force on the atoms
    -'stress': [kBar] stress on the cell
    -'magmom': [μB] magmom on the atoms
    -'bandgap': [eV] bandgap
  -'frame-id-1'...
-'mp-id-1'...

Notes:
1. The frame ID has the syntax 'task_id-calc_id-ionic_step', where 'calc_id' is 0 (second relaxation) or 1 (first relaxation) in the double relaxation process of each Materials Project relaxation task.
2. MPtrj is a diverse dataset that contains both GGA and GGA+U calculations, which have different energy values, so MP2020 compatibility is applied to the raw VASP energies to make GGA and GGA+U universally compatible. The 'energy_per_atom' (after MP2020 correction) is the label used for pretrained CHGNet training. See: https://pymatgen.org/pymatgen.entries.html#pymatgen.entries.compatibility.Compatibility
3. Some MAGMOM labels are missing in MPtrj; these are stored as None. A None label does not mean the MAGMOM is 0. CHGNet is trained on the absolute value of the DFT magmom, that is, the absolute value of the labels contained in MPtrj. The unit conversion is automatic if you use the dataset we provide, see: https://github.com/CederGroupHub/chgnet/blob/main/chgnet/data/dataset.py
4. The stress values in the MPtrj JSON are raw stress values from VASP. CHGNet outputs stress in GPa, which is -0.1 * the raw VASP stress in the MPtrj dataset. The unit conversion is also implemented in the CHGNet dataset, so you don't have to convert the VASP stress unit when passing the values to the dataset object.

Reference:
If you use CHGNet or the MPtrj dataset, please cite:
@article{deng_2023_chgnet,
title={CHGNet as a pretrained universal neural network potential for charge-informed atomistic modelling},
DOI={10.1038/s42256-023-00716-3},
journal={Nature Machine Intelligence},
author={Deng, Bowen and Zhong, Peichen and Jun, KyuJung and Riebesell, Janosh and Han, Kevin and Bartel, Christopher J. and Ceder, Gerbrand},
year={2023},
pages={1–11}
}
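The nested layout described above (mp-id, then frame-id, then labels) and the frame ID syntax can be walked with plain Python. The following is a minimal sketch rather than the authors' loader: the sample record, its values, and the helper names `parse_frame_id` and `iter_frames` are hypothetical. It also assumes that a task ID may itself contain a hyphen (for example an ID of the form "mp-<number>"), which is why the frame ID is split from the right.

```python
import json

# Tiny hand-written stand-in for the real file. All IDs and values are made up
# for illustration; only the key layout follows the dataset description.
sample_text = json.dumps({
    "mp-0000": {
        "mp-0000-1-0": {"energy_per_atom": -1.0, "magmom": None},
        "mp-0000-0-3": {"energy_per_atom": -1.2, "magmom": [0.5]},
    }
})


def parse_frame_id(frame_id):
    """Split 'task_id-calc_id-ionic_step' into its three parts.

    Splitting from the right keeps any hyphen inside task_id intact
    (assumption: calc_id and ionic_step are plain integers).
    """
    task_id, calc_id, ionic_step = frame_id.rsplit("-", 2)
    return task_id, int(calc_id), int(ionic_step)


def iter_frames(mptrj):
    """Yield (mp_id, frame_id, frame_dict) for every frame in the nested dict."""
    for mp_id, frames in mptrj.items():
        for frame_id, frame in frames.items():
            yield mp_id, frame_id, frame


mptrj = json.loads(sample_text)  # for the real file, use json.load on an open file handle
rows = []
for mp_id, frame_id, frame in iter_frames(mptrj):
    task_id, calc_id, ionic_step = parse_frame_id(frame_id)
    rows.append((mp_id, task_id, calc_id, ionic_step, frame["energy_per_atom"]))
```

Each entry in `rows` pairs a material with one trajectory frame. Since 'calc_id' is 1 for the first relaxation and 0 for the second, frames of one task could be put in chronological order by sorting on descending 'calc_id' and then ascending 'ionic_step'.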
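The label conventions in notes 3 and 4 (absolute magmoms with missing labels kept as None, and stress in GPa equal to -0.1 times the raw VASP stress in kBar) can be sketched by hand as follows. This is only an illustration of the arithmetic, not the CHGNet dataset code, which already performs the conversion. The function names are hypothetical, and the 3x3 nested-list shape of the stress is an assumption, though it is consistent with the stated counts (14,223,555 stresses is 9 times 1,580,395 structures).

```python
def vasp_stress_to_gpa(stress_kbar):
    """Convert a raw VASP stress (kBar, assumed 3x3 nested list) to GPa.

    Applies the factor stated in the dataset notes: GPa value = -0.1 * raw value.
    """
    return [[-0.1 * value for value in row] for row in stress_kbar]


def magmom_targets(magmom):
    """Return absolute magmoms per atom, keeping a missing label as None.

    A missing label must stay None: it does not mean the magmom is 0.
    """
    if magmom is None:
        return None
    return [abs(m) for m in magmom]


# Made-up example values, for illustration only.
raw_stress = [[10.0, 0.0, 0.0], [0.0, 20.0, 0.0], [0.0, 0.0, -5.0]]
stress_gpa = vasp_stress_to_gpa(raw_stress)
targets = magmom_targets([-2.5, 0.25])
missing = magmom_targets(None)
```

Keeping None distinct from 0 lets a training loop skip the magmom term for frames without labels instead of fitting a false zero.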
Provider:
figshare
Created:
2023-08-25
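Collected summary
Dataset introduction

Background and challenges
Background overview
The Materials Project Trajectory (MPtrj) dataset is a large-scale materials science dataset used to train the pretrained CHGNet model. It contains about 1.58 million structures with their associated energies, forces, stresses, magnetic moments, and band gaps, drawn from the GGA/GGA+U calculation trajectories of the Materials Project. MP2020 compatibility is applied to the energies so that GGA and GGA+U results are consistent, and the dataset description documents the file format and the unit conversions. The dataset supports research on charge-informed atomistic modelling.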
The summary above was collected and generated by 遇见数据集.



