
WyFormer generated structures

DataCite Commons, updated 2025-05-25, indexed 2025-09-08
Download link:
https://figshare.com/articles/dataset/WyFormer_generated_structures/29094701
Resource description:
WyFormer generated datasets

Structures generated by WyFormer, with various post-processing. Used in the ICML 2025 paper "Wyckoff Transformer: Generation of Symmetric Crystals".

The folder structure is the following: the top-level folder is the dataset that was used for training WyFormer (only its train and validation parts were used). The subsequent folder levels correspond to transformations of the data.

<i>mp_20/WyckoffTransformer</i> 10k formally valid Wyckoff representations generated by WyFormer trained on the MP-20 dataset.

<i>mp_20/WyckoffTransformer/DiffCSP++10k</i> 9999 structures obtained with DiffCSP++; it failed for one Wyckoff representation, and we consider this structure unstable. <b>Can be considered as the "official" WyFormer sample.</b>

<i>mp_20/WyckoffTransformer/DiffCSP++10k/CHGNet_free/DFT</i> CHGNet pre-relaxation followed by DFT relaxation; for some structures the DFT relaxation failed, and we consider them unstable. The relaxation was obtained using the MP-compatible <code>MPGGADoubleRelaxStaticMaker</code>. Note that the material indices were unfortunately permuted at the CHGNet pre-relaxation step. Used in Table 1. <b>Can be considered as the "official" WyFormer DFT-relaxed sample.</b>

<i>mp_20/WyckoffTransformer/DiffCSP++10k/CHGNet_free/DFT-GGA-relax-1</i> same as above, but relaxed with a single invocation of <code>MPRelaxSet</code>. This is less precise and not strictly compatible with Materials Project, but it is the same procedure as reported in the FlowMM paper and code. Used in Table 1.

<i>mp_20/WyckoffTransformer/DiffCSP++/</i> 1k structures obtained with DiffCSP++.

<i>mp_20/WyckoffTransformer/DiffCSP++/DFT/</i> DFT relaxation of 105 <i>novel and unique</i> structures, using <code>MPGGADoubleRelaxStaticMaker</code>.

<i>mp_20/WyckoffTransformer/CrySPR/CHGNet_fix/</i> 1k structures obtained with CrySPR and CHGNet, <i>with a constraint during the relaxation that maintained the Wyckoff positions</i>.

<i>mp_20/WyckoffTransformer/CrySPR/CHGNet_fix/DFT/</i> DFT relaxation of 105 <i>novel and unique</i> structures, using <code>MPGGADoubleRelaxStaticMaker</code>.

<i>mpts_52/WyckoffTransformer/CrySPR/CHGNet_fix</i> 1k structures generated with WyFormer trained on the MPTS-52 dataset, then processed with CrySPR and CHGNet, <i>with a constraint during the relaxation that maintained the Wyckoff positions</i>.

Format description

<code>structure</code> - <code>pymatgen.core.structure.Structure</code>

<code>group</code>, <code>species</code>, <code>numIons</code>, <code>sites</code> - arguments to <code>pyxtal.from_random</code>. For <code>*/WyckoffTransformer/data.csv.gz</code> they were generated with WyFormer; for the rest they were obtained from structures with <code>pyxtal.from_seed</code>. Note that the indexing within those fields is by chemical element, not by Wyckoff position.

<code>site_symmetries</code>, <code>elements</code>, <code>multiplicity</code>, <code>wyckoff_letters</code>, <code>sites_enumeration</code>, <code>dof</code> - information about the Wyckoff positions, indexed by Wyckoff position. The <code>dof</code> is the number of degrees of freedom of the Wyckoff position, i.e. the number of free parameters in the Wyckoff position. <code>sites_enumeration</code> enumerates the Wyckoff positions that share the same site symmetry; see the paper for details. For example, for space group <code>2</code> aka <code>P-1</code>, Wyckoff position <code>a</code> has site symmetry <code>-1</code> and enumeration <code>0</code>, while <code>b</code> has site symmetry <code>-1</code> and enumeration <code>1</code>.

<code>sites_enumeration_augmented</code> - possible variants of the enumeration, which depend on the arbitrary choice of the space group Euclidean normalizer, e.g. the unit cell center. See the preprint for details.

<code>smact_validity</code> - "Compositional Validity" computed with SMACT. Not all structures in MP-20 conform to this criterion.

<code>structural_validity</code> - "Structural Validity" introduced by CDVAE: a structure is valid if no two atoms are closer than 0.5 Angstroms.

<code>cdvae_e</code> - energy predicted by the model included in CDVAE, used for the EMD(E) distribution similarity metric.

<code>chgnet_energy_per_atom</code> - energy per atom from the CHGNet relaxation.

<code>chgnet_e_above_hull_corrected</code> - energy above hull from the CHGNet relaxation, taking into account the MP energy correction.

<code>dft_e_uncorrected</code> - raw potential energy from the DFT relaxation.

<code>dft_e_corrected</code> - potential energy from the DFT relaxation, corrected with <code>MaterialsProject2020Compatibility</code>.

<code>dft_e_above_hull_corrected</code> - energy above hull from the DFT relaxation, computed using <code>2023-02-07-ppd-mp.pkl.gz</code> distributed by matbench-discovery as the reference.

<code>entry</code> - <code>pymatgen.entries.computed_entries.ComputedEntry</code> containing the results of the DFT run.

Citation

<pre>@article{kazeev2025wyckoff,
  title={{Wyckoff Transformer: Generation of symmetric crystals}},
  author={Kazeev, Nikita and Nong, Wei and Romanov, Ignat and Zhu, Ruiming and Ustyuzhanin, Andrey and Yamazaki, Shuya and Hippalgaonkar, Kedar},
  journal={arXiv preprint arXiv:2503.02407},
  year={2025}
}</pre>
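The <code>sites_enumeration</code> example for space group <code>P-1</code> can be illustrated with a short sketch. It assumes that the enumeration simply counts Wyckoff positions with the same site symmetry in alphabetical letter order, which matches the <code>a</code> and <code>b</code> values quoted in the format description; the paper remains the reference for the exact definition. The function name <code>enumerate_sites</code> and the hard-coded position table are illustrative and are not part of the dataset or of WyFormer.

```python
from collections import defaultdict

# Wyckoff positions of space group 2 (P-1) as (letter, site symmetry), in letter order.
# Positions a..h are the eight inversion centers; i is the general position.
P_MINUS_1 = [(letter, "-1") for letter in "abcdefgh"] + [("i", "1")]


def enumerate_sites(positions):
    """Map each Wyckoff letter to its index among positions sharing its site symmetry.

    Assumption: positions are counted in the order given (alphabetical here).
    """
    counters = defaultdict(int)
    enumeration = {}
    for letter, site_symmetry in positions:
        enumeration[letter] = counters[site_symmetry]
        counters[site_symmetry] += 1
    return enumeration


enumeration = enumerate_sites(P_MINUS_1)
print(enumeration["a"], enumeration["b"])  # 0 1
```

Under this assumption, a (site symmetry, enumeration) pair identifies a Wyckoff position within its space group without referring to the letter itself.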

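As a minimal sketch of how the validity columns described above might be used, the snippet below reads a gzipped CSV with the Python standard library and keeps only rows that pass both <code>smact_validity</code> and <code>structural_validity</code>. It assumes the boolean columns are serialized as the strings <code>True</code> and <code>False</code>, which the description does not state, and the helper name <code>read_valid_rows</code> is hypothetical. The rows are made up and stand in for a real <code>data.csv.gz</code> file.

```python
import csv
import gzip
import io


def read_valid_rows(fileobj):
    """Yield rows whose SMACT and structural validity flags are both true.

    Assumption: booleans are stored as the strings "True" / "False".
    """
    with gzip.open(fileobj, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            if row["smact_validity"] == "True" and row["structural_validity"] == "True":
                yield row


# Tiny in-memory stand-in for a data.csv.gz file (made-up rows, not real data).
raw = (
    "smact_validity,structural_validity,group\n"
    "True,True,2\n"
    "False,True,225\n"
    "True,False,62\n"
)
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))

valid_rows = list(read_valid_rows(buffer))
print(len(valid_rows))  # 1
```

To read a downloaded file, pass its path instead of the in-memory buffer. <code>pandas.read_csv</code> can also read a <code>.csv.gz</code> file directly if pandas is available.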
Provider:
figshare
Created:
2025-05-21