alexandria

Hugging Face · Updated 2026-01-18 · Indexed 2026-01-19

Download link:

https://huggingface.co/datasets/xpanceo-team/alexandria

Overview:

This dataset is a snapshot of the Alexandria Materials Database, republished under the standardized `crystal-diffusers` schema. The snapshot date is January 16, 2026. It contains computed properties for more than 5 million materials (material ID, chemical formula, energy, energy above hull, band gap, total magnetization, and number of sites), together with each crystal structure as a JSON-serialized string. The data ships as a single train split of 5,068,744 rows and is intended primarily for training and validating machine learning models in materials science.

Created:

2026-01-17

Summary of Original Information

Alexandria Materials Database: Dataset Overview

Basic Information

Dataset name: Alexandria Materials Database
Publisher: xpanceo-team
Dataset URL: https://huggingface.co/datasets/xpanceo-team/alexandria
License: Creative Commons Attribution 4.0 (CC BY 4.0)
Tags: materials-science, crystal-structures
Size category: 1M < n < 10M
Snapshot date: 2026-01-16

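Dataset Content and Structure

The dataset is a snapshot of the Alexandria Materials Database, republished under the standardized crystal-diffusers schema.

Data Instances

Each row corresponds to one material entry, identified by material_id, and holds that entry's computed properties and the associated crystal structure.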

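Data Fields

material_id (string): identifier in the Alexandria database.
formula (string): chemical formula.
energy (float): energy value from the Alexandria snapshot.
e_above_hull (float): energy above the convex hull from the Alexandria snapshot.
band_gap (float): band gap from the Alexandria snapshot.
total_mag (float): total magnetization from the Alexandria snapshot.
n_sites (int): number of atomic sites.
structure (string): pymatgen Structure serialized as a JSON string.

Note: numeric fields are kept exactly as they appear in the Alexandria snapshot. For units and precise definitions, refer to the original Alexandria documentation.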
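The field list above can be turned into a quick sanity check on individual rows. The following is a minimal sketch, not part of the dataset's tooling: the helper name `schema_problems` and the sample row are hypothetical, and the sample values are placeholders rather than real Alexandria data.

```python
from typing import Any, List, Mapping

# Field names and Python types as listed in the dataset card.
EXPECTED_FIELDS = {
    "material_id": str,
    "formula": str,
    "energy": float,
    "e_above_hull": float,
    "band_gap": float,
    "total_mag": float,
    "n_sites": int,
    "structure": str,
}


def _matches(value: Any, expected: type) -> bool:
    """Type check that accepts ints in float columns but never bools."""
    if isinstance(value, bool):
        return False
    if expected is float:
        return isinstance(value, (int, float))
    return isinstance(value, expected)


def schema_problems(row: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in one row; an empty list means OK."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif row[name] is None:
            # Assumption: nulls may occur in the snapshot, so report, don't raise.
            problems.append(f"null value: {name}")
        elif not _matches(row[name], expected):
            problems.append(f"wrong type for {name}: {type(row[name]).__name__}")
    return problems


# Hypothetical row with placeholder values, for illustration only.
sample_row = {
    "material_id": "example-0001",
    "formula": "NaCl",
    "energy": -1.0,
    "e_above_hull": 0.0,
    "band_gap": 1.0,
    "total_mag": 0.0,
    "n_sites": 2,
    "structure": "{}",
}
print(schema_problems(sample_row))
```

Running the same check on a handful of real rows after loading is a cheap way to confirm that the installed `datasets` version decodes the columns to the types listed above.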

Data Splits

Single split:

train: 5,068,744 rows.

Technical Details

Download size: 4,927,116,810 bytes
Dataset size: 14,568,806,338 bytes
Config name: default
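To put these byte counts in perspective, the short calculation below converts them to GiB and derives an average size per row. It uses only the three numbers stated in the card; the variable names are illustrative.

```python
# Figures copied from the dataset card.
DOWNLOAD_BYTES = 4_927_116_810
DATASET_BYTES = 14_568_806_338
NUM_ROWS = 5_068_744

GIB = 2**30  # bytes per gibibyte

download_gib = DOWNLOAD_BYTES / GIB
dataset_gib = DATASET_BYTES / GIB
bytes_per_row = DATASET_BYTES / NUM_ROWS
# Ratio of prepared dataset size to download size.
expansion = DATASET_BYTES / DOWNLOAD_BYTES

print(f"download size : {download_gib:.2f} GiB")
print(f"dataset size  : {dataset_gib:.2f} GiB")
print(f"avg. row size : {bytes_per_row:.0f} bytes")
print(f"expansion     : {expansion:.2f}x")
```

The download comes to roughly 4.6 GiB and the prepared dataset to roughly 13.6 GiB, or a little under 3 KB per row on average. Most of that is presumably the serialized structure string, which is worth keeping in mind before loading every column into memory.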

Usage

Loading the dataset

    from datasets import load_dataset

    # Downloads and prepares the full train split on first use.
    ds = load_dataset("xpanceo-team/alexandria", split="train")
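Because the full split is several gigabytes, it can be convenient to stream rows instead of downloading everything first. The sketch below assumes the `streaming=True` option of `load_dataset` works for this repository, which the card does not state explicitly. The helper `take_low_hull` and its threshold are hypothetical, and the threshold's unit must be checked against the Alexandria documentation.

```python
from itertools import islice
from typing import Any, Iterable, List, Mapping


def open_stream():
    """Open the train split lazily (needs the `datasets` package and network access)."""
    # Local import so the pure-Python helper below stays usable offline.
    from datasets import load_dataset

    return load_dataset("xpanceo-team/alexandria", split="train", streaming=True)


def take_low_hull(
    rows: Iterable[Mapping[str, Any]],
    max_e_above_hull: float,
    limit: int,
) -> List[Mapping[str, Any]]:
    """Collect up to `limit` rows whose e_above_hull is at or below the threshold.

    Rows with a null e_above_hull are skipped (assumption: nulls may occur).
    """
    matches = (
        row
        for row in rows
        if row["e_above_hull"] is not None
        and row["e_above_hull"] <= max_e_above_hull
    )
    # islice stops pulling from the stream once `limit` matches are found.
    return list(islice(matches, limit))
```

A call such as `take_low_hull(open_stream(), max_e_above_hull=0.0, limit=10)` would then read only as many rows as needed to find ten matches.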

Parsing the structure data

    from pymatgen.core import Structure

    # Rebuild a pymatgen Structure from the JSON string in the first row.
    row = ds[0]
    structure = Structure.from_str(row["structure"], fmt="json")
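When only site counts or compositions are needed, the JSON string can also be inspected without pymatgen. This sketch assumes the string follows the dictionary layout that pymatgen's `Structure.as_dict()` produces, with a top-level `"sites"` list whose entries carry a `"species"` list of `{"element", "occu"}` items. The helper name `site_summary` is hypothetical.

```python
import json
from collections import Counter
from typing import Dict, Tuple


def site_summary(structure_json: str) -> Tuple[int, Dict[str, float]]:
    """Return (number of sites, summed occupancy per element) from a structure string."""
    data = json.loads(structure_json)
    counts: Counter = Counter()
    for site in data["sites"]:
        # A site can hold several species when it is partially occupied.
        for species in site["species"]:
            counts[species["element"]] += species["occu"]
    return len(data["sites"]), dict(counts)
```

Comparing the first return value with the row's `n_sites` field, for example `site_summary(row["structure"])[0] == row["n_sites"]`, gives a cheap consistency check before running heavier structure analysis.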

Citation

If you use this dataset, please cite the upstream Alexandria publication and acknowledge this Hugging Face repackaging.

    @article{SCHMIDT2024101560,
      title = {Improving machine-learning models in materials science through large datasets},
      journal = {Materials Today Physics},
      volume = {48},
      pages = {101560},
      year = {2024},
      issn = {2542-5293},
      doi = {https://doi.org/10.1016/j.mtphys.2024.101560},
      url = {https://www.sciencedirect.com/science/article/pii/S2542529324002360},
      author = {Jonathan Schmidt and Tiago F.T. Cerqueira and Aldo H. Romero and Antoine Loew and Fabian Jäger and Hai-Chen Wang and Silvana Botti and Miguel A.L. Marques},
      abstract = {The accuracy of a machine learning model is limited by the quality and quantity of the data available for its training and validation. This problem is particularly challenging in materials science, where large, high-quality, and consistent datasets are scarce. Here we present alexandria, an open database of more than 5 million density-functional theory calculations for periodic three-, two-, and one-dimensional compounds. We use this data to train machine learning models to reproduce seven different properties using both composition-based models and crystal-graph neural networks. In the majority of cases, the error of the models decreases monotonically with the training data, although some graph networks seem to saturate for large training set sizes. Differences in the training can be correlated with the statistical distribution of the different properties. We also observe that graph-networks, that have access to detailed geometrical information, yield in general more accurate models than simple composition-based methods. Finally, we assess several universal machine learning interatomic potentials. Crystal geometries optimised with these force fields are very high quality, but unfortunately the accuracy of the energies is still lacking. Furthermore, we observe some instabilities for regions of chemical space that are undersampled in the training sets used for these models. This study highlights the potential of large-scale, high-quality datasets to improve machine learning models in materials science.}
    }

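Collected Summary

Dataset Introduction

Construction

High-quality datasets are a key driver of machine learning progress in materials science. The Alexandria dataset is a snapshot of a materials database built from density-functional theory (DFT) calculations, covering more than five million periodic three-, two-, and one-dimensional compounds. It is republished under the standardized crystal-diffusers schema, so every row exposes the same set of fields. Each entry carries a material identifier, chemical formula, energy, band gap, and related properties, and stores the crystal structure as a pymatgen Structure JSON string, which lets downstream code reconstruct the structure for analysis.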

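Features

The dataset is notable for its scale: more than five million material entries spanning a broad region of chemical space. Numeric properties such as energy, band gap, and total magnetization are preserved as they appear in the original snapshot, and the serialized crystal structures supply the geometric detail. Because each row combines composition with full structural information, the dataset suits graph-neural-network models, which according to the upstream publication generally yield more accurate property predictions than composition-only methods.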

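Usage

The dataset can be loaded directly with the Hugging Face datasets library, which makes it easy to plug into a machine learning workflow. Once loaded, the crystal structure strings can be parsed with pymatgen for composition analysis and structure manipulation. The data can also be converted to a pandas DataFrame for flexible processing, but at this size memory use needs attention. The dataset is suited to materials discovery, property prediction, and the evaluation of machine learning models.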
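One way to keep memory use bounded is to compute statistics over batches instead of materializing every column at once. The sketch below is an assumption-laden illustration, not a recommended pipeline: it relies on `Dataset.select_columns` and `Dataset.iter(batch_size=...)`, which exist in recent versions of the datasets library and yield each batch as a dictionary of column lists. The helper names `column_mean` and `band_gap_batches` are hypothetical.

```python
from typing import Any, Iterable, Mapping, Optional, Sequence


def column_mean(
    batches: Iterable[Mapping[str, Sequence[Any]]],
    column: str,
) -> Optional[float]:
    """Mean of one numeric column over batches shaped as {column: [values]}.

    Null entries are skipped; returns None if no values were seen.
    """
    total = 0.0
    count = 0
    for batch in batches:
        for value in batch[column]:
            if value is not None:
                total += value
                count += 1
    return total / count if count else None


def band_gap_batches(batch_size: int = 50_000):
    """Yield band_gap batches from the train split (downloads data on first use)."""
    # Local import so column_mean can be used and tested without `datasets`.
    from datasets import load_dataset

    ds = load_dataset("xpanceo-team/alexandria", split="train")
    # Dropping the large `structure` column keeps each batch small.
    return ds.select_columns(["band_gap"]).iter(batch_size=batch_size)
```

With both pieces in place, `column_mean(band_gap_batches(), "band_gap")` walks the column one batch at a time. If a DataFrame is still needed, converting only the scalar columns is likely to be far lighter than converting the structure strings as well.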

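Background and Challenges

Background

The combination of high-throughput computation and machine learning is changing how new materials are discovered. The Alexandria Materials Database, described by Jonathan Schmidt and co-authors in a 2024 publication, grew out of this shift: it provides more than five million density-functional theory calculations as a large, consistent training base for machine learning models. The database covers periodic three-, two-, and one-dimensional compounds. Its central aim is to address the scarcity of large, high-quality, consistent datasets in materials science and thereby improve the accuracy and reliability of models that predict material properties, including crystal-graph neural networks.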

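Current Challenges

The dataset targets material property prediction, where the difficulty lies in accurately modeling quantities such as energy, band gap, and magnetization, all of which depend strongly on crystal structure and electronic correlation effects. Building it meant consolidating a very large number of DFT results while keeping them mutually consistent and physically sound, and coping with uneven coverage across dimensionalities and regions of chemical space. Serializing every crystal structure to a standardized JSON format that stays compatible with pymatgen adds an engineering constraint of its own.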

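Common Scenarios

Classic Use Cases

The dataset offers a large, consistent training base for machine learning in materials science. Its classic use cases are crystal structure prediction and property modeling: researchers train crystal-graph neural networks or composition-based models on the millions of DFT results to predict properties such as energy, band gap, and magnetization. By pairing structural information with computed properties, the dataset supports high-throughput materials screening and design.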

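Practical Applications

In practice, the dataset can support the development and optimization of new materials. Models trained on it can pre-screen candidate materials by predicted stability and electronic properties before experiments or more expensive calculations are committed to, which may shorten development cycles and reduce cost. In energy storage, for example, such models could help shortlist battery electrode candidates; in semiconductors, they could inform band-gap engineering for optoelectronic devices.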

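Derived and Related Work

Work related to the dataset centers on machine-learned force fields and on models for materials discovery. The upstream publication assesses several universal machine learning interatomic potentials for crystal structure optimization: the optimized geometries are of very high quality, but the accuracy of the energies is still lacking, and instabilities appear in regions of chemical space that are undersampled in those models' training sets. The same study compares crystal-graph neural networks with composition-based models and relates differences in training behavior, including the saturation of some graph networks at large training set sizes, to the statistical distribution of each property.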

The content above was collected and summarized by 遇见数据集.