alexandria
收藏Alexandria Materials Database 数据集概述
数据集基本信息
- 数据集名称: Alexandria Materials Database
- 发布者: xpanceo-team
- 数据集地址: https://huggingface.co/datasets/xpanceo-team/alexandria
- 许可证: Creative Commons Attribution 4.0 (CC BY 4.0)
- 标签: materials-science, crystal-structures
- 数据规模: 1M < n < 10M
- 快照日期: 2026-01-16
数据集内容与结构
该数据集是 Alexandria 材料数据库的一个快照,按照标准化的 crystal-diffusers 模式重新发布。
数据实例
每个数据行对应一个由 material_id 标识的材料条目,包含计算属性和相关的晶体结构。
数据字段
material_id(string): Alexandria 数据库中的标识符。formula(string): 化学式。energy(float): Alexandria 快照中的能量值。e_above_hull(float): Alexandria 快照中的 hull 能量。band_gap(float): Alexandria 快照中的带隙。total_mag(float): Alexandria 快照中的总磁化强度。n_sites(int): 原子位点数。structure(string): pymatgenStructureJSON 序列化字符串。
注意: 数值字段均按 Alexandria 快照原样保留。具体单位/定义请参考原始 Alexandria 文档。
数据划分
单一划分:
train: 包含 5,068,744 个样本。
技术详情
- 下载大小: 4,927,116,810 字节
- 数据集大小: 14,568,806,338 字节
- 配置名称: default
使用方法
加载数据集
python from datasets import load_dataset ds = load_dataset("xpanceo-team/alexandria", split="train")
解析结构数据
python from pymatgen.core import Structure row = ds[0] structure = Structure.from_str(row["structure"], fmt="json")
引用
若使用此数据集,请引用上游 Alexandria 出版物并注明此 Hugging Face 重新打包版本。 bibtex @article{SCHMIDT2024101560, title = {Improving machine-learning models in materials science through large datasets}, journal = {Materials Today Physics}, volume = {48}, pages = {101560}, year = {2024}, issn = {2542-5293}, doi = {https://doi.org/10.1016/j.mtphys.2024.101560}, url = {https://www.sciencedirect.com/science/article/pii/S2542529324002360}, author = {Jonathan Schmidt and Tiago F.T. Cerqueira and Aldo H. Romero and Antoine Loew and Fabian Jäger and Hai-Chen Wang and Silvana Botti and Miguel A.L. Marques}, abstract = {The accuracy of a machine learning model is limited by the quality and quantity of the data available for its training and validation. This problem is particularly challenging in materials science, where large, high-quality, and consistent datasets are scarce. Here we present alexandria, an open database of more than 5 million density-functional theory calculations for periodic three-, two-, and one-dimensional compounds. We use this data to train machine learning models to reproduce seven different properties using both composition-based models and crystal-graph neural networks. In the majority of cases, the error of the models decreases monotonically with the training data, although some graph networks seem to saturate for large training set sizes. Differences in the training can be correlated with the statistical distribution of the different properties. We also observe that graph-networks, that have access to detailed geometrical information, yield in general more accurate models than simple composition-based methods. Finally, we assess several universal machine learning interatomic potentials. Crystal geometries optimised with these force fields are very high quality, but unfortunately the accuracy of the energies is still lacking. Furthermore, we observe some instabilities for regions of chemical space that are undersampled in the training sets used for these models. This study highlights the potential of large-scale, high-quality datasets to improve machine learning models in materials science.} }




