
xpanceo-team/alexandria

Source: Hugging Face. Updated 2026-01-18. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/xpanceo-team/alexandria
Resource description:
---
dataset_info:
  features:
  - name: material_id
    dtype: large_string
  - name: formula
    dtype: large_string
  - name: energy
    dtype: float64
  - name: e_above_hull
    dtype: float64
  - name: band_gap
    dtype: float64
  - name: total_mag
    dtype: float64
  - name: n_sites
    dtype: uint32
  - name: structure
    dtype: large_string
  splits:
  - name: train
    num_bytes: 14568806338
    num_examples: 5068744
  download_size: 4927116810
  dataset_size: 14568806338
license: cc-by-4.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- materials-science
- crystal-structures
pretty_name: Alexandria Materials Database
size_categories:
- 1M<n<10M
---

# Dataset Card for xpanceo-team/alexandria

## Dataset Summary

This dataset is a snapshot of the **Alexandria** materials database, republished with a standardized `crystal-diffusers` schema.

Snapshot date: **2026-01-16**.

The `structure` column stores **pymatgen `Structure` JSON** serialized as a string.

## Preprocessing (TBD)

...

## Dataset Structure

### Data Instances

Each row corresponds to one material entry identified by `material_id`, with computed properties and an associated crystal structure.

### Data Fields

- `material_id` (string): Alexandria identifier.
- `formula` (string): Chemical formula.
- `energy` (float): Energy value from the Alexandria snapshot.
- `e_above_hull` (float): Energy above hull from the Alexandria snapshot.
- `band_gap` (float): Band gap from the Alexandria snapshot.
- `total_mag` (float): Total magnetization from the Alexandria snapshot.
- `n_sites` (int): Number of atomic sites.
- `structure` (string): **pymatgen `Structure` JSON** (serialized).

Note: Numeric fields are preserved as-is from the Alexandria snapshot. Refer to the original Alexandria documentation for exact units/definitions.
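The field list above can be turned into a quick per-row sanity check. The following is a minimal sketch using only the standard library; it is not part of the dataset tooling. The helper name `check_row` and the example row are hypothetical. The sketch also assumes that the serialized `Structure` JSON carries a top-level `"sites"` list and that no field is null, which should be verified against real rows.

```python
import json

# Column names and Python types, following the Data Fields list above.
EXPECTED_FIELDS = {
    "material_id": str,
    "formula": str,
    "energy": float,
    "e_above_hull": float,
    "band_gap": float,
    "total_mag": float,
    "n_sites": int,
    "structure": str,
}


def check_row(row):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    if problems:
        return problems
    # Assumption: the serialized Structure has a top-level "sites" list.
    sites = json.loads(row["structure"]).get("sites", [])
    if len(sites) != row["n_sites"]:
        problems.append(
            f"n_sites={row['n_sites']} but structure has {len(sites)} sites"
        )
    return problems


# Hypothetical row for illustration only; not taken from the dataset.
example_row = {
    "material_id": "example-0",
    "formula": "NaCl",
    "energy": -1.0,
    "e_above_hull": 0.0,
    "band_gap": 5.0,
    "total_mag": 0.0,
    "n_sites": 2,
    "structure": json.dumps({"sites": [{"label": "Na"}, {"label": "Cl"}]}),
}

print(check_row(example_row))
```

Because `check_row` only needs a dict-like row, it could in principle be applied to rows yielded by the `datasets` library, although that has not been tested here.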
### Data Splits

Single split:

- `train`: 5,068,744 examples

## Usage

Load the dataset:

```python
from datasets import load_dataset

ds = load_dataset("xpanceo-team/alexandria", split="train")
print(ds)
print(ds.column_names)
```

Convert to pandas (can be heavy for 5M rows):

```python
df = ds.to_pandas()
```

Parse `structure` with pymatgen:

```python
from pymatgen.core import Structure

row = ds[0]
structure = Structure.from_str(row["structure"], fmt="json")
print(structure.composition, len(structure))
```

## Citation

If you use this dataset, please cite the upstream Alexandria publication and acknowledge this Hugging Face repackaging.

```bibtex
@article{SCHMIDT2024101560,
  title = {Improving machine-learning models in materials science through large datasets},
  journal = {Materials Today Physics},
  volume = {48},
  pages = {101560},
  year = {2024},
  issn = {2542-5293},
  doi = {https://doi.org/10.1016/j.mtphys.2024.101560},
  url = {https://www.sciencedirect.com/science/article/pii/S2542529324002360},
  author = {Jonathan Schmidt and Tiago F.T. Cerqueira and Aldo H. Romero and Antoine Loew and Fabian Jäger and Hai-Chen Wang and Silvana Botti and Miguel A.L. Marques},
  abstract = {The accuracy of a machine learning model is limited by the quality and quantity of the data available for its training and validation. This problem is particularly challenging in materials science, where large, high-quality, and consistent datasets are scarce. Here we present alexandria, an open database of more than 5 million density-functional theory calculations for periodic three-, two-, and one-dimensional compounds. We use this data to train machine learning models to reproduce seven different properties using both composition-based models and crystal-graph neural networks. In the majority of cases, the error of the models decreases monotonically with the training data, although some graph networks seem to saturate for large training set sizes. Differences in the training can be correlated with the statistical distribution of the different properties. We also observe that graph-networks, that have access to detailed geometrical information, yield in general more accurate models than simple composition-based methods. Finally, we assess several universal machine learning interatomic potentials. Crystal geometries optimised with these force fields are very high quality, but unfortunately the accuracy of the energies is still lacking. Furthermore, we observe some instabilities for regions of chemical space that are undersampled in the training sets used for these models. This study highlights the potential of large-scale, high-quality datasets to improve machine learning models in materials science.}
}
```

## License

This dataset is distributed under **Creative Commons Attribution 4.0 (CC BY 4.0)**, consistent with the upstream Alexandria database license.
Provider:
xpanceo-team