xpanceo-team/alexandria
Source: Hugging Face. Updated 2026-01-18; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/xpanceo-team/alexandria
Description:
---
dataset_info:
  features:
  - name: material_id
    dtype: large_string
  - name: formula
    dtype: large_string
  - name: energy
    dtype: float64
  - name: e_above_hull
    dtype: float64
  - name: band_gap
    dtype: float64
  - name: total_mag
    dtype: float64
  - name: n_sites
    dtype: uint32
  - name: structure
    dtype: large_string
  splits:
  - name: train
    num_bytes: 14568806338
    num_examples: 5068744
  download_size: 4927116810
  dataset_size: 14568806338
license: cc-by-4.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- materials-science
- crystal-structures
pretty_name: Alexandria Materials Database
size_categories:
- 1M<n<10M
---
# Dataset Card for xpanceo-team/alexandria
## Dataset Summary
This dataset is a snapshot of the **Alexandria** materials database, republished with a standardized `crystal-diffusers` schema.
Snapshot date: **2026-01-16**.
The `structure` column stores **pymatgen `Structure` JSON** serialized as a string.
## Preprocessing (TBD)
...
## Dataset Structure
### Data Instances
Each row corresponds to one material entry identified by `material_id`, with computed properties and an associated crystal structure.
### Data Fields
- `material_id` (string): Alexandria identifier.
- `formula` (string): Chemical formula.
- `energy` (float): Energy value from the Alexandria snapshot.
- `e_above_hull` (float): Energy above hull from the Alexandria snapshot.
- `band_gap` (float): Band gap from the Alexandria snapshot.
- `total_mag` (float): Total magnetization from the Alexandria snapshot.
- `n_sites` (int): Number of atomic sites.
- `structure` (string): **pymatgen `Structure` JSON** (serialized).
Note: Numeric fields are preserved as-is from the Alexandria snapshot. Refer to the original Alexandria documentation for exact units/definitions.
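The field list above can be mirrored as a small typed record, which makes it easier to catch a missing or renamed column early. This is only a sketch: the class name `AlexandriaRow` and the helper `from_mapping` are hypothetical and not part of the dataset or of any library, and allowing `None` for the numeric properties is an assumption, since the card does not say whether nulls occur.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping, Optional


def _opt_float(value: Any) -> Optional[float]:
    """Convert to float, passing None through (nulls are an assumption)."""
    return None if value is None else float(value)


@dataclass(frozen=True)
class AlexandriaRow:
    """Hypothetical record mirroring the eight columns of the dataset."""

    material_id: str
    formula: str
    energy: Optional[float]
    e_above_hull: Optional[float]
    band_gap: Optional[float]
    total_mag: Optional[float]
    n_sites: int
    structure: str  # pymatgen Structure JSON, kept as a serialized string

    @classmethod
    def from_mapping(cls, row: Mapping[str, Any]) -> "AlexandriaRow":
        """Build a record from a dict-like row, failing on missing columns."""
        missing = [f.name for f in fields(cls) if f.name not in row]
        if missing:
            raise KeyError(f"row is missing columns: {missing}")
        return cls(
            material_id=str(row["material_id"]),
            formula=str(row["formula"]),
            energy=_opt_float(row["energy"]),
            e_above_hull=_opt_float(row["e_above_hull"]),
            band_gap=_opt_float(row["band_gap"]),
            total_mag=_opt_float(row["total_mag"]),
            n_sites=int(row["n_sites"]),
            structure=str(row["structure"]),
        )
```

With a loaded dataset, `AlexandriaRow.from_mapping(ds[0])` would wrap the first row. The record deliberately carries no units, in line with the note above.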
### Data Splits
Single split:
- `train`: 5,068,744 examples
## Usage
Load the dataset:
```python
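from datasets import load_dataset

# Downloads the full train split on first use (about 4.9 GB per the card metadata).
ds = load_dataset("xpanceo-team/alexandria", split="train")
print(ds)               # row count and feature summary
print(ds.column_names)  # the eight columns listed under Data Fields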
```
Convert to pandas (this materializes all 5,068,744 rows in memory; the card metadata lists roughly 14.6 GB of uncompressed data):
```python
df = ds.to_pandas()
```
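If the full table does not fit in memory, one option is to stream the split and keep only a bounded sample of the scalar columns. The sketch below assumes the `streaming=True` mode of `datasets.load_dataset`; the helper names `stream_rows` and `collect_sample` and the example filter are hypothetical. The `datasets` import is done lazily so that `collect_sample` can also be used on any iterable of dict-like rows.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List

# Every column except the large serialized `structure` string.
SCALAR_COLUMNS = [
    "material_id", "formula", "energy", "e_above_hull",
    "band_gap", "total_mag", "n_sites",
]


def stream_rows(repo_id: str = "xpanceo-team/alexandria") -> Iterator[Dict[str, Any]]:
    """Yield rows lazily instead of downloading the whole split first."""
    from datasets import load_dataset  # lazy import: only needed when streaming

    yield from load_dataset(repo_id, split="train", streaming=True)


def collect_sample(
    rows: Iterable[Dict[str, Any]],
    keep: Callable[[Dict[str, Any]], bool],
    limit: int,
) -> List[Dict[str, Any]]:
    """Return up to `limit` rows accepted by `keep`, without `structure`."""
    matching = (row for row in rows if keep(row))
    return [
        {name: row[name] for name in SCALAR_COLUMNS}
        for row in islice(matching, limit)
    ]


# Example use (requires network access and the `datasets` package):
# sample = collect_sample(stream_rows(), keep=lambda r: r["n_sites"] <= 4, limit=1000)
# df_small = pandas.DataFrame(sample)
```

Because `islice` stops pulling rows once `limit` matches are found, only the beginning of the stream is read when matches are common. The resulting sample follows the stored row order and is not a random draw.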
Parse `structure` with pymatgen:
```python
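from pymatgen.core import Structure

row = ds[0]  # first row as a plain dict
# The column holds a JSON string, so parse it with fmt="json".
structure = Structure.from_str(row["structure"], fmt="json")
print(structure.composition, len(structure))  # composition and number of sites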
```
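For quick sanity checks, the `structure` string can also be inspected with the standard library alone, without constructing a pymatgen object. The sketch below compares the number of serialized sites with the `n_sites` column. It assumes the JSON follows the layout produced by pymatgen's `Structure.as_dict()`, with a top-level `"sites"` list; the function names are hypothetical.

```python
import json
from typing import Any, Mapping


def site_count_from_json(structure_json: str) -> int:
    """Count sites in a serialized structure.

    Assumes a top-level "sites" list, as in pymatgen's Structure.as_dict().
    """
    payload = json.loads(structure_json)
    return len(payload["sites"])


def n_sites_matches(row: Mapping[str, Any]) -> bool:
    """Check that the `n_sites` column agrees with the serialized structure."""
    return site_count_from_json(row["structure"]) == int(row["n_sites"])
```

A call such as `n_sites_matches(ds[0])` would be expected to return `True` if the column was derived from the stored structure, which the card does not state explicitly.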
## Citation
If you use this dataset, please cite the upstream Alexandria publication and acknowledge this Hugging Face repackaging.
```bibtex
@article{SCHMIDT2024101560,
title = {Improving machine-learning models in materials science through large datasets},
journal = {Materials Today Physics},
volume = {48},
pages = {101560},
year = {2024},
issn = {2542-5293},
doi = {10.1016/j.mtphys.2024.101560},
url = {https://www.sciencedirect.com/science/article/pii/S2542529324002360},
author = {Jonathan Schmidt and Tiago F.T. Cerqueira and Aldo H. Romero and Antoine Loew and Fabian Jäger and Hai-Chen Wang and Silvana Botti and Miguel A.L. Marques},
abstract = {The accuracy of a machine learning model is limited by the quality and quantity of the data available for its training and validation. This problem is particularly challenging in materials science, where large, high-quality, and consistent datasets are scarce. Here we present alexandria, an open database of more than 5 million density-functional theory calculations for periodic three-, two-, and one-dimensional compounds. We use this data to train machine learning models to reproduce seven different properties using both composition-based models and crystal-graph neural networks. In the majority of cases, the error of the models decreases monotonically with the training data, although some graph networks seem to saturate for large training set sizes. Differences in the training can be correlated with the statistical distribution of the different properties. We also observe that graph-networks, that have access to detailed geometrical information, yield in general more accurate models than simple composition-based methods. Finally, we assess several universal machine learning interatomic potentials. Crystal geometries optimised with these force fields are very high quality, but unfortunately the accuracy of the energies is still lacking. Furthermore, we observe some instabilities for regions of chemical space that are undersampled in the training sets used for these models. This study highlights the potential of large-scale, high-quality datasets to improve machine learning models in materials science.}
}
```
## License
This dataset is distributed under **Creative Commons Attribution 4.0 (CC BY 4.0)**, consistent with the upstream Alexandria database license.
Provider:
xpanceo-team



