OMAT24
Source: ModelScope community (updated 2026-01-07, indexed 2025-07-26)
Download link:
https://modelscope.cn/datasets/AI-ModelScope/OMAT24
<h1 align="center" style="font-size: 36px;">Meta Open Materials 2024 (OMat24) Dataset</h1>
<p align="center">
<img width="559" height="200" src="https://cdn-uploads.huggingface.co/production/uploads/67004f02d66ad0efb0d494c3/yYySyR4CZjnRr09MB33bS.png">
</p>
## Overview
Several datasets were utilized in this work. We provide open access to all datasets used to help accelerate research in the community.
This includes the OMat24 dataset as well as our modified sAlex dataset. Details on the different datasets are provided below.
The OMat24 datasets can be used with the [FAIRChem package](https://fair-chem.github.io/). See section on "How to read the data" below for a minimal example.
## Datasets
### OMat24 Dataset
The OMat24 dataset contains a mix of single point calculations of non-equilibrium structures and
structural relaxations. The dataset contains structures labeled with total energy (eV), forces (eV/Å)
and stress (eV/Å³). The dataset is provided in ASE DB compatible LMDB files.
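Stress labels are stored in eV/Å³, while many references report stress in GPa. The conversion is a fixed factor that follows from 1 eV = 1.602176634e-19 J and 1 Å³ = 1e-30 m³. The sketch below shows it; the helper name `stress_to_gpa` and the example values are illustrative and not part of FAIRChem or the dataset.

```python
# 1 eV/Å^3 = 1.602176634e-19 J / 1e-30 m^3 = 1.602176634e11 Pa = 160.2176634 GPa
EV_PER_A3_TO_GPA = 160.2176634


def stress_to_gpa(stress_ev_per_a3):
    """Convert a sequence of stress components from eV/Å^3 to GPa."""
    return [component * EV_PER_A3_TO_GPA for component in stress_ev_per_a3]


# Made-up stress components in eV/Å^3, used only to exercise the helper.
example_stress = [0.01, 0.01, 0.01, 0.0, 0.0, 0.0]
print(stress_to_gpa(example_stress))
```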
We provide two splits: train and validation. Each split comprises several subdatasets based on the different input generation strategies; see the paper for more details.
The OMat24 train and validation splits are fully compatible with the Matbench Discovery benchmark test set:
1. The splits do not contain any structure that has a protostructure label present in the initial or relaxed
structures of the WBM dataset.
2. The splits do not include any structure that was generated starting from an Alexandria relaxed structure with a
protostructure label in the initial or relaxed structures of the WBM dataset.
##### <ins> Train </ins>
| Sub-dataset | Size | Download |
| :-----: | :--: | :------: |
| rattled-1000 | 11,388,510 | [rattled-1000.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/omat/train/rattled-1000.tar.gz) |
| rattled-1000-subsampled | 3,879,741 | [rattled-1000-subsampled.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/omat/train/rattled-1000-subsampled.tar.gz) |
| rattled-500 | 6,922,197 | [rattled-500.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/omat/train/rattled-500.tar.gz) |
| rattled-500-subsampled | 3,975,416 | [rattled-500-subsampled.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/omat/train/rattled-500-subsampled.tar.gz) |
| rattled-300 | 6,319,139 | [rattled-300.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/omat/train/rattled-300.tar.gz) |
| rattled-300-subsampled | 3,464,007 | [rattled-300-subsampled.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/omat/train/rattled-300-subsampled.tar.gz) |
| aimd-from-PBE-1000-npt | 21,269,486 | [aimd-from-PBE-1000-npt.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/omat/train/aimd-from-PBE-1000-npt.tar.gz) |
| aimd-from-PBE-1000-nvt | 20,256,650 | [aimd-from-PBE-1000-nvt.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/omat/train/aimd-from-PBE-1000-nvt.tar.gz) |
| aimd-from-PBE-3000-npt | 6,076,290 | [aimd-from-PBE-3000-npt.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/omat/train/aimd-from-PBE-3000-npt.tar.gz) |
| aimd-from-PBE-3000-nvt | 7,839,846 | [aimd-from-PBE-3000-nvt.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/omat/train/aimd-from-PBE-3000-nvt.tar.gz) |
| rattled-relax | 9,433,303 | [rattled-relax.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/omat/train/rattled-relax.tar.gz) |
| Total | 100,824,585 | - |
##### <ins> Validation </ins>
Models were evaluated on a ~1M subset for training efficiency. We provide that set below.
**_NOTE:_** The original validation sets contained duplicated structures. Corrected validation sets were uploaded on 20 December 2024. Please see this [issue](https://github.com/FAIR-Chem/fairchem/issues/942)
for more details, and re-download the corrected version of the validation sets if needed.
| Sub-dataset | Size | Download |
| :-----: | :--: | :------: |
| rattled-1000 | 117,004 | [rattled-1000.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241220/omat/val/rattled-1000.tar.gz) |
| rattled-1000-subsampled | 39,785 | [rattled-1000-subsampled.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241220/omat/val/rattled-1000-subsampled.tar.gz) |
| rattled-500 | 71,522 | [rattled-500.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241220/omat/val/rattled-500.tar.gz) |
| rattled-500-subsampled | 41,021 | [rattled-500-subsampled.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241220/omat/val/rattled-500-subsampled.tar.gz) |
| rattled-300 | 65,235 | [rattled-300.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241220/omat/val/rattled-300.tar.gz) |
| rattled-300-subsampled | 35,579 | [rattled-300-subsampled.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241220/omat/val/rattled-300-subsampled.tar.gz) |
| aimd-from-PBE-1000-npt | 212,737 | [aimd-from-PBE-1000-npt.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241220/omat/val/aimd-from-PBE-1000-npt.tar.gz) |
| aimd-from-PBE-1000-nvt | 205,165 | [aimd-from-PBE-1000-nvt.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241220/omat/val/aimd-from-PBE-1000-nvt.tar.gz) |
| aimd-from-PBE-3000-npt | 62,130 | [aimd-from-PBE-3000-npt.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241220/omat/val/aimd-from-PBE-3000-npt.tar.gz) |
| aimd-from-PBE-3000-nvt | 79,977 | [aimd-from-PBE-3000-nvt.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241220/omat/val/aimd-from-PBE-3000-nvt.tar.gz) |
| rattled-relax | 95,206 | [rattled-relax.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241220/omat/val/rattled-relax.tar.gz) |
| Total | 1,025,361 | - |
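The download links in the two tables above follow one pattern, so they can be generated rather than copied by hand. The sketch below rebuilds the URLs from that pattern; the function name `omat24_url` is illustrative, and the release stamps (`241018` for train, `241220` for the corrected validation sets) are read off the links listed above.

```python
BASE_URL = "https://dl.fbaipublicfiles.com/opencatalystproject/data/omat"

# Release stamps as they appear in the table links above.
RELEASE = {"train": "241018", "val": "241220"}

SUBDATASETS = [
    "rattled-1000",
    "rattled-1000-subsampled",
    "rattled-500",
    "rattled-500-subsampled",
    "rattled-300",
    "rattled-300-subsampled",
    "aimd-from-PBE-1000-npt",
    "aimd-from-PBE-1000-nvt",
    "aimd-from-PBE-3000-npt",
    "aimd-from-PBE-3000-nvt",
    "rattled-relax",
]


def omat24_url(split, subdataset):
    """Build the download URL of one OMat24 sub-dataset archive."""
    if split not in RELEASE:
        raise ValueError(f"unknown split: {split!r}")
    if subdataset not in SUBDATASETS:
        raise ValueError(f"unknown sub-dataset: {subdataset!r}")
    return f"{BASE_URL}/{RELEASE[split]}/omat/{split}/{subdataset}.tar.gz"


# Print every validation archive URL.
for name in SUBDATASETS:
    print(omat24_url("val", name))
```

Each URL can then be passed to any download tool and the archive unpacked with `tar` or Python's `tarfile` module.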
##### <ins> 1M Subsplit </ins>
We provide a randomly subsampled 1M training split, together with the corresponding validation and test sets, to make it easier to iterate and develop before training on the full dataset.
Download: [omat24_1M](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/251210/omat24_1M_251210.tar.gz)
| Train sub-dataset | Size |
| :-----: | :--: |
| rattled-1000 | 113,337 |
| rattled-1000-subsampled | 38,603 |
| rattled-500 | 68,885 |
| rattled-500-subsampled | 39,548 |
| rattled-300 | 62,894 |
| rattled-300-subsampled | 34,462 |
| aimd-from-PBE-1000-npt | 211,578 |
| aimd-from-PBE-1000-nvt | 201,512 |
| aimd-from-PBE-3000-npt | 60,477 |
| aimd-from-PBE-3000-nvt | 78,011 |
| rattled-relax | 100,543 |
| Total | 1,009,850 |
### sAlex Dataset
We also provide the sAlex dataset used for fine-tuning our OMat models. sAlex is a subsampled, Matbench Discovery compliant version of the original [Alexandria](https://alexandria.icams.rub.de/) dataset.
sAlex was created by removing structures matched in WBM and sampling only structures along a trajectory with an energy difference greater than 10 meV/atom. For full details,
please see the manuscript.
| Dataset | Split | Size | Download |
| :-----: | :---: | :--: | :------: |
| sAlex | train | 10,447,765 | [train.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/sAlex/train.tar.gz) |
| sAlex | val | 553,218 | [val.tar.gz](https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/sAlex/val.tar.gz) |
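One plausible reading of the energy-difference rule is a greedy pass along each trajectory that keeps a frame only when its energy per atom differs from the last kept frame by more than 10 meV/atom. The exact procedure is described in the manuscript; the sketch below only illustrates that assumed reading, and the function name, inputs, and example energies are hypothetical.

```python
def subsample_by_energy(energies_per_atom, threshold=0.010):
    """Return indices of frames to keep from one trajectory.

    A frame is kept when its energy per atom (eV/atom) differs from the
    previously kept frame by more than `threshold` (default 10 meV/atom).
    The first frame is always kept. This is an assumed greedy scheme,
    not necessarily the one used to build sAlex.
    """
    kept = []
    last_energy = None
    for index, energy in enumerate(energies_per_atom):
        if last_energy is None or abs(energy - last_energy) > threshold:
            kept.append(index)
            last_energy = energy
    return kept


# Made-up relaxation trajectory energies in eV/atom.
trajectory = [-5.000, -5.004, -5.013, -5.020, -5.0305, -5.031]
print(subsample_by_energy(trajectory))
```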
## How to read the data
The OMat24 and sAlex datasets can be accessed with the [fairchem](https://github.com/FAIR-Chem/fairchem) library. This package can be installed with:
```
pip install fairchem-core
```
Dataset files are written as `AseLMDBDatabase` objects which are an implementation of an [ASE Database](https://wiki.fysik.dtu.dk/ase/ase/db/db.html),
in LMDB format. A single `*.aselmdb` file can be read and queried like any other ASE DB (not recommended as there are many files!).
You can also read many DB files at once and access atoms objects using the `AseDBDataset` class.
For example to read the **rattled-relax** subdataset,
```python
from fairchem.core.datasets import AseDBDataset
dataset_path = "/path/to/omat24/train/rattled-relax"
config_kwargs = {} # see tutorial on additional configuration
dataset = AseDBDataset(config=dict(src=dataset_path, **config_kwargs))
# atoms objects can be retrieved by index
atoms = dataset.get_atoms(0)
```
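Assuming the returned object behaves like a standard ASE `Atoms` with the stored labels attached through a calculator, the labels can be read with the usual ASE accessors `get_potential_energy()`, `get_forces()`, and `get_stress()`. The helper `summarize_labels` below is hypothetical, and the stand-in class with made-up values exists only so the snippet runs without the dataset.

```python
import math


def summarize_labels(atoms):
    """Summarize the energy, force, and stress labels of an ASE-like object.

    Relies only on len(atoms), get_potential_energy(), get_forces(),
    and get_stress(), all of which ase.Atoms provides.
    """
    n_atoms = len(atoms)
    energy = atoms.get_potential_energy()  # eV
    forces = atoms.get_forces()            # eV/Å, one 3-vector per atom
    stress = atoms.get_stress()            # eV/Å^3
    # Largest force magnitude over all atoms.
    max_force = max(math.sqrt(sum(c * c for c in force)) for force in forces)
    return {
        "n_atoms": n_atoms,
        "energy_per_atom": energy / n_atoms,
        "max_force": max_force,
        "stress": list(stress),
    }


class FakeAtoms:
    """Stand-in with made-up values; not a real OMat24 entry."""

    def __len__(self):
        return 2

    def get_potential_energy(self):
        return -10.0

    def get_forces(self):
        return [[0.0, 0.0, 0.3], [0.0, 0.4, 0.3]]

    def get_stress(self):
        return [0.01, 0.01, 0.01, 0.0, 0.0, 0.0]


print(summarize_labels(FakeAtoms()))
```

With the real dataset, the same helper would be called as `summarize_labels(dataset.get_atoms(0))`.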
To read more than one subdataset you can simply pass a list of subdataset paths,
```python
from fairchem.core.datasets import AseDBDataset
config_kwargs = {} # see tutorial on additional configuration
dataset_paths = [
"/path/to/omat24/train/rattled-relax",
"/path/to/omat24/train/rattled-1000-subsampled",
"/path/to/omat24/train/rattled-1000"
]
dataset = AseDBDataset(config=dict(src=dataset_paths, **config_kwargs))
```
To read all of the OMat24 training or validation splits, simply pass the paths to all subdatasets.
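Rather than typing every path, the list can be collected from the split directory. The sketch below assumes each extracted sub-dataset sits in its own directory under the split root, as in the example paths above; the helper name `list_subdataset_dirs` is illustrative, and a temporary directory stands in for the real data.

```python
import tempfile
from pathlib import Path


def list_subdataset_dirs(split_root):
    """Return sorted paths of all sub-dataset directories under a split root."""
    root = Path(split_root)
    return sorted(str(path) for path in root.iterdir() if path.is_dir())


# Temporary directory mimicking the assumed layout of an extracted split.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("rattled-relax", "rattled-1000", "aimd-from-PBE-1000-npt"):
        (Path(tmp) / name).mkdir()
    dataset_paths = list_subdataset_dirs(tmp)
    print([Path(path).name for path in dataset_paths])
```

On real data, the resulting list would be passed as `src`, for example `AseDBDataset(config=dict(src=list_subdataset_dirs("/path/to/omat24/train")))`.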
## Support
If you run into any issues, feel free to post your questions or comments on any of the following platforms:
- [HF Discussions](https://huggingface.co/datasets/fairchem/OMAT24/discussions)
- [GitHub Issues](https://github.com/FAIR-Chem/fairchem/issues)
- [Discussion Board](https://discuss.opencatalystproject.org/)
## Citation
The OMat24 dataset is licensed under a [Creative Commons Attribution 4.0 License](https://creativecommons.org/licenses/by/4.0/legalcode). If you use this work, please cite:
```
@misc{barroso_omat24,
title={Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models},
author={Luis Barroso-Luque and Muhammed Shuaibi and Xiang Fu and Brandon M. Wood and Misko Dzamba and Meng Gao and Ammar Rizvi and C. Lawrence Zitnick and Zachary W. Ulissi},
year={2024},
eprint={2410.12771},
archivePrefix={arXiv},
primaryClass={cond-mat.mtrl-sci},
url={https://arxiv.org/abs/2410.12771},
}
```
We hope to move our datasets and models to the Hugging Face Hub in the near future to make them more accessible to the community.
Provider:
maas
Created:
2024-10-23
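Background overview
The OMat24 dataset is an open materials dataset containing calculations on non-equilibrium structures and structural relaxations. It is divided into training and validation parts that together hold more than 100 million data points. The dataset is provided in LMDB format and can be accessed and queried through the fairchem library for materials science research.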



