aqcat25-dataset
Source: ModelScope community. Updated 2025-12-05; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/SandboxAQ/aqcat25-dataset
<h1 align="center" style="font-size: 36px;">
<span style="color: #FFD700;">AQCat25 Dataset:</span> Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis
</h1>

This repository contains the **AQCat25 dataset**. AQCat25-EV2 models can be accessed [here](https://huggingface.co/SandboxAQ/aqcat25-ev2).
The AQCat25 dataset provides a large and diverse collection of **13.5 million** DFT calculation trajectories, encompassing approximately 5K materials and 47K intermediate-catalyst systems. It is designed to complement existing large-scale datasets by providing calculations at **higher fidelity** and including critical **spin-polarized** systems, which are essential for accurately modeling many industrially relevant catalysts.
Please see our [website](https://www.sandboxaq.com/aqcat25) and [paper](https://cdn.prod.website-files.com/622a3cfaa89636b753810f04/68ffc1e7c907b6088573ba8c_AQCat25.pdf) for more details about the impact of the dataset and [models](https://huggingface.co/SandboxAQ/aqcat25-ev2).
## 1. AQCat25 Dataset Details
This repository uses a hybrid approach, providing lightweight, queryable Parquet files for each split alongside compressed archives (`.tar.gz`) of the raw ASE database files. More details can be found below.
### Queryable Metadata (Parquet Files)
A set of Parquet files provides a "table of contents" for the dataset. They can be loaded directly with the `datasets` library for fast browsing and filtering. Each file contains the following columns:
| Column Name | Data Type | Description | Example |
| :--- | :--- | :--- | :--- |
| `frame_id` | string | **Unique ID for this dataset**. Formatted as `database_name::index`. | `data.0015.aselmdb::42` |
| `adsorption_energy`| float | **Key Target**. The calculated adsorption energy in eV. | -1.542 |
| `total_energy` | float | The raw total energy of the adslab system from DFT (in eV). | -567.123 |
| `fmax` | float | The maximum force magnitude on any single atom in eV/Å. | 0.028 |
| `is_spin_off` | boolean | `True` if the system is non-magnetic (VASP ISPIN=1). | `false` |
| `mag` | float | The total magnetization of the system (µB). | 32.619 |
| `slab_id` | string | Identifier for the clean slab structure. | `mp-1216478_001_2_False` |
| `adsorbate` | string | SMILES or chemical formula of the adsorbate. | `*NH2N(CH3)2` |
| `is_rerun` | boolean | `True` if the calculation is a continuation. | `false` |
| `is_md` | boolean | `True` if the frame is from a molecular dynamics run. | `false` |
| `sid` | string | The original system ID from the source data. | `vadslabboth_82` |
| `fid` | integer | The original frame index (step number) from the source VASP calculation. | 0 |
---
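As a rough illustration of filtering on these columns, the sketch below applies a row-level predicate to a few in-memory rows. The first row reuses the example values from the table above; the other two rows, the 0.05 eV/Å force cutoff, and the function name `is_low_force_spin_on` are made up for illustration. Because iterating a `datasets.Dataset` yields one dict per row, the same predicate could in principle be passed to `Dataset.filter`, though that is not exercised here.

```python
# Illustrative rows using a subset of the metadata columns described above.
# Row 0 copies the example values from the table; rows 1-2 are invented.
rows = [
    {"frame_id": "data.0015.aselmdb::42", "adsorption_energy": -1.542,
     "fmax": 0.028, "is_spin_off": False, "mag": 32.619,
     "adsorbate": "*NH2N(CH3)2", "is_md": False},
    {"frame_id": "data.0015.aselmdb::43", "adsorption_energy": -0.310,
     "fmax": 0.412, "is_spin_off": False, "mag": 31.900,
     "adsorbate": "*CO", "is_md": False},
    {"frame_id": "data.0016.aselmdb::7", "adsorption_energy": -2.105,
     "fmax": 0.019, "is_spin_off": True, "mag": 0.0,
     "adsorbate": "*OH", "is_md": False},
]

FMAX_CUTOFF = 0.05  # eV/Å; hypothetical cutoff, not taken from the dataset docs


def is_low_force_spin_on(row: dict) -> bool:
    """Keep spin-polarized, non-MD frames whose largest force is below the cutoff."""
    return (
        not row["is_spin_off"]
        and not row["is_md"]
        and row["fmax"] < FMAX_CUTOFF
    )


selected = [row["frame_id"] for row in rows if is_low_force_spin_on(row)]
print(selected)
```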
#### Understanding `frame_id` and `fid`
| Field | Purpose | Example |
| :--- | :--- | :--- |
| `fid` | **Original Frame Index**: This is the step number from the original VASP relaxation (`ionic_steps`). It tells you where the frame came from in its source simulation. | `4` (the 5th frame of a specific VASP run) |
| `frame_id` | **Unique Dataset Pointer**: This is a new ID created for this specific dataset. It tells you exactly which file (`data.0015.aselmdb`) and which row (`101`) to look in to find the full atomic structure. | `data.0015.aselmdb::101` |
---
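To make the pointer format concrete, here is a minimal sketch that splits a `frame_id` into its database file name and row index. The helper name `parse_frame_id` and the `FramePointer` type are hypothetical. The sketch only parses the string; how the index maps onto rows when the `.aselmdb` file is opened with ASE is not covered here.

```python
from typing import NamedTuple


class FramePointer(NamedTuple):
    database_name: str  # e.g. "data.0015.aselmdb"
    index: int          # row within that database file


def parse_frame_id(frame_id: str) -> FramePointer:
    """Split 'database_name::index' on the last '::' separator."""
    database_name, sep, index = frame_id.rpartition("::")
    if not sep or not database_name:
        raise ValueError(f"not a valid frame_id: {frame_id!r}")
    return FramePointer(database_name, int(index))


pointer = parse_frame_id("data.0015.aselmdb::101")
print(pointer.database_name, pointer.index)
```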
### Downloadable Data Archives
The full, raw data for each split is available for download as compressed `.tar.gz` archives; the table below provides direct download links. The queryable Parquet files for each split can be loaded directly with the `datasets` library or queried with the helper scripts described in Section 2, "Dataset Usage Guide".
The data currently available for download (totaling ~11.1M frames, as listed in the table below) is the initial dataset version (v1.0) released on September 10, 2025. The 13.5M frame count mentioned in our paper and the introduction includes additional data used to rebalance non-magnetic element systems and add a low-fidelity spin-on dataset. These new data splits will be added to this repository soon.
| Split Name | Structures | Archive Size | Download Link |
| :--- | :--- | :--- | :--- |
| ***In-Domain (ID)*** | | | |
| Train | `7,386,750` | `23.8 GB` | [`train_id.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/train_id.tar.gz) |
| Validation | `254,498` | `825 MB` | [`val_id.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_id.tar.gz) |
| Test | `260,647` | `850 MB` | [`test_id.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_id.tar.gz) |
| Slabs | `898,530` | `2.56 GB` | [`id_slabs.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/id_slabs.tar.gz) |
| ***Out-of-Distribution (OOD) Validation*** | | | |
| OOD Ads (Val) | `577,368` | `1.74 GB` | [`val_ood_ads.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_ads.tar.gz) |
| OOD Materials (Val) | `317,642` | `963 MB` | [`val_ood_mat.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_mat.tar.gz) |
| OOD Both (Val) | `294,824` | `880 MB` | [`val_ood_both.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_both.tar.gz) |
| OOD Slabs (Val) | `28,971` | `83 MB` | [`val_ood_slabs.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_slabs.tar.gz) |
| ***Out-of-Distribution (OOD) Test*** | | | |
| OOD Ads (Test) | `346,738` | `1.05 GB` | [`test_ood_ads.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_ads.tar.gz) |
| OOD Materials (Test) | `315,931` | `993 MB` | [`test_ood_mat.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_mat.tar.gz) |
| OOD Both (Test) | `355,504` | `1.1 GB` | [`test_ood_both.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_both.tar.gz) |
| OOD Slabs (Test) | `35,936` | `109 MB` | [`test_ood_slabs.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_slabs.tar.gz) |
---
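As a quick arithmetic check on the ~11.1M figure, the structure counts in the table above can be summed directly. The numbers are copied from the table; the variable names are illustrative, and the gap to 13.5M is only approximate because that figure is itself rounded.

```python
# Structure counts per split, copied from the download table.
split_sizes = {
    "train_id": 7_386_750,
    "val_id": 254_498,
    "test_id": 260_647,
    "id_slabs": 898_530,
    "val_ood_ads": 577_368,
    "val_ood_mat": 317_642,
    "val_ood_both": 294_824,
    "val_ood_slabs": 28_971,
    "test_ood_ads": 346_738,
    "test_ood_mat": 315_931,
    "test_ood_both": 355_504,
    "test_ood_slabs": 35_936,
}

total = sum(split_sizes.values())
print(f"{total:,} structures (~{total / 1e6:.1f}M)")  # 11,073,339 structures (~11.1M)

# Rough number of frames still to be released, relative to the quoted 13.5M.
remaining = 13_500_000 - total
print(f"~{remaining / 1e6:.1f}M frames not yet in this repository")
```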
## 2. Dataset Usage Guide
This guide outlines the recommended workflow for accessing and querying the AQCat25 dataset.
### 2.1 Initial Setup
Before you begin, you need to install the necessary libraries and authenticate with Hugging Face. This is a one-time setup.
```bash
pip install datasets pandas ase tqdm requests huggingface_hub ase-db-backends
```
**1. Create a Hugging Face Account:**
If you don't have one, create an account at [huggingface.co](https://huggingface.co/join).
**2. Create an Access Token:**
Navigate to your **Settings -> Access Tokens** page or click [here](https://huggingface.co/settings/tokens). Create a new token with at least **`read`** permissions. Copy this token to your clipboard.
**3. Log in via the Command Line:**
Open your terminal, run the following command, and paste your token when prompted:
```bash
hf auth login
```
### 2.2 Get the Helper Scripts
You may copy the scripts directly from this repository, or download them by running the following in your local Python environment:
```python
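from huggingface_hub import snapshot_download

# Fetch only the helper scripts and README, not the data archives.
# The repo ID matches the dataset download URLs listed above.
snapshot_download(
    repo_id="SandboxAQ/aqcat25-dataset",
    repo_type="dataset",
    allow_patterns=["scripts/*", "README.md"],
    local_dir="./aqcat25",
)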
```
This will create a local folder named `aqcat25` containing the `scripts/` directory.
### 2.3 Download Desired Dataset Splits
Data splits may be downloaded directly via the Hugging Face UI, or via the `download_split.py` script (found in `aqcat25/scripts/`).
```bash
python aqcat25/scripts/download_split.py --split val_id
```
This will download `val_id.tar.gz` and extract it to a new folder named `aqcat_data/val_id/`.
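For readers who prefer not to use the helper script, the download-and-extract step could be approximated as below. This is a sketch under assumptions, not the implementation of `download_split.py`: the function names are hypothetical, the repo ID and archive names are taken from the download links in the table above, and the `aqcat_data/` output location simply mirrors the behavior described for the script. The folder layout after extraction depends on how each archive is packed.

```python
import tarfile
from pathlib import Path

REPO_ID = "SandboxAQ/aqcat25-dataset"  # taken from the download URLs above


def extract_archive(archive: Path, out_dir: Path) -> Path:
    """Extract a .tar.gz archive into out_dir and return out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters.
            tar.extractall(out_dir, filter="data")
        else:
            tar.extractall(out_dir)
    return out_dir


def fetch_split(split: str, root: Path = Path("aqcat_data")) -> Path:
    """Download <split>.tar.gz from the Hub and extract it under root."""
    # Imported lazily so extract_archive works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    archive = hf_hub_download(
        repo_id=REPO_ID, filename=f"{split}.tar.gz", repo_type="dataset"
    )
    return extract_archive(Path(archive), root)
```

Calling `fetch_split("val_id")` requires network access and the login step from Section 2.1.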
### 2.4 Query the Dataset
Use the `query_aqcat.py` script to filter the dataset and extract the specific atomic structures you need. It first queries the metadata on the Hub and then extracts the full structures from your locally downloaded files.
**Example 1: Find all CO and OH structures in the test set:**
```bash
python aqcat25/scripts/query_aqcat.py \
--split test_id \
--adsorbates "*CO" "*OH" \
--data-root ./aqcat_data/test_id
```
**Example 2: Find structures with adsorption energy up to -2.0 eV on magnetic slabs, filtered with `--material-type nonmetal`:**
```bash
python aqcat25/scripts/query_aqcat.py \
--split val_ood_both \
--max-energy -2.0 \
--material-type nonmetal \
--magnetism magnetic \
--data-root ./aqcat_data/val_ood_both \
--output-file low_energy_metals.extxyz
```
**Example 3: Find `*COCH2OH` on slabs containing both Ni and Se, with adsorption energy between -2.5 and -1.5 eV and a Miller index of 011:**
```bash
python aqcat25/scripts/query_aqcat.py \
--split val_ood_ads \
--adsorbates "*COCH2OH" \
--min-energy -2.5 \
--max-energy -1.5 \
--contains-elements "Ni" "Se" \
--element-filter-mode all \
--facet 011 \
--data-root ./aqcat_data/val_ood_ads \
--output-file COCH2OH_on_ni_and_se.extxyz
```
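The element filter in Example 3 can be read as a set operation. The sketch below is one plausible reading of `--element-filter-mode`, not the script's actual code: `all` requires every listed element to be present in the slab, while the `any` mode shown for contrast is an assumption, as are the function name and the example compositions.

```python
def matches_elements(slab_elements, wanted, mode="all"):
    """Return True if the slab composition satisfies the element filter."""
    slab, wanted = set(slab_elements), set(wanted)
    if mode == "all":   # every requested element must be present
        return wanted <= slab
    if mode == "any":   # assumed alternative: at least one element present
        return bool(wanted & slab)
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up compositions for illustration.
print(matches_elements({"Ni", "Se"}, ["Ni", "Se"]))        # True
print(matches_elements({"Ni", "O"}, ["Ni", "Se"]))         # False
print(matches_elements({"Ni", "O"}, ["Ni", "Se"], "any"))  # True
```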
---
## 3. How to Cite
If you use the AQCat25 dataset or the models in your research, please cite the following paper:
```
Omar Allam, Brook Wander, & Aayush R. Singh. (2025). AQCat25: Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis. arXiv preprint arXiv:2510.22938.
```
### BibTeX Entry
```bibtex
@article{allam2025aqcat25,
title={{AQCat25: Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis}},
author={Allam, Omar and Wander, Brook and Singh, Aayush R},
journal={arXiv preprint arXiv:2510.22938},
year={2025},
eprint={2510.22938},
archivePrefix={arXiv},
primaryClass={cond-mat.mtrl-sci}
}
```
Provider: maas
Created: 2025-10-30



