
aqcat25

ModelScope community; updated 2026-01-06; indexed 2025-09-13
Download link:
https://modelscope.cn/datasets/SandboxAQ/aqcat25
Resource description:
<h1 align="center" style="font-size: 36px;"> <span style="color: #FFD700;">AQCat25 Dataset:</span> Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis </h1>

![dataset_schematic](https://cdn-uploads.huggingface.co/production/uploads/67256b7931376d3bacb18de0/W1Orc_AmSgRez5iKH0qjC.jpeg)

This repository contains the **AQCat25 dataset**. AQCat25-EV2 models can be accessed [here](https://huggingface.co/SandboxAQ/aqcat25-ev2).

The AQCat25 dataset provides a large and diverse collection of **13.5 million** DFT calculation trajectories, encompassing approximately 5K materials and 47K intermediate-catalyst systems. It is designed to complement existing large-scale datasets by providing calculations at **higher fidelity** and including critical **spin-polarized** systems, which are essential for accurately modeling many industrially relevant catalysts.

Please see our [website](https://www.sandboxaq.com/aqcat25) and [paper](https://cdn.prod.website-files.com/622a3cfaa89636b753810f04/68ffc1e7c907b6088573ba8c_AQCat25.pdf) for more details about the impact of the dataset and [models](https://huggingface.co/SandboxAQ/aqcat25-ev2).

## 1. AQCat25 Dataset Details

This repository uses a hybrid approach, providing lightweight, queryable Parquet files for each split alongside compressed archives (`.tar.gz`) of the raw ASE database files. More details can be found below.

### Queryable Metadata (Parquet Files)

A set of Parquet files provides a "table of contents" for the dataset. They can be loaded directly with the `datasets` library for fast browsing and filtering. Each file contains the following columns:

| Column Name | Data Type | Description | Example |
| :--- | :--- | :--- | :--- |
| `frame_id` | string | **Unique ID for this dataset**. Formatted as `database_name::index`. | `data.0015.aselmdb::42` |
| `adsorption_energy` | float | **Key Target**. The calculated adsorption energy in eV. | -1.542 |
| `total_energy` | float | The raw total energy of the adslab system from DFT (in eV). | -567.123 |
| `fmax` | float | The maximum force magnitude on any single atom in eV/Å. | 0.028 |
| `is_spin_off` | boolean | `True` if the system is non-magnetic (VASP ISPIN=1). | `false` |
| `mag` | float | The total magnetization of the system (µB). | 32.619 |
| `slab_id` | string | Identifier for the clean slab structure. | `mp-1216478_001_2_False` |
| `adsorbate` | string | SMILES or chemical formula of the adsorbate. | `*NH2N(CH3)2` |
| `is_rerun` | boolean | `True` if the calculation is a continuation. | `false` |
| `is_md` | boolean | `True` if the frame is from a molecular dynamics run. | `false` |
| `sid` | string | The original system ID from the source data. | `vadslabboth_82` |
| `fid` | integer | The original frame index (step number) from the source VASP calculation. | 0 |

---

#### Understanding `frame_id` and `fid`

| Field | Purpose | Example |
| :--- | :--- | :--- |
| `fid` | **Original Frame Index**: This is the step number from the original VASP relaxation (`ionic_steps`). It tells you where the frame came from in its source simulation. | `4` (the 5th frame of a specific VASP run) |
| `frame_id` | **Unique Dataset Pointer**: This is a new ID created for this specific dataset. It tells you exactly which file (`data.0015.aselmdb`) and which row (`101`) to look in to find the full atomic structure. | `data.0015.aselmdb::101` |

---

## Downloadable Data Archives

The full, raw data for each split is available for download in compressed `.tar.gz` archives. The table below provides direct download links. The queryable Parquet files for each split can be loaded directly using the `datasets` library (see the Dataset Usage Guide in Section 2).

The data currently available for download (totaling ~11.1M frames, as listed in the table below) is the initial dataset version (v1.0) released on September 10, 2025. The 13.5M frame count mentioned in our paper and the introduction includes additional data used to rebalance non-magnetic element systems and add a low-fidelity spin-on dataset. These new data splits will be added to this repository soon.

| Split Name | Structures | Archive Size | Download Link |
| :--- | :--- | :--- | :--- |
| ***In-Domain (ID)*** | | | |
| Train | `7,386,750` | `23.8 GB` | [`train_id.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/train_id.tar.gz) |
| Validation | `254,498` | `825 MB` | [`val_id.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_id.tar.gz) |
| Test | `260,647` | `850 MB` | [`test_id.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_id.tar.gz) |
| Slabs | `898,530` | `2.56 GB` | [`id_slabs.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/id_slabs.tar.gz) |
| ***Out-of-Distribution (OOD) Validation*** | | | |
| OOD Ads (Val) | `577,368` | `1.74 GB` | [`val_ood_ads.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_ads.tar.gz) |
| OOD Materials (Val) | `317,642` | `963 MB` | [`val_ood_mat.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_mat.tar.gz) |
| OOD Both (Val) | `294,824` | `880 MB` | [`val_ood_both.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_both.tar.gz) |
| OOD Slabs (Val) | `28,971` | `83 MB` | [`val_ood_slabs.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_slabs.tar.gz) |
| ***Out-of-Distribution (OOD) Test*** | | | |
| OOD Ads (Test) | `346,738` | `1.05 GB` | [`test_ood_ads.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_ads.tar.gz) |
| OOD Materials (Test) | `315,931` | `993 MB` | [`test_ood_mat.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_mat.tar.gz) |
| OOD Both (Test) | `355,504` | `1.1 GB` | [`test_ood_both.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_both.tar.gz) |
| OOD Slabs (Test) | `35,936` | `109 MB` | [`test_ood_slabs.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_slabs.tar.gz) |

---

## 2. Dataset Usage Guide

This guide outlines the recommended workflow for accessing and querying the AQCat25 dataset.

### 2.1 Initial Setup

Before you begin, you need to install the necessary libraries and authenticate with Hugging Face. This is a one-time setup.

```bash
pip install datasets pandas ase tqdm requests huggingface_hub ase-db-backends
```

**1. Create a Hugging Face Account:** If you don't have one, create an account at [huggingface.co](https://huggingface.co/join).

**2. Create an Access Token:** Navigate to your **Settings -> Access Tokens** page or click [here](https://huggingface.co/settings/tokens). Create a new token with at least **`read`** permissions. Copy this token to your clipboard.

**3. Log in via the Command Line:** Open your terminal and run the following command:

```bash
hf auth login
```

### 2.2 Get the Helper Scripts

You may copy the scripts directly from this repository, or download them by running the following in your local Python environment:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="SandboxAQ/aqcat25",
    repo_type="dataset",
    allow_patterns=["scripts/*", "README.md"],
    local_dir="./aqcat25"
)
```

This will create a local folder named `aqcat25` containing the `scripts/` directory.

### 2.3 Download Desired Dataset Splits

Data splits may be downloaded directly via the Hugging Face UI, or via the `download_split.py` script (found in `aqcat25/scripts/`).

```bash
python aqcat25/scripts/download_split.py --split val_id
```

This will download `val_id.tar.gz` and extract it to a new folder named `aqcat_data/val_id/`.
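
Once a split has been extracted, the `frame_id` values from the metadata indicate which database file holds each structure. The following is a minimal sketch, assuming only the `database_name::index` format described in Section 1; the helper names `parse_frame_id` and `group_by_database` are hypothetical and are not part of the repository's scripts. How the index maps onto rows inside an ASE database is not specified here, so the sketch stops at parsing and grouping.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_frame_id(frame_id: str) -> Tuple[str, int]:
    """Split 'database_name::index' into (database_name, index)."""
    # rpartition splits on the last '::'; sep is empty if '::' is absent.
    db_name, sep, index = frame_id.rpartition("::")
    if not sep or not db_name:
        raise ValueError(f"not a valid frame_id: {frame_id!r}")
    return db_name, int(index)


def group_by_database(frame_ids: Iterable[str]) -> Dict[str, List[int]]:
    """Map each database file name to the sorted indices requested from it."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for frame_id in frame_ids:
        db_name, index = parse_frame_id(frame_id)
        groups[db_name].append(index)
    return {name: sorted(indices) for name, indices in groups.items()}


if __name__ == "__main__":
    # Illustrative frame_id values in the documented format.
    wanted = [
        "data.0015.aselmdb::101",
        "data.0015.aselmdb::42",
        "data.0003.aselmdb::7",
    ]
    for db_name, indices in group_by_database(wanted).items():
        print(db_name, indices)
```

Grouping by file first means each database only needs to be opened once when many frames are requested, which matters for splits that span many `.aselmdb` files.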
### 2.4 Query the Dataset

Use the `query_aqcat.py` script to filter the dataset and extract the specific atomic structures you need. It first queries the metadata on the Hub and then extracts the full structures from your locally downloaded files.

**Example 1: Find all CO and OH structures in the test set:**

```bash
python aqcat25/scripts/query_aqcat.py \
    --split test_id \
    --adsorbates "*CO" "*OH" \
    --data-root ./aqcat_data/test_id
```

**Example 2: Find structures on metal slabs with low adsorption energy:**

```bash
python aqcat25/scripts/query_aqcat.py \
    --split val_ood_both \
    --max-energy -2.0 \
    --material-type nonmetal \
    --magnetism magnetic \
    --data-root ./aqcat_data/val_ood_both \
    --output-file low_energy_metals.extxyz
```

**Example 3: Find `*COCH2OH` on slabs containing both Ni and Se, with adsorption energy between -2.5 and -1.5 eV and a Miller index of 011:**

```bash
python aqcat25/scripts/query_aqcat.py \
    --split val_ood_ads \
    --adsorbates "*COCH2OH" \
    --min-energy -2.5 \
    --max-energy -1.5 \
    --contains-elements "Ni" "Se" \
    --element-filter-mode all \
    --facet 011 \
    --data-root ./aqcat_data/val_ood_ads \
    --output-file COCH2OH_on_ni_and_se.extxyz
```

---

## 3. How to Cite

If you use the AQCat25 dataset or the models in your research, please cite the following paper:

```
Omar Allam, Brook Wander, & Aayush R. Singh. (2025). AQCat25: Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis. arXiv preprint arXiv:2510.22938.
```

### BibTeX Entry

```bibtex
@article{allam2025aqcat25,
  title={{AQCat25: Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis}},
  author={Allam, Omar and Wander, Brook and Singh, Aayush R},
  journal={arXiv preprint arXiv:2510.22938},
  year={2025},
  eprint={2510.22938},
  archivePrefix={arXiv},
  primaryClass={cond-mat.mtrl-sci}
}
```

Provider:
maas
Created:
2025-09-10
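Collected summary
Dataset introduction
Background and challenges
Background overview
AQCat25 is a large-scale, high-fidelity computational chemistry dataset containing approximately 13.5 million DFT calculation trajectories, covering about 5,000 materials and 47,000 intermediate-catalyst systems. Its key features are higher-fidelity calculations and the inclusion of spin-polarized systems, which are essential for accurately modeling industrially relevant heterogeneous catalysts. It is intended to support the development of machine learning potentials.
The content above was collected and summarized by 遇见数据集.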