LINF_320009700 | Biochemistry dataset | Parasitology dataset
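China Regional Transportation Network Dataset
This dataset contains transportation network information for regions across China, covering the network structure and connectivity of multiple transport modes, including road, rail, air, and waterway. It records the location of each transport node; the type, length, and capacity of each transport link; and the associated traffic flow information.
Indexed by data.stats.gov.cn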
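China Labor-force Dynamics Survey
The China Labor-force Dynamics Survey (CLDS) is part of the special project "Construction of a Distinctive Social Science Database at Sun Yat-sen University" under phase three of the "985" program. CLDS runs a biennial longitudinal tracking survey of households and individual workers in urban and rural China, using villages and residential communities as the tracking units. It aims to systematically monitor the social structure of these communities and the changes in, and mutual influences among, households and individual workers, and to build a tracking database at three levels (labor force, household, and community), thereby providing basic data for high-quality, empirically oriented theoretical and policy research.
Indexed by the China Academic Survey Data Repository (中国学术调查数据资料库)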
Wind Turbine Data
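This dataset contains wind turbine operating data, including parameters such as wind speed, wind direction, and power output. The data record the operating state of multiple wind turbines at different points in time and are suited to wind energy research and to optimization analysis of wind power generation systems.
Indexed by www.kaggle.com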
CODrone
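CODrone is a comprehensive oriented object detection dataset designed for drones (UAVs) that accurately reflects real-world conditions. It contains extensively annotated images from multiple cities under different lighting conditions, which increases the realism of the benchmark. CODrone comprises more than 10,000 high-resolution images captured during real drone flights over five cities, covering a variety of urban and industrial environments, including ports and wharves. To improve robustness and generalization, it includes images of the same scenes under normal light, low light, and nighttime conditions. We adopted three flight altitudes and two commonly used camera angles, yielding six distinct viewpoint configurations. All images are annotated with oriented bounding boxes for 12 common object categories, totaling more than 590,000 labeled instances. Overall, this work builds a comprehensive dataset and benchmark for oriented object detection in urban drone scenes, intended to meet the needs of research and practical applications in the field.
Indexed by arXiv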
aqcat25
<h1 align="center" style="font-size: 36px;"> <span style="color: #FFD700;">AQCat25 Dataset:</span> Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis </h1>

This repository contains the **AQCat25 dataset**. AQCat25-EV2 models can be accessed [here](https://huggingface.co/SandboxAQ/aqcat25-ev2).

The AQCat25 dataset provides a large and diverse collection of **13.5 million** DFT calculation trajectories, encompassing approximately 5K materials and 47K intermediate-catalyst systems. It is designed to complement existing large-scale datasets by providing calculations at **higher fidelity** and including critical **spin-polarized** systems, which are essential for accurately modeling many industrially relevant catalysts.

Please see our [website](https://www.sandboxaq.com/aqcat25) and [paper](https://cdn.prod.website-files.com/622a3cfaa89636b753810f04/68ffc1e7c907b6088573ba8c_AQCat25.pdf) for more details about the impact of the dataset and [models](https://huggingface.co/SandboxAQ/aqcat25-ev2).

## 1. AQCat25 Dataset Details

This repository uses a hybrid approach, providing lightweight, queryable Parquet files for each split alongside compressed archives (`.tar.gz`) of the raw ASE database files. More details can be found below.

### Queryable Metadata (Parquet Files)

A set of Parquet files provides a "table of contents" for the dataset. They can be loaded directly with the `datasets` library for fast browsing and filtering. Each file contains the following columns:

| Column Name | Data Type | Description | Example |
| :--- | :--- | :--- | :--- |
| `frame_id` | string | **Unique ID for this dataset**. Formatted as `database_name::index`. | `data.0015.aselmdb::42` |
| `adsorption_energy` | float | **Key Target**. The calculated adsorption energy in eV. | -1.542 |
| `total_energy` | float | The raw total energy of the adslab system from DFT (in eV). | -567.123 |
| `fmax` | float | The maximum force magnitude on any single atom in eV/Å. | 0.028 |
| `is_spin_off` | boolean | `True` if the system is non-magnetic (VASP ISPIN=1). | `false` |
| `mag` | float | The total magnetization of the system (µB). | 32.619 |
| `slab_id` | string | Identifier for the clean slab structure. | `mp-1216478_001_2_False` |
| `adsorbate` | string | SMILES or chemical formula of the adsorbate. | `*NH2N(CH3)2` |
| `is_rerun` | boolean | `True` if the calculation is a continuation. | `false` |
| `is_md` | boolean | `True` if the frame is from a molecular dynamics run. | `false` |
| `sid` | string | The original system ID from the source data. | `vadslabboth_82` |
| `fid` | integer | The original frame index (step number) from the source VASP calculation. | 0 |

---

#### Understanding `frame_id` and `fid`

| Field | Purpose | Example |
| :--- | :--- | :--- |
| `fid` | **Original Frame Index**: This is the step number from the original VASP relaxation (`ionic_steps`). It tells you where the frame came from in its source simulation. | `4` (the 5th frame of a specific VASP run) |
| `frame_id` | **Unique Dataset Pointer**: This is a new ID created for this specific dataset. It tells you exactly which file (`data.0015.aselmdb`) and which row (`101`) to look in to find the full atomic structure. | `data.0015.aselmdb::101` |

---

## Downloadable Data Archives

The full, raw data for each split is available for download in compressed `.tar.gz` archives. The table below provides direct download links. The queryable Parquet files for each split can be loaded directly using the `datasets` library as shown in the "Example Usage" section.

The data currently available for download (totaling ~11.1M frames, as listed in the table below) is the initial dataset version (v1.0) released on September 10, 2025.
The 13.5M frame count mentioned in our paper and the introduction includes additional data used to rebalance non-magnetic element systems and add a low-fidelity spin-on dataset. These new data splits will be added to this repository soon.

| Split Name | Structures | Archive Size | Download Link |
| :--- | :--- | :--- | :--- |
| ***In-Domain (ID)*** | | | |
| Train | `7,386,750` | `23.8 GB` | [`train_id.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/train_id.tar.gz) |
| Validation | `254,498` | `825 MB` | [`val_id.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_id.tar.gz) |
| Test | `260,647` | `850 MB` | [`test_id.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_id.tar.gz) |
| Slabs | `898,530` | `2.56 GB` | [`id_slabs.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/id_slabs.tar.gz) |
| ***Out-of-Distribution (OOD) Validation*** | | | |
| OOD Ads (Val) | `577,368` | `1.74 GB` | [`val_ood_ads.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_ads.tar.gz) |
| OOD Materials (Val) | `317,642` | `963 MB` | [`val_ood_mat.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_mat.tar.gz) |
| OOD Both (Val) | `294,824` | `880 MB` | [`val_ood_both.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_both.tar.gz) |
| OOD Slabs (Val) | `28,971` | `83 MB` | [`val_ood_slabs.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_slabs.tar.gz) |
| ***Out-of-Distribution (OOD) Test*** | | | |
| OOD Ads (Test) | `346,738` | `1.05 GB` | [`test_ood_ads.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_ads.tar.gz) |
| OOD Materials (Test) | `315,931` | `993 MB` | [`test_ood_mat.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_mat.tar.gz) |
| OOD Both (Test) | `355,504` | `1.1 GB` | [`test_ood_both.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_both.tar.gz) |
| OOD Slabs (Test) | `35,936` | `109 MB` | [`test_ood_slabs.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_slabs.tar.gz) |

---

## 2. Dataset Usage Guide

This guide outlines the recommended workflow for accessing and querying the AQCat25 dataset.

### 2.1 Initial Setup

Before you begin, you need to install the necessary libraries and authenticate with Hugging Face. This is a one-time setup.

```bash
pip install datasets pandas ase tqdm requests huggingface_hub ase-db-backends
```

**1. Create a Hugging Face Account:** If you don't have one, create an account at [huggingface.co](https://huggingface.co/join).

**2. Create an Access Token:** Navigate to your **Settings -> Access Tokens** page or click [here](https://huggingface.co/settings/tokens). Create a new token with at least **`read`** permissions. Copy this token to your clipboard.

**3. Log in via the Command Line:** Open your terminal and run the following command:

```bash
hf auth login
```

### 2.2 Get the Helper Scripts

You may copy the scripts directly from this repository, or download them by running the following in your local Python environment:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="SandboxAQ/aqcat25",
    repo_type="dataset",
    allow_patterns=["scripts/*", "README.md"],
    local_dir="./aqcat25"
)
```

This will create a local folder named `aqcat25` containing the `scripts/` directory.

### 2.3 Download Desired Dataset Splits

Data splits may be downloaded directly via the Hugging Face UI, or via the `download_split.py` script (found in `aqcat25/scripts/`).

```bash
python aqcat25/scripts/download_split.py --split val_id
```

This will download `val_id.tar.gz` and extract it to a new folder named `aqcat_data/val_id/`.
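As a rough illustration of what this download step involves, the sketch below builds an archive URL from the pattern used in the download table, extracts an already-downloaded archive into `aqcat_data/<split>/`, and splits a `frame_id` of the form `database_name::index` into its two parts. It is not the implementation of `download_split.py`: the function names (`archive_url`, `extract_archive`, `parse_frame_id`) are hypothetical, and nothing is assumed about how files are laid out inside each archive.

```python
import tarfile
from pathlib import Path

# URL prefix copied from the links in the download table.
BASE_URL = "https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main"


def archive_url(archive_stem: str) -> str:
    """Return the direct download URL for an archive stem such as 'val_id'."""
    return f"{BASE_URL}/{archive_stem}.tar.gz"


def extract_archive(archive: Path, data_root: Path, split: str) -> Path:
    """Extract a local .tar.gz into data_root/split and return that folder."""
    target = data_root / split
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, mode="r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters; "data"
            # rejects unsafe members such as absolute paths.
            tar.extractall(target, filter="data")
        else:
            tar.extractall(target)
    return target


def parse_frame_id(frame_id: str) -> tuple[str, int]:
    """Split 'database_name::index' into (database file name, row index)."""
    db_name, sep, index = frame_id.rpartition("::")
    if not sep or not db_name:
        raise ValueError(f"not a valid frame_id: {frame_id!r}")
    return db_name, int(index)
```

Note that the archive stems follow the table rather than a single naming rule: the in-domain slab archive is `id_slabs`, while the out-of-distribution ones are `val_ood_slabs` and `test_ood_slabs`.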
### 2.4 Query the Dataset

Use the `query_aqcat.py` script to filter the dataset and extract the specific atomic structures you need. It first queries the metadata on the Hub and then extracts the full structures from your locally downloaded files.

**Example 1: Find all CO and OH structures in the test set:**

```bash
python aqcat25/scripts/query_aqcat.py \
  --split test_id \
  --adsorbates "*CO" "*OH" \
  --data-root ./aqcat_data/test_id
```

**Example 2: Find structures on metal slabs with low adsorption energy:**

```bash
python aqcat25/scripts/query_aqcat.py \
  --split val_ood_both \
  --max-energy -2.0 \
  --material-type nonmetal \
  --magnetism magnetic \
  --data-root ./aqcat_data/val_ood_both \
  --output-file low_energy_metals.extxyz
```

**Example 3: Find `*COCH2OH` on slabs containing both Ni and Se, with adsorption energy between -2.5 and -1.5 eV and a Miller index of 011:**

```bash
python aqcat25/scripts/query_aqcat.py \
  --split val_ood_ads \
  --adsorbates "*COCH2OH" \
  --min-energy -2.5 \
  --max-energy -1.5 \
  --contains-elements "Ni" "Se" \
  --element-filter-mode all \
  --facet 011 \
  --data-root ./aqcat_data/val_ood_ads \
  --output-file COCH2OH_on_ni_and_se.extxyz
```

---

## 3. How to Cite

If you use the AQCat25 dataset or the models in your research, please cite the following paper:

```
Omar Allam, Brook Wander, & Aayush R. Singh. (2025). AQCat25: Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis. arXiv preprint arXiv:2510.22938.
```

### BibTeX Entry

```bibtex
@article{allam2025aqcat25,
  title={{AQCat25: Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis}},
  author={Allam, Omar and Wander, Brook and Singh, Aayush R},
  journal={arXiv preprint arXiv:2510.22938},
  year={2025},
  eprint={2510.22938},
  archivePrefix={arXiv},
  primaryClass={cond-mat.mtrl-sci}
}
```
Indexed by ModelScope (魔搭社区)
