Orbital-Materials/MofasaDB

Source: Hugging Face (updated 2025-12-01, indexed 2025-12-20)
Download link:
https://hf-mirror.com/datasets/Orbital-Materials/MofasaDB
Description:
---
license: cc-by-4.0
tags:
- chemistry
- materials
- diffusion
- synthetic
pretty_name: MofasaDB
size_categories:
- 100K<n<1M
---

# MofasaDB

MofasaDB is a publicly available dataset containing 200,000+ _de novo_ generated MOF (Metal-Organic Framework) structures produced by Mofasa, which was trained on QMOF structures of up to 170 atoms, along with their geometry-relaxed counterparts. The database is released alongside the paper [Mofasa: A Step Change in Metal-Organic Framework Generation](https://mofux.ai/MOFASA.pdf). A user-friendly web interface for search and discovery can be accessed at [https://mofux.ai/](https://mofux.ai/).

---

## Table of Contents

1. [Database Overview](#database-overview)
2. [Quick Start](#quick-start)
3. [Property Reference](#property-reference)
4. [Structural Properties](#structural-properties)
5. [MOFID Properties](#mofid-properties)
6. [Crystal Symmetry](#crystal-symmetry)
7. [Zeo++ Geometric Properties](#zeo-geometric-properties)
8. [ORB Properties](#orb-properties)
9. [MOFChecker Properties](#mofchecker-properties)
10. [MOF Fragment Properties](#mof-fragment-properties)
11. [Linker Properties](#linker-properties)
12. [Validation Metrics](#validation-metrics)

---

## Database Overview

The database contains unconditionally generated MOF structures from Mofasa, along with their geometry-relaxed counterparts.

### Files

| File | Description |
|------|-------------|
| `samples.db` | Original generated MOF structures |
| `relaxed.db` | Geometry-relaxed versions of the samples |
| `sample_latents/` | ORB latent embeddings for samples |
| `relaxed_latents/` | ORB latent embeddings for relaxed structures |

### Data Alignment

The databases are **row-aligned**: row *i* in `samples.db` corresponds to row *i* in `relaxed.db`.

**Indexing:**

- ASE databases are **1-indexed**: the first row is `db.get(1)`
- NumPy arrays are **0-indexed**: the first element is `array[0]`
- Therefore: `latent[i]` corresponds to `db.get(i + 1)`

---

## Quick Start

### Load a Structure

```python
from ase.db import connect

db = connect("samples.db")
row_id = 1
row = db.get(row_id)   # Get first structure (1-indexed)
atoms = row.toatoms()  # Convert to ASE Atoms object
print(atoms.get_chemical_formula())
```

### Access Properties

```python
# `row` is the database row loaded in the previous snippet

# Get energy per atom
energy = row.data['properties']['orb_properties']['orb_energy_per_atom']

# Get largest cavity diameter (LCD)
lcd = row.data['properties']['pyzeo_geometric_properties']['lcd']

# Get topology (top-level property)
topology = row.data['topology']
```

### Load ORB Latent Embeddings

```python
import numpy as np

row_id = 1  # 1-indexed ASE row ID
latents = np.load("sample_latents/orb_latent_4_graph.npy")
latent = latents[row_id - 1]  # Convert 1-indexed row to 0-indexed array
```

### Compare Sample and Relaxed

```python
from ase.db import connect

sample_db = connect("samples.db")
relaxed_db = connect("relaxed.db")

# Row i in both databases corresponds to the same structure
row_id = 100
sample_atoms = sample_db.get(row_id).toatoms()
relaxed_atoms = relaxed_db.get(row_id).toatoms()

print(f"Sample formula: {sample_atoms.get_chemical_formula()}")
print(f"Relaxed formula: {relaxed_atoms.get_chemical_formula()}")
```

### Handle Missing Data

Not all properties are available for every structure. Common causes include:

- **MOFID failure**: If MOFID cannot identify the MOF building blocks (nodes, linkers, topology), these properties are set to `"UNKNOWN"` or `"ERROR"`, and missing SMILES strings are stored as empty lists.
- **Zeo++ non-porous**: If Zeo++ determines a structure has insufficient porosity for probe access, geometric properties (`lcd`, `pld`, accessible volume/surface area) may be missing, zero, or `None`.
- **Component absence**: Latent embeddings for `bound_solvent` and `free_solvent` are zero vectors when structures contain no solvent molecules.
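
The missing-data cases above can be handled with a small defensive accessor. The following is a minimal sketch, not part of the dataset tooling: `get_property` and `MISSING_SENTINELS` are hypothetical names, and `example_data` is a hand-written stand-in for `row.data`. It treats absent keys, `None`, the `"UNKNOWN"`/`"ERROR"` strings, and empty lists as missing. Zero values are returned unchanged, because a zero may be a legitimate measurement.

```python
from typing import Any

# Sentinel strings that the section above lists for failed MOFID runs.
MISSING_SENTINELS = {"UNKNOWN", "ERROR"}


def get_property(data: dict, *keys: str, default: Any = None) -> Any:
    """Walk nested dict keys; map absent or sentinel values to `default`."""
    current: Any = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    if current is None:
        return default
    if isinstance(current, str) and current in MISSING_SENTINELS:
        return default
    if isinstance(current, list) and len(current) == 0:
        return default
    return current


# Hand-written stand-in for row.data (assumption: real rows share this nesting).
example_data = {
    "topology": "UNKNOWN",
    "smiles_linkers": [],
    "properties": {"pyzeo_geometric_properties": {"lcd": None, "pld": 4.2}},
}

print(get_property(example_data, "topology"))
print(get_property(example_data, "properties", "pyzeo_geometric_properties", "pld"))
```

Passing each key as a separate argument, rather than splitting one dotted string, also keeps keys that themselves contain a dot (such as `symprec_0.01/spacegroup_number`) intact.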

---

## Property Reference

Properties are stored in `row.data` with nested paths. Some examples:

```python
PROPERTY_PATHS = {
    # ORB model properties
    'orb_energy_per_atom': 'properties.orb_properties.orb_energy_per_atom',
    'orb_max_force': 'properties.orb_properties.orb_max_force',

    # Zeo++ geometric properties
    'lcd': 'properties.pyzeo_geometric_properties.lcd',
    'pld': 'properties.pyzeo_geometric_properties.pld',
    'dif': 'properties.pyzeo_geometric_properties.dif',
    'av_volume_fraction': 'properties.pyzeo_geometric_properties.av_volume_fraction',
    'av_cm3_per_g': 'properties.pyzeo_geometric_properties.av_cm3_per_g',
    'nav_volume_fraction': 'properties.pyzeo_geometric_properties.nav_volume_fraction',
    'asa_m2_per_g': 'properties.pyzeo_geometric_properties.asa_m2_per_g',
    'number_of_channels': 'properties.pyzeo_geometric_properties.number_of_channels',
    'number_of_pockets': 'properties.pyzeo_geometric_properties.number_of_pockets',

    # Crystal symmetry (note: the final key itself contains a '.',
    # e.g. 'symprec_0.01/spacegroup_number')
    'spacegroup_number': 'properties.crystal_symmetry.symprec_0.01/spacegroup_number',
    'pointgroup': 'properties.crystal_symmetry.symprec_0.01/pointgroup',

    # MOFID properties
    'mofid': 'mofid',
    'mofkey': 'mofkey',
    'topology': 'topology',
    'smiles_nodes': 'smiles_nodes',
    'smiles_linkers': 'smiles_linkers',
    'cat': 'cat',

    # MOFChecker
    'mofchecker': 'properties.mofchecker',
    'mofchecker_valid': 'properties.mofchecker.mofchecker_valid',
}
```

---

## Structural Properties

### Lattice Parameters

| Key | Type | Description |
|-----|------|-------------|
| `lattice_a` | float | Unit cell length along the *a*-axis (Å) |
| `lattice_b` | float | Unit cell length along the *b*-axis (Å) |
| `lattice_c` | float | Unit cell length along the *c*-axis (Å) |
| `lattice_alpha` | float | Angle between *b* and *c* axes (degrees) |
| `lattice_beta` | float | Angle between *a* and *c* axes (degrees) |
| `lattice_gamma` | float | Angle between *a* and *b* axes (degrees) |

### Chemical Composition

| Key | Type | Description |
|-----|------|-------------|
| `reduced_formula` | str | Empirical (reduced) chemical formula of the structure |

---

## MOFID Properties

MOFID is a standardized identifier for MOF structures that encodes topology, nodes, linkers, and catenation information.

| Key | Type | Description |
|-----|------|-------------|
| `mofid` | str | Full MOFID identifier string. Format: `{nodes}.{linkers} MOFid-v1.{topology}.cat{n}`. |
| `mofkey` | str | MOFKey identifier (a hash-based representation of the MOF structure). Format: `{hash}.{topology}.MOFkey-v1.{short_code}`. |
| `smiles_nodes` | str | Concatenated SMILES strings of all distinct metal nodes (`.`-separated). |
| `smiles_linkers` | str | Concatenated SMILES strings of all distinct organic linkers (`.`-separated). |
| `topology` | str | Three-letter RCSR topology code (e.g., `"pcu"`, `"dia"`, `"fcu"`). |
| `topology_v2` | str | Alternative topology assignment (may differ from the primary assignment if ambiguous). |
| `cat` | int | Catenation number (degree of interpenetration). 0 = non-catenated, n = n-fold catenated. |

---

## Crystal Symmetry

Computed using [pymatgen](https://pymatgen.org/)'s `SpacegroupAnalyzer`.

| Key | Type | Description |
|-----|------|-------------|
| `spacegroup` | str | Crystal system from space group analysis at `symprec=0.01` (e.g., `"cubic"`, `"triclinic"`) |
| `spacegroup_v2` | str | Crystal system from space group analysis at `symprec=0.1` (more tolerant symmetry detection) |

### Detailed Crystal Symmetry (nested under `properties.crystal_symmetry`)

| Key | Type | Description |
|-----|------|-------------|
| `symprec_0.01/pointgroup` | str | Point group symbol (Hermann-Mauguin notation) |
| `symprec_0.01/spacegroup` | str | Space group symbol (Hermann-Mauguin notation) |
| `symprec_0.01/spacegroup_number` | int | International Tables space group number (1-230) |
| `symprec_0.01/spacegroup_crystal` | str | Crystal system name |
| `symprec_0.1/pointgroup` | str | Point group symbol (at looser tolerance) |
| `symprec_0.1/spacegroup` | str | Space group symbol (at looser tolerance) |
| `symprec_0.1/spacegroup_number` | int | Space group number (at looser tolerance) |
| `symprec_0.1/spacegroup_crystal` | str | Crystal system name (at looser tolerance) |

---

## Zeo++ Geometric Properties

Computed using [Zeo++](http://www.zeoplusplus.org/) via the pyzeo wrapper. These properties characterize the pore geometry and accessibility using a spherical probe (default: N₂ probe radius of 1.86 Å).
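
As a rough illustration of how the probe radius relates to the stored `pld` value, the sketch below applies a simple geometric rule of thumb: a spherical probe can pass the narrowest window only if the pore limiting diameter is at least twice the probe radius. This is an assumption made for screening purposes, not a description of how Zeo++ itself decides accessibility, and `probe_can_percolate` is a hypothetical helper name.

```python
from typing import Optional

N2_PROBE_RADIUS = 1.86  # Å, the default probe radius quoted above


def probe_can_percolate(pld: Optional[float],
                        probe_radius: float = N2_PROBE_RADIUS) -> bool:
    """Return True when a spherical probe fits through the narrowest window.

    `pld` is the pore limiting diameter in Å; None (missing data) is
    treated as not accessible.
    """
    if pld is None:
        return False
    return pld >= 2.0 * probe_radius


print(probe_can_percolate(4.0))   # 4.0 Å window vs. 3.72 Å probe diameter
print(probe_can_percolate(None))  # missing PLD
```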

### Pore Descriptors

| Key | Type | Unit | Description |
|-----|------|------|-------------|
| `lcd` | float | Å | **Largest Cavity Diameter**: diameter of the largest sphere that can fit in the pore without overlapping framework atoms |
| `pld` | float | Å | **Pore Limiting Diameter**: diameter of the largest sphere that can percolate through the framework (i.e., the narrowest point along the largest channel) |
| `dif` | float | Å | **Diameter of Included sphere along Free path**: diameter of the largest sphere that can diffuse along the accessible path |
| `number_of_channels` | int | - | Number of distinct connected channel systems in the framework |
| `number_of_pockets` | int | - | Number of isolated pores (inaccessible to the probe molecule) |

### Volume Properties

| Key | Type | Unit | Description |
|-----|------|------|-------------|
| `av_volume_fraction` | float | - | Fraction of unit cell volume that is accessible to the probe |
| `av_cm3_per_g` | float | cm³/g | Accessible pore volume per gram of framework |
| `nav_volume_fraction` | float | - | Fraction of unit cell volume that is non-accessible (pocket volume) |
| `nav_cm3_per_g` | float | cm³/g | Non-accessible volume per gram of framework |
| `channel_volume_fraction` | float | - | Fraction of total void volume that belongs to channels |
| `pocket_volume_fraction` | float | - | Fraction of total void volume that belongs to pockets |

### Surface Area Properties

| Key | Type | Unit | Description |
|-----|------|------|-------------|
| `asa_m2_per_cm3` | float | m²/cm³ | Accessible surface area per unit volume |
| `asa_m2_per_g` | float | m²/g | **Accessible Surface Area** per gram (comparable to BET surface area) |
| `nasa_m2_per_cm3` | float | m²/cm³ | Non-accessible surface area per unit volume |
| `nasa_m2_per_g` | float | m²/g | Non-accessible surface area per gram |
| `channel_surface_area_fraction` | float | - | Fraction of total surface area belonging to channels |
| `pocket_surface_area_fraction` | float | - | Fraction of total surface area belonging to pockets |

---

## ORB Properties

Properties computed using the [ORB](https://github.com/orbital-materials/orb-models) machine-learned interatomic potential.

### Energy and Forces

| Key | Type | Unit | Description |
|-----|------|------|-------------|
| `orb_energy_per_atom` | float | eV/atom | Total predicted potential energy divided by number of atoms |
| `orb_max_force` | float | eV/Å | Maximum force magnitude on any atom in the structure |

### ORB Latent Embeddings

ORB latent embeddings are stored as NumPy files in the `sample_latents/` and `relaxed_latents/` directories.

**File naming:** `orb_latent_{layer}_{component}.npy`

| File Pattern | Shape | Description |
|--------------|-------|-------------|
| `orb_latent_{0-4}_graph` | (N, 256) | Graph-level pooled latent |
| `orb_latent_{0-4}_nodes_and_bridges` | (N, 256) | Mean-pooled over metal nodes |
| `orb_latent_{0-4}_linkers` | (N, 256) | Mean-pooled over organic linkers |
| `orb_latent_{0-4}_bound_solvent` | (N, 256) | Mean-pooled over bound solvents |
| `orb_latent_{0-4}_free_solvent` | (N, 256) | Mean-pooled over free solvents |

- Layers 0-4 correspond to different depths in the ORB GNN (layer 4 = final layer)
- **Zero vectors** indicate missing data (e.g., structures without solvents)

---

## MOFChecker Properties

Computed using [MOFChecker](https://github.com/kjappelbaum/mofchecker), a tool for validating MOF structures. All keys are prefixed with `mofchecker_`.

### Validity Checks (Binary)

These descriptors are used to determine overall MOF validity. **True indicates a problem** (except where noted).

| Key | Type | Description |
|-----|------|-------------|
| `mofchecker_valid` | bool | Overall validity flag. `True` if structure passes all validity checks. |
| `mofchecker_no_carbon` | bool | `True` if structure contains no carbon atoms (invalid for organic-based MOFs) |
| `mofchecker_no_hydrogen` | bool | `True` if structure contains no hydrogen atoms |
| `mofchecker_no_metal` | bool | `True` if structure contains no metal atoms |
| `mofchecker_has_atomic_overlaps` | bool | `True` if any atoms are too close together |
| `mofchecker_has_lone_molecule` | bool | `True` if structure contains disconnected molecular fragments |
| `mofchecker_has_overcoordinated_c` | bool | `True` if any carbon has too many bonds |
| `mofchecker_has_overcoordinated_n` | bool | `True` if any nitrogen has too many bonds |
| `mofchecker_has_overcoordinated_h` | bool | `True` if any hydrogen has too many bonds |
| `mofchecker_has_undercoordinated_c` | bool | `True` if any carbon has too few bonds |
| `mofchecker_has_undercoordinated_n` | bool | `True` if any nitrogen has too few bonds |
| `mofchecker_has_undercoordinated_rare_earth` | bool | `True` if any rare earth metal is undercoordinated |
| `mofchecker_has_undercoordinated_alkali_alkaline` | bool | `True` if any alkali/alkaline earth metal is undercoordinated |
| `mofchecker_has_suspicious_terminal_oxo` | bool | `True` if structure has potentially incorrect terminal oxo groups on metals |
| `mofchecker_has_geometrically_exposed_metal` | bool | `True` if any metal has unusual coordination geometry |
| `mofchecker_has_high_charges` | bool | `True` if computed partial charges are unusually high |

### Informative Checks (Binary, not used for validity)

| Key | Type | Description |
|-----|------|-------------|
| `mofchecker_has_oms` | bool | `True` if structure has Open Metal Sites (coordinatively unsaturated metals) |
| `mofchecker_has_3d_connected_graph` | bool | `True` if the framework is 3D-connected (expected for MOFs) |

### Structure Hashes

| Key | Type | Description |
|-----|------|-------------|
| `mofchecker_graph_hash` | str | Hash of the full structure graph (atoms + bonds) |
| `mofchecker_undecorated_graph_hash` | str | Hash of graph with hydrogen atoms removed |
| `mofchecker_decorated_scaffold_hash` | str | Hash of framework scaffold with decorations |
| `mofchecker_undecorated_scaffold_hash` | str | Hash of bare framework scaffold |
| `mofchecker_symmetry_hash` | str | Hash encoding symmetry information |

---

## MOF Fragment Properties

Properties of the decomposed MOF components (nodes, linkers, solvents). Stored under `properties.mof_fragments`.

### Component Types

MOF structures are decomposed into four component types:

- **nodes_and_bridges**: Metal nodes and bridging groups
- **linkers**: Organic linker molecules
- **bound_solvent**: Solvent molecules coordinated to metal centers
- **free_solvent**: Unbound solvent molecules in pores

### Fragment Formulas

| Key | Type | Description |
|-----|------|-------------|
| `{component}_formulas` | List[str] | Chemical formulas of each fragment of this component type |

*Example: `nodes_and_bridges_formulas = ["Zn4O", "Zn4O"]` for a structure with two identical zinc nodes*

### Linker SMILES

| Key | Type | Description |
|-----|------|-------------|
| `linkers_smiles` | List[str] | Full SMILES strings for each linker fragment, including stereochemistry and charges where applicable |
| `linkers_simple_smiles` | List[str] | Simplified SMILES (scaffold only, no stereochemistry). More robust for parsing but less chemically accurate |

---

## Linker Properties

Molecular descriptors and fingerprints for organic linker molecules. Stored under `properties.linker_properties`.

### Morgan Fingerprints

Morgan (circular) fingerprints are stored as NumPy files. For similarity search, use the standardized versions.
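
Tanimoto similarity is a common choice for comparing fingerprint vectors of this kind. The sketch below is an illustration rather than the dataset authors' method: `tanimoto` is a hypothetical helper, the two vectors are synthetic, and the on-disk layout of the `.npy` files (for example, one 2048-length vector per linker) is an assumption that should be checked against the actual arrays. Counts greater than zero are treated as set bits.

```python
import numpy as np


def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint vectors.

    Any non-zero entry counts as a set bit; two all-zero vectors give 0.0.
    """
    a_bits = np.asarray(a) > 0
    b_bits = np.asarray(b) > 0
    union = np.logical_or(a_bits, b_bits).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a_bits, b_bits).sum() / union)


# Synthetic 2048-bit fingerprints standing in for rows of a fingerprint file.
fp_a = np.zeros(2048, dtype=np.uint8)
fp_a[[1, 5, 9]] = 1
fp_b = np.zeros(2048, dtype=np.uint8)
fp_b[[5, 9, 20]] = 1

print(tanimoto(fp_a, fp_b))  # 2 shared bits out of 4 set bits -> 0.5
```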

| File | Description |
|------|-------------|
| `linkers_morgan_ecfp4.npy` | ECFP4 (radius=2), 2048-bit |
| `linkers_morgan_ecfp6.npy` | ECFP6 (radius=3), 2048-bit |
| `linkers_morgan_ecfp4_standardized.npy` | ECFP4 from standardized molecules |
| `linkers_morgan_ecfp6_standardized.npy` | ECFP6 from standardized molecules |

**Scalar metadata:**

| Key | Type | Description |
|-----|------|-------------|
| `linkers_smiles_used` | List[str] | Which SMILES string was successfully parsed for each linker (original, fixed, or simple) |
| `linkers_smiles_standardized` | List[str] | Chemically standardized SMILES (neutralized, canonical tautomer) |
| `linkers_morgan_count_sum` | List[int] | Sum of Morgan fingerprint bit counts (molecular complexity proxy) |
| `linkers_morgan_count_sum_max` | List[int] | Maximum count in Morgan fingerprint (indicates highly represented substructures) |
| `linkers_morgan_count_sum_standardized` | List[int] | Sum of counts for standardized fingerprints |
| `linkers_morgan_count_sum_max_standardized` | List[int] | Maximum count for standardized fingerprints |

### Molecular Descriptors

Computed on standardized molecules using RDKit.

| Key | Type | Description |
|-----|------|-------------|
| `linkers_rotatable_bonds` | List[int] | Number of rotatable bonds per linker (flexibility metric) |
| `linkers_ring_count` | List[int] | Number of rings per linker |

### Coordination Site Descriptors

Counts of metal-coordinating functional groups (computed on as-parsed molecules).

| Key | Type | Description |
|-----|------|-------------|
| `linkers_coordination_site_count` | List[int] | Total number of potential metal coordination sites per linker |
| `linkers_coordination_site_breakdown` | List[Dict] | Breakdown by coordination site type |
| `linkers_carboxylate_count` | List[int] | Number of carboxylate groups (-COO⁻/-COOH) |
| `linkers_pyridine_count` | List[int] | Number of aromatic nitrogen sites |
| `linkers_imidazole_n_count` | List[int] | Number of imidazole/triazole NH groups |
| `linkers_primary_amine_count` | List[int] | Number of primary amine groups (-NH₂) |
| `linkers_secondary_amine_count` | List[int] | Number of secondary amine groups (-NH-) |
| `linkers_tertiary_amine_count` | List[int] | Number of tertiary amine groups (-N<) |
| `linkers_phosphonate_count` | List[int] | Number of phosphonate groups |
| `linkers_sulfonate_count` | List[int] | Number of sulfonate groups |
| `linkers_phenolic_oh_count` | List[int] | Number of phenolic hydroxyl groups |
| `linkers_alcoholic_oh_count` | List[int] | Number of alcoholic hydroxyl groups |
| `linkers_thiol_count` | List[int] | Number of thiol groups (-SH) |
| `linkers_nitrile_count` | List[int] | Number of nitrile groups (-C≡N) |

---

## Validation Metrics

Binary metrics used to assess structure quality.

| Key | Type | Description |
|-----|------|-------------|
| `no_atom_too_close` | bool | `True` if all interatomic distances are physically reasonable |
| `smact_valid` | bool | `True` if composition passes SMACT electronegativity/charge balance checks |
| `reconstruction_failed` | bool | `True` if structure reconstruction from latent space failed |

---

## License

[CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/deed.en)

## References

- **MOFID**: Bucior, B. J., et al. (2019). [Identification Schemes for Metal-Organic Frameworks...](https://doi.org/10.1021/acs.cgd.9b00050)
- **Zeo++**: Willems, T. F., et al. (2012). [Algorithms and tools for high-throughput geometry-based analysis...](https://doi.org/10.1016/j.micromeso.2011.08.020)
- **MOFChecker**: Ongari, D., et al. (2019). [Building a Consistent and Reproducible Database for Adsorption Evaluation...](https://doi.org/10.1021/acscentsci.9b00619)
- **QMOF**: Rosen, A. S., et al. (2021). [Machine learning the quantum-chemical properties of metal–organic frameworks for accelerated materials discovery](https://doi.org/10.1016/j.matt.2021.02.015). The corresponding dataset release is on [GitHub](https://github.com/Andrew-S-Rosen/QMOF).
- **ORB**: [Orbital ORB v3 Force Field](https://github.com/orbital-materials/orb-models)
- **RDKit Morgan Fingerprints**: [RDKit Documentation](https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints)
Provider:
Orbital-Materials