Orbital-Materials/MofasaDB
Source: Hugging Face. Updated 2025-12-01; indexed 2025-12-20.
Download link: https://hf-mirror.com/datasets/Orbital-Materials/MofasaDB
---
license: cc-by-4.0
tags:
- chemistry
- materials
- diffusion
- synthetic
pretty_name: MofasaDB
size_categories:
- 100K<n<1M
---
# MofasaDB
MofasaDB is a publicly available dataset of 200,000+ _de novo_ MOF (Metal-Organic Framework) structures generated by Mofasa, which was trained on QMOF (up to 170 atoms), along with their geometry-relaxed counterparts. The database is released alongside the paper [Mofasa: A Step Change in Metal-Organic Framework Generation](https://mofux.ai/MOFASA.pdf). A user-friendly web interface for search and discovery is available at [https://mofux.ai/](https://mofux.ai/).
---
## Table of Contents
1. [Database Overview](#database-overview)
2. [Quick Start](#quick-start)
3. [Property Reference](#property-reference)
4. [Structural Properties](#structural-properties)
5. [MOFID Properties](#mofid-properties)
6. [Crystal Symmetry](#crystal-symmetry)
7. [Zeo++ Geometric Properties](#zeo-geometric-properties)
8. [ORB Properties](#orb-properties)
9. [MOFChecker Properties](#mofchecker-properties)
10. [MOF Fragment Properties](#mof-fragment-properties)
11. [Linker Properties](#linker-properties)
12. [Validation Metrics](#validation-metrics)
---
## Database Overview
The database contains unconditionally generated MOF structures from Mofasa, along with their geometry-relaxed counterparts.
### Files
| File | Description |
|------|-------------|
| `samples.db` | Original generated MOF structures |
| `relaxed.db` | Geometry-relaxed versions of the samples |
| `sample_latents/` | ORB latent embeddings for samples |
| `relaxed_latents/` | ORB latent embeddings for relaxed structures |
### Data Alignment
The databases are **row-aligned**: row *i* in `samples.db` corresponds to row *i* in `relaxed.db`.
**Indexing:**
- ASE databases are **1-indexed**: first row is `db.get(1)`
- NumPy arrays are **0-indexed**: first element is `array[0]`
- Therefore: `latent[i]` corresponds to `db.get(i + 1)`
---
## Quick Start
### Load a Structure
```python
from ase.db import connect
db = connect("samples.db")
row_id = 1
row = db.get(row_id) # Get first structure (1-indexed)
atoms = row.toatoms() # Convert to ASE Atoms object
print(atoms.get_chemical_formula())
```
### Access Properties
```python
# Get energy per atom
energy = row.data['properties']['orb_properties']['orb_energy_per_atom']
# Get pore diameter
lcd = row.data['properties']['pyzeo_geometric_properties']['lcd']
# Get topology (top-level property)
topology = row.data['topology']
```
### Load ORB Latent Embeddings
```python
import numpy as np
latents = np.load("sample_latents/orb_latent_4_graph.npy")
latent = latents[row_id - 1] # Convert 1-indexed row to 0-indexed array
```
### Compare Sample and Relaxed
```python
from ase.db import connect

sample_db = connect("samples.db")
relaxed_db = connect("relaxed.db")
# Row i in both databases corresponds to the same structure
row_id = 100
sample_atoms = sample_db.get(row_id).toatoms()
relaxed_atoms = relaxed_db.get(row_id).toatoms()
print(f"Sample formula: {sample_atoms.get_chemical_formula()}")
print(f"Relaxed formula: {relaxed_atoms.get_chemical_formula()}")
```
### Handle Missing Data
Not all properties are available for every structure. Common causes include:
- **MOFID failure**: If MOFID cannot identify the MOF building blocks (nodes, linkers, topology), the affected properties are set to `"UNKNOWN"` or `"ERROR"`, and missing SMILES strings are stored as empty lists.
- **Zeo++ non-porous**: If Zeo++ determines a structure has insufficient porosity for probe access, geometric properties (`lcd`, `pld`, accessible volume/surface area) may be missing, zero, or `None`.
- **Component absence**: Latent embeddings for `bound_solvent` and `free_solvent` are zero vectors when structures contain no solvent molecules.
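
The cases above can be handled with a small defensive lookup. The following is a minimal sketch, assuming `row.data` behaves like a nested mapping; the helper name `get_property` and the `MISSING_SENTINELS` set are illustrative and not part of the dataset or of ASE.

```python
from collections.abc import Mapping
from typing import Any

# Sentinel strings that the text above says a MOFID failure may produce.
MISSING_SENTINELS = {"UNKNOWN", "ERROR"}


def get_property(data: Mapping, *keys: str, default: Any = None) -> Any:
    """Walk nested keys and return `default` when the value is missing.

    A value counts as missing if a key is absent, the value is None,
    it is one of the sentinel strings, or it is an empty list/tuple.
    Zero is NOT treated as missing, because Zeo++ may legitimately report 0.
    """
    current: Any = data
    for key in keys:
        if not isinstance(current, Mapping) or key not in current:
            return default
        current = current[key]
    if current is None:
        return default
    if isinstance(current, str) and current in MISSING_SENTINELS:
        return default
    if isinstance(current, (list, tuple)) and len(current) == 0:
        return default
    return current
```

With a loaded row, a call such as `get_property(row.data, "properties", "pyzeo_geometric_properties", "lcd")` would return `None` instead of raising `KeyError` when the value is absent.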
---
## Property Reference
Properties are stored in `row.data` with nested paths. Some examples:
```python
PROPERTY_PATHS = {
    # ORB model properties
    'orb_energy_per_atom': 'properties.orb_properties.orb_energy_per_atom',
    'orb_max_force': 'properties.orb_properties.orb_max_force',
    # Zeo++ geometric properties
    'lcd': 'properties.pyzeo_geometric_properties.lcd',
    'pld': 'properties.pyzeo_geometric_properties.pld',
    'dif': 'properties.pyzeo_geometric_properties.dif',
    'av_volume_fraction': 'properties.pyzeo_geometric_properties.av_volume_fraction',
    'av_cm3_per_g': 'properties.pyzeo_geometric_properties.av_cm3_per_g',
    'nav_volume_fraction': 'properties.pyzeo_geometric_properties.nav_volume_fraction',
    'asa_m2_per_g': 'properties.pyzeo_geometric_properties.asa_m2_per_g',
    'number_of_channels': 'properties.pyzeo_geometric_properties.number_of_channels',
    'number_of_pockets': 'properties.pyzeo_geometric_properties.number_of_pockets',
    # Crystal symmetry
    'spacegroup_number': 'properties.crystal_symmetry.symprec_0.01/spacegroup_number',
    'pointgroup': 'properties.crystal_symmetry.symprec_0.01/pointgroup',
    # MOFID properties
    'mofid': 'mofid',
    'mofkey': 'mofkey',
    'topology': 'topology',
    'smiles_nodes': 'smiles_nodes',
    'smiles_linkers': 'smiles_linkers',
    'cat': 'cat',
    # MOFChecker
    'mofchecker': 'properties.mofchecker',
    'mofchecker_valid': 'properties.mofchecker.mofchecker_valid',
}
```
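
Dotted paths like these cannot be resolved by a plain `path.split('.')`, because the crystal symmetry keys (for example `symprec_0.01/spacegroup_number`) contain a dot themselves. The sketch below is one possible resolver. It assumes those keys are stored literally under `properties.crystal_symmetry`, as the Crystal Symmetry section suggests; `resolve_path` is a hypothetical helper, not an API shipped with the dataset.

```python
from collections.abc import Mapping
from typing import Any


def resolve_path(data: Mapping, path: str, default: Any = None) -> Any:
    """Resolve a dotted path, allowing keys that themselves contain dots.

    At each level the longest run of remaining segments that matches an
    existing key is consumed first, so 'symprec_0.01/pointgroup' is found
    even though it is split into two segments by '.'.
    """
    parts = path.split(".")
    current: Any = data
    i = 0
    while i < len(parts):
        if not isinstance(current, Mapping):
            return default
        for j in range(len(parts), i, -1):
            key = ".".join(parts[i:j])
            if key in current:
                current = current[key]
                i = j
                break
        else:
            # No candidate key matched at this level.
            return default
    return current
```

For a loaded row, `resolve_path(row.data, PROPERTY_PATHS['spacegroup_number'])` would then return the stored value, or `None` if any level is missing.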
---
## Structural Properties
### Lattice Parameters
| Key | Type | Description |
|-----|------|-------------|
| `lattice_a` | float | Unit cell length along the *a*-axis (Å) |
| `lattice_b` | float | Unit cell length along the *b*-axis (Å) |
| `lattice_c` | float | Unit cell length along the *c*-axis (Å) |
| `lattice_alpha` | float | Angle between *b* and *c* axes (degrees) |
| `lattice_beta` | float | Angle between *a* and *c* axes (degrees) |
| `lattice_gamma` | float | Angle between *a* and *b* axes (degrees) |
### Chemical Composition
| Key | Type | Description |
|-----|------|-------------|
| `reduced_formula` | str | Empirical (reduced) chemical formula of the structure |
---
## MOFID Properties
MOFID is a standardized identifier for MOF structures that encodes topology, nodes, linkers, and catenation information.
| Key | Type | Description |
|-----|------|-------------|
| `mofid` | str | Full MOFID identifier string. Format: `{nodes}.{linkers} MOFid-v1.{topology}.cat{n}`. |
| `mofkey` | str | MOFKey identifier (a hash-based representation of the MOF structure). Format: `{hash}.{topology}.MOFkey-v1.{short_code}`. |
| `smiles_nodes` | str | Concatenated SMILES strings of all distinct metal nodes (`.`-separated). |
| `smiles_linkers` | str | Concatenated SMILES strings of all distinct organic linkers (`.`-separated). |
| `topology` | str | Three-letter RCSR topology code (e.g., `"pcu"`, `"dia"`, `"fcu"`). |
| `topology_v2` | str | Alternative topology assignment (may differ from primary if ambiguous) |
| `cat` | int | Catenation number (degree of interpenetration). 0 = non-catenated, n = n-fold catenated |
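
As an illustration of the `mofid` format given in the table, the sketch below splits an identifier into its parts. It assumes the string follows `{nodes}.{linkers} MOFid-v1.{topology}.cat{n}` exactly as documented above; tolerating an optional `;comment` suffix is an additional assumption. `parse_mofid` and `ParsedMofid` are illustrative names, not part of the MOFID package.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParsedMofid:
    smiles: List[str]      # node and linker SMILES, in the order they appear
    topology: str          # e.g. "pcu", or "UNKNOWN"/"ERROR" on failure
    cat: Optional[int]     # catenation number, None if not a plain integer


def parse_mofid(mofid: str) -> ParsedMofid:
    """Split a MOFid-v1 string into SMILES fragments, topology and catenation."""
    marker = " MOFid-v1."
    if marker not in mofid:
        raise ValueError("not a MOFid-v1 string")
    smiles_part, tail = mofid.split(marker, 1)
    tail = tail.split(";", 1)[0]  # drop an optional trailing comment (assumed)
    topology, _, cat_part = tail.partition(".cat")
    cat = int(cat_part) if cat_part.isdigit() else None
    smiles = [s for s in smiles_part.split(".") if s]
    return ParsedMofid(smiles=smiles, topology=topology, cat=cat)
```

The string alone does not say which fragments are nodes and which are linkers, so the separate `smiles_nodes` and `smiles_linkers` fields remain the better source for that distinction.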
---
## Crystal Symmetry
Computed using [pymatgen](https://pymatgen.org/)'s `SpacegroupAnalyzer`.
| Key | Type | Description |
|-----|------|-------------|
| `spacegroup` | str | Crystal system from space group analysis at `symprec=0.01` (e.g., `"cubic"`, `"triclinic"`) |
| `spacegroup_v2` | str | Crystal system from space group analysis at `symprec=0.1` (more tolerant symmetry detection) |
### Detailed Crystal Symmetry (nested under `properties.crystal_symmetry`)
| Key | Type | Description |
|-----|------|-------------|
| `symprec_0.01/pointgroup` | str | Point group symbol (Hermann-Mauguin notation) |
| `symprec_0.01/spacegroup` | str | Space group symbol (Hermann-Mauguin notation) |
| `symprec_0.01/spacegroup_number` | int | International Tables space group number (1-230) |
| `symprec_0.01/spacegroup_crystal` | str | Crystal system name |
| `symprec_0.1/pointgroup` | str | Point group symbol (at looser tolerance) |
| `symprec_0.1/spacegroup` | str | Space group symbol (at looser tolerance) |
| `symprec_0.1/spacegroup_number` | int | Space group number (at looser tolerance) |
| `symprec_0.1/spacegroup_crystal` | str | Crystal system name (at looser tolerance) |
---
## Zeo++ Geometric Properties
Computed using [Zeo++](http://www.zeoplusplus.org/) via the pyzeo wrapper. These properties characterize the pore geometry and accessibility using a spherical probe (default: N₂ probe radius of 1.86 Å).
### Pore Descriptors
| Key | Type | Unit | Description |
|-----|------|------|-------------|
| `lcd` | float | Å | **Largest Cavity Diameter** – Diameter of the largest sphere that can fit in the pore without overlapping framework atoms |
| `pld` | float | Å | **Pore Limiting Diameter** – Diameter of the largest sphere that can percolate through the framework (i.e., the narrowest point along the largest channel) |
| `dif` | float | Å | **Diameter of Included sphere along Free path** – Diameter of the largest sphere that can diffuse along the accessible path |
| `number_of_channels` | int | — | Number of distinct connected channel systems in the framework |
| `number_of_pockets` | int | — | Number of isolated pores (inaccessible to the probe molecule) |
### Volume Properties
| Key | Type | Unit | Description |
|-----|------|------|-------------|
| `av_volume_fraction` | float | — | Fraction of unit cell volume that is accessible to the probe |
| `av_cm3_per_g` | float | cm³/g | Accessible pore volume per gram of framework |
| `nav_volume_fraction` | float | — | Fraction of unit cell volume that is non-accessible (pocket volume) |
| `nav_cm3_per_g` | float | cm³/g | Non-accessible volume per gram of framework |
| `channel_volume_fraction` | float | — | Fraction of total void volume that belongs to channels |
| `pocket_volume_fraction` | float | — | Fraction of total void volume that belongs to pockets |
### Surface Area Properties
| Key | Type | Unit | Description |
|-----|------|------|-------------|
| `asa_m2_per_cm3` | float | m²/cm³ | Accessible surface area per unit volume |
| `asa_m2_per_g` | float | m²/g | **Accessible Surface Area** per gram (comparable to BET surface area) |
| `nasa_m2_per_cm3` | float | m²/cm³ | Non-accessible surface area per unit volume |
| `nasa_m2_per_g` | float | m²/g | Non-accessible surface area per gram |
| `channel_surface_area_fraction` | float | — | Fraction of total surface area belonging to channels |
| `pocket_surface_area_fraction` | float | — | Fraction of total surface area belonging to pockets |
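
A common use of these descriptors is screening for structures whose pores admit a given guest. The sketch below filters on `pld` and `lcd` and treats missing or `None` values as a failed check. The default threshold of 3.72 Å is simply twice the 1.86 Å probe radius quoted above and is an illustrative choice, as is the function name `passes_pore_filter`.

```python
from collections.abc import Mapping


def passes_pore_filter(geom: Mapping, min_pld: float = 3.72, min_lcd: float = 0.0) -> bool:
    """Return True if the pore limiting and largest cavity diameters (Å)
    both meet the given minimums. Missing or None values fail the filter."""
    pld = geom.get("pld")
    lcd = geom.get("lcd")
    if pld is None or lcd is None:
        return False
    return pld >= min_pld and lcd >= min_lcd
```

Here `geom` stands for the dictionary stored at `row.data['properties']['pyzeo_geometric_properties']`.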
---
## ORB Properties
Properties computed using the [ORB](https://github.com/orbital-materials/orb-models) machine-learned interatomic potential.
### Energy and Forces
| Key | Type | Unit | Description |
|-----|------|------|-------------|
| `orb_energy_per_atom` | float | eV/atom | Total predicted potential energy divided by number of atoms |
| `orb_max_force` | float | eV/Å | Maximum force magnitude on any atom in the structure |
### ORB Latent Embeddings
ORB latent embeddings are stored as NumPy files in the `sample_latents/` and `relaxed_latents/` directories.
**File naming:** `orb_latent_{layer}_{component}.npy`
| File Pattern | Shape | Description |
|--------------|-------|-------------|
| `orb_latent_{0-4}_graph` | (N, 256) | Graph-level pooled latent |
| `orb_latent_{0-4}_nodes_and_bridges` | (N, 256) | Mean-pooled over metal nodes |
| `orb_latent_{0-4}_linkers` | (N, 256) | Mean-pooled over organic linkers |
| `orb_latent_{0-4}_bound_solvent` | (N, 256) | Mean-pooled over bound solvents |
| `orb_latent_{0-4}_free_solvent` | (N, 256) | Mean-pooled over free solvents |
- Layers 0-4 correspond to different depths in the ORB GNN (layer 4 = final layer)
- **Zero vectors** indicate missing data (e.g., structures without solvents)
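
Because zero vectors mark missing data, they should be masked out before any similarity computation. The following is a minimal sketch of a cosine nearest-neighbour lookup over an `(N, 256)` latent array; `cosine_top_k` is a hypothetical helper, and cosine similarity is one reasonable metric rather than one prescribed by the dataset.

```python
import numpy as np


def cosine_top_k(latents: np.ndarray, query_index: int, k: int = 5) -> np.ndarray:
    """Return 0-indexed positions of the k rows most similar to the query row.

    Rows that are all zeros (missing data) are never returned, and a
    zero-vector query is rejected.
    """
    latents = np.asarray(latents, dtype=float)
    norms = np.linalg.norm(latents, axis=1)
    valid = norms > 0
    if not valid[query_index]:
        raise ValueError("query row is a zero vector (missing data)")
    unit = np.zeros_like(latents)
    unit[valid] = latents[valid] / norms[valid, None]
    scores = unit @ unit[query_index]
    scores[~valid] = -np.inf        # exclude missing rows
    scores[query_index] = -np.inf   # exclude the query itself
    n_candidates = int(valid.sum()) - 1
    order = np.argsort(-scores)
    return order[: min(k, n_candidates)]
```

The returned positions are 0-indexed array indices; add 1 to each to obtain the matching ASE row id.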
---
## MOFChecker Properties
Computed using [MOFChecker](https://github.com/kjappelbaum/mofchecker), a tool for validating MOF structures. All keys are prefixed with `mofchecker_`.
### Validity Checks (Binary)
These descriptors are used to determine overall MOF validity. **True indicates a problem** (except where noted).
| Key | Type | Description |
|-----|------|-------------|
| `mofchecker_valid` | bool | Overall validity flag. `True` if structure passes all validity checks. |
| `mofchecker_no_carbon` | bool | `True` if structure contains no carbon atoms (invalid for organic-based MOFs) |
| `mofchecker_no_hydrogen` | bool | `True` if structure contains no hydrogen atoms |
| `mofchecker_no_metal` | bool | `True` if structure contains no metal atoms |
| `mofchecker_has_atomic_overlaps` | bool | `True` if any atoms are too close together |
| `mofchecker_has_lone_molecule` | bool | `True` if structure contains disconnected molecular fragments |
| `mofchecker_has_overcoordinated_c` | bool | `True` if any carbon has too many bonds |
| `mofchecker_has_overcoordinated_n` | bool | `True` if any nitrogen has too many bonds |
| `mofchecker_has_overcoordinated_h` | bool | `True` if any hydrogen has too many bonds |
| `mofchecker_has_undercoordinated_c` | bool | `True` if any carbon has too few bonds |
| `mofchecker_has_undercoordinated_n` | bool | `True` if any nitrogen has too few bonds |
| `mofchecker_has_undercoordinated_rare_earth` | bool | `True` if any rare earth metal is undercoordinated |
| `mofchecker_has_undercoordinated_alkali_alkaline` | bool | `True` if any alkali/alkaline earth metal is undercoordinated |
| `mofchecker_has_suspicious_terminal_oxo` | bool | `True` if structure has potentially incorrect terminal oxo groups on metals |
| `mofchecker_has_geometrically_exposed_metal` | bool | `True` if any metal has unusual coordination geometry |
| `mofchecker_has_high_charges` | bool | `True` if computed partial charges are unusually high |
### Informative Checks (Binary, not used for validity)
| Key | Type | Description |
|-----|------|-------------|
| `mofchecker_has_oms` | bool | `True` if structure has Open Metal Sites (coordinatively unsaturated metals) |
| `mofchecker_has_3d_connected_graph` | bool | `True` if the framework is 3D-connected (expected for MOFs) |
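
When `mofchecker_valid` is `False`, it is often useful to see which flags were raised. The sketch below lists the problem flags that are `True`, skipping the overall flag and the two informative checks. The exact rule used to compute `mofchecker_valid` is not spelled out here, so this helper only reports raised flags and does not try to reproduce that rule; `failed_checks` is an illustrative name.

```python
from collections.abc import Mapping
from typing import List

# Checks that the tables above describe as informative only.
INFORMATIVE_KEYS = {"mofchecker_has_oms", "mofchecker_has_3d_connected_graph"}


def failed_checks(mofchecker: Mapping) -> List[str]:
    """Return the sorted names of problem flags that are set to True."""
    return sorted(
        key
        for key, value in mofchecker.items()
        if key.startswith("mofchecker_")
        and key != "mofchecker_valid"
        and key not in INFORMATIVE_KEYS
        and value is True  # ignores hash strings and other non-boolean entries
    )
```

Here `mofchecker` stands for the dictionary stored at `row.data['properties']['mofchecker']`.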
### Structure Hashes
| Key | Type | Description |
|-----|------|-------------|
| `mofchecker_graph_hash` | str | Hash of the full structure graph (atoms + bonds) |
| `mofchecker_undecorated_graph_hash` | str | Hash of graph with hydrogen atoms removed |
| `mofchecker_decorated_scaffold_hash` | str | Hash of framework scaffold with decorations |
| `mofchecker_undecorated_scaffold_hash` | str | Hash of bare framework scaffold |
| `mofchecker_symmetry_hash` | str | Hash encoding symmetry information |
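
One possible use of these hashes is spotting structures that share the same graph. The sketch below groups row ids by hash value and keeps only groups with more than one member. Treating equal hashes as duplicates is an assumption made for illustration, and `group_by_hash` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_hash(pairs: Iterable[Tuple[int, str]]) -> Dict[str, List[int]]:
    """Group (row_id, hash) pairs by hash and return only the hashes
    that occur for more than one row."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for row_id, hash_value in pairs:
        groups[hash_value].append(row_id)
    return {h: ids for h, ids in groups.items() if len(ids) > 1}
```

The pairs can be collected while iterating over a database, for example from `row.id` and the `mofchecker_graph_hash` entry of each row.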
---
## MOF Fragment Properties
Properties of the decomposed MOF components (nodes, linkers, solvents). Stored under `properties.mof_fragments`.
### Component Types
MOF structures are decomposed into four component types:
- **nodes_and_bridges**: Metal nodes and bridging groups
- **linkers**: Organic linker molecules
- **bound_solvent**: Solvent molecules coordinated to metal centers
- **free_solvent**: Unbound solvent molecules in pores
### Fragment Formulas
| Key | Type | Description |
|-----|------|-------------|
| `{component}_formulas` | List[str] | Chemical formulas of each fragment of this component type |
*Example: `nodes_and_bridges_formulas = ["Zn4O", "Zn4O"]` for a structure with two identical zinc nodes*
### Linker SMILES
| Key | Type | Description |
|-----|------|-------------|
| `linkers_smiles` | List[str] | Full SMILES strings for each linker fragment, including stereochemistry and charges where applicable |
| `linkers_simple_smiles` | List[str] | Simplified SMILES (scaffold only, no stereochemistry). More robust for parsing but less chemically accurate |
---
## Linker Properties
Molecular descriptors and fingerprints for organic linker molecules. Stored under `properties.linker_properties`.
### Morgan Fingerprints
Morgan (circular) fingerprints are stored as NumPy files. For similarity search, use the standardized versions.
| File | Description |
|------|-------------|
| `linkers_morgan_ecfp4.npy` | ECFP4 (radius=2), 2048-bit |
| `linkers_morgan_ecfp6.npy` | ECFP6 (radius=3), 2048-bit |
| `linkers_morgan_ecfp4_standardized.npy` | ECFP4 from standardized molecules |
| `linkers_morgan_ecfp6_standardized.npy` | ECFP6 from standardized molecules |
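
For similarity search over fingerprints, Tanimoto similarity is a common choice. The sketch below compares two fingerprints of equal length in pure Python. The array layout inside the `.npy` files (for example, one row per linker or per structure) is not documented here, so the function takes two already-extracted vectors; any nonzero entry is treated as a set bit.

```python
from typing import Sequence


def tanimoto(fp_a: Sequence[int], fp_b: Sequence[int]) -> float:
    """Tanimoto similarity: shared set bits divided by bits set in either vector.

    Returns 0.0 when both fingerprints are empty (no bits set).
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0
```

The function also accepts rows of a NumPy array, since it only iterates over the elements.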
**Scalar metadata:**
| Key | Type | Description |
|-----|------|-------------|
| `linkers_smiles_used` | List[str] | Which SMILES string was successfully parsed for each linker (original, fixed, or simple) |
| `linkers_smiles_standardized` | List[str] | Chemically standardized SMILES (neutralized, canonical tautomer) |
| `linkers_morgan_count_sum` | List[int] | Sum of Morgan fingerprint bit counts (molecular complexity proxy) |
| `linkers_morgan_count_sum_max` | List[int] | Maximum count in Morgan fingerprint (indicates highly represented substructures) |
| `linkers_morgan_count_sum_standardized` | List[int] | Sum of counts for standardized fingerprints |
| `linkers_morgan_count_sum_max_standardized` | List[int] | Maximum count for standardized fingerprints |
### Molecular Descriptors
Computed on standardized molecules using RDKit.
| Key | Type | Description |
|-----|------|-------------|
| `linkers_rotatable_bonds` | List[int] | Number of rotatable bonds per linker (flexibility metric) |
| `linkers_ring_count` | List[int] | Number of rings per linker |
### Coordination Site Descriptors
Counts of metal-coordinating functional groups (computed on as-parsed molecules).
| Key | Type | Description |
|-----|------|-------------|
| `linkers_coordination_site_count` | List[int] | Total number of potential metal coordination sites per linker |
| `linkers_coordination_site_breakdown` | List[Dict] | Breakdown by coordination site type |
| `linkers_carboxylate_count` | List[int] | Number of carboxylate groups (-COO⁻/-COOH) |
| `linkers_pyridine_count` | List[int] | Number of aromatic nitrogen sites |
| `linkers_imidazole_n_count` | List[int] | Number of imidazole/triazole NH groups |
| `linkers_primary_amine_count` | List[int] | Number of primary amine groups (-NH₂) |
| `linkers_secondary_amine_count` | List[int] | Number of secondary amine groups (-NH-) |
| `linkers_tertiary_amine_count` | List[int] | Number of tertiary amine groups (-N<) |
| `linkers_phosphonate_count` | List[int] | Number of phosphonate groups |
| `linkers_sulfonate_count` | List[int] | Number of sulfonate groups |
| `linkers_phenolic_oh_count` | List[int] | Number of phenolic hydroxyl groups |
| `linkers_alcoholic_oh_count` | List[int] | Number of alcoholic hydroxyl groups |
| `linkers_thiol_count` | List[int] | Number of thiol groups (-SH) |
| `linkers_nitrile_count` | List[int] | Number of nitrile groups (-C≡N) |
---
## Validation Metrics
Binary metrics used to assess structure quality.
| Key | Type | Description |
|-----|------|-------------|
| `no_atom_too_close` | bool | `True` if all interatomic distances are physically reasonable |
| `smact_valid` | bool | `True` if composition passes SMACT electronegativity/charge balance checks |
| `reconstruction_failed` | bool | `True` if structure reconstruction from latent space failed |
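
These three flags can be combined into a single pass/fail check. The sketch below treats a missing flag as a failure, which is a conservative assumption; where the metrics sit inside `row.data` is not stated here, so the function takes a plain mapping, and `passes_validation` is an illustrative name.

```python
from collections.abc import Mapping


def passes_validation(metrics: Mapping) -> bool:
    """True only if both positive checks pass and reconstruction did not fail.

    Missing keys are treated as failing the corresponding check.
    """
    return (
        bool(metrics.get("no_atom_too_close", False))
        and bool(metrics.get("smact_valid", False))
        and not bool(metrics.get("reconstruction_failed", True))
    )
```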
---
## License
[CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/deed.en)
## References
- **MOFID**: Bucior, B. J., et al. (2019). [Identification Schemes for Metal-Organic Frameworks...](https://doi.org/10.1021/acs.cgd.9b00050)
- **Zeo++**: Willems, T. F., et al. (2012). [Algorithms and tools for high-throughput geometry-based analysis...](https://doi.org/10.1016/j.micromeso.2011.08.020)
- **MOFChecker**: Ongari, D., et al. (2019). [Building a Consistent and Reproducible Database for Adsorption Evaluation...](https://doi.org/10.1021/acscentsci.9b00619)
- **QMOF**: Rosen, A. S., et al. (2021). [Machine learning the quantum-chemical properties of metal–organic frameworks for accelerated materials discovery](https://doi.org/10.1016/j.matt.2021.02.015). Dataset release on [GitHub](https://github.com/Andrew-S-Rosen/QMOF).
- **ORB**: [Orbital ORB v3 Force Field](https://github.com/orbital-materials/orb-models)
- **RDKit Morgan Fingerprints**: [RDKit Documentation](https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints)
Provided by: Orbital-Materials



