leon14159/moleculenet-unimol-lmdb
Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/leon14159/moleculenet-unimol-lmdb
Resource description:
---
license: mit
tags:
- chemistry
- molecular-property-prediction
- moleculenet
- uni-mol
size_categories:
- 100K<n<1M
---
# MoleculeNet Uni-Mol LMDB Data
Pre-split MoleculeNet datasets in LMDB format from [Uni-Mol](https://github.com/deepmodeling/Uni-Mol) (ICLR 2023).
## Split Protocol
- **Scaffold splitting** (Bemis-Murcko with `includeChirality=True`)
- **Ratio**: 8:1:1 (train:valid:test)
- **Protocol**: follows the split settings of GEM (Fang et al., 2022) and Uni-Mol (Zhou et al., ICLR 2023)
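
The split protocol above can be illustrated with a minimal sketch. It assumes the common "largest scaffold groups first" fill order used by several MoleculeNet pipelines; whether this matches the exact procedure used to produce these files is an assumption. `scaffold_split` and `murcko_scaffold` are hypothetical helper names, and the RDKit import is deferred so the grouping logic can be tried without RDKit installed.

```python
from collections import defaultdict


def murcko_scaffold(smiles):
    """Return the Bemis-Murcko scaffold SMILES, keeping chirality (needs RDKit)."""
    from rdkit.Chem.Scaffolds import MurckoScaffold  # third-party, imported lazily

    return MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles, includeChirality=True)


def scaffold_split(smiles_list, scaffold_fn=murcko_scaffold,
                   frac_train=0.8, frac_valid=0.1):
    """Split indices 8:1:1 so that no scaffold is shared between subsets.

    Assumption: groups are assigned largest first, filling train, then valid,
    with everything left over going to test.
    """
    # Group molecule indices by their scaffold.
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[scaffold_fn(smi)].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(smiles_list)
    train_cut = frac_train * n
    valid_cut = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cut:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cut:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because whole scaffold groups are assigned together, the realized ratio is only approximately 8:1:1 on real data.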
## Data Format
Each dataset directory contains `train.lmdb`, `valid.lmdb`, and `test.lmdb`. Each LMDB entry is a pickled dict with keys:
- `smi`: SMILES string
- `atoms`: atom type list
- `coordinates`: 3D coordinates array (shape: [num_conformers, num_atoms, 3])
- `target`: label(s)
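
A minimal sketch of reading these entries follows. It assumes each `.lmdb` is a single file rather than a directory (hence `subdir=False`) and uses the third-party `lmdb` package; `decode_entry` and `iter_entries` are hypothetical helper names. Note that unpickling executes arbitrary code, so only load files from a source you trust.

```python
import pickle

EXPECTED_KEYS = {"smi", "atoms", "coordinates", "target"}


def decode_entry(raw):
    """Unpickle one LMDB value and check it has the keys listed above."""
    entry = pickle.loads(raw)
    missing = EXPECTED_KEYS - set(entry)
    if missing:
        raise KeyError(f"entry is missing keys: {sorted(missing)}")
    return entry


def iter_entries(lmdb_path):
    """Yield every decoded entry from one LMDB file, opened read-only."""
    import lmdb  # third-party (py-lmdb), imported lazily

    # Assumption: the database is a single file, not a directory.
    env = lmdb.open(lmdb_path, subdir=False, readonly=True,
                    lock=False, readahead=False, meminit=False)
    try:
        with env.begin() as txn:
            for _key, raw in txn.cursor():
                yield decode_entry(raw)
    finally:
        env.close()
```

For example, `next(iter_entries("path/to/test.lmdb"))["smi"]` would return the SMILES string of the first stored molecule; the path here is a placeholder, not a documented directory layout.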
## Datasets
### Classification (ROC-AUC)
| Dataset | Tasks | Test Size |
|---------|-------|-----------|
| BBBP | 1 | 204 |
| BACE | 1 | 152 |
| ClinTox | 2 | 148 |
| Tox21 | 12 | 784 |
| SIDER | 27 | 143 |
| HIV | 1 | 4,113 |
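
For reference, ROC-AUC can be computed as the fraction of (positive, negative) pairs that the model ranks correctly, with ties counted as half. The sketch below is a dependency-free illustration, not the evaluation code used by Uni-Mol; averaging per-task scores for the multi-task datasets is a common convention and is assumed here. `roc_auc` and `mean_roc_auc` are hypothetical names.

```python
def roc_auc(labels, scores):
    """ROC-AUC for one binary task via pairwise ranking (O(P*N), fine for small sets)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    # A pair counts 1 if the positive is scored higher, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_roc_auc(label_columns, score_columns):
    """Average ROC-AUC over tasks (assumed convention for multi-task datasets)."""
    aucs = [roc_auc(y, s) for y, s in zip(label_columns, score_columns)]
    return sum(aucs) / len(aucs)
```

In practice a library routine such as scikit-learn's `roc_auc_score` is the more efficient choice on the larger test sets.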
### Regression (RMSE)
| Dataset | Tasks | Test Size |
|---------|-------|-----------|
| ESOL | 1 | 113 |
| FreeSolv | 1 | 65 |
| Lipophilicity | 1 | 420 |
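
RMSE is the square root of the mean squared difference between predictions and targets. A minimal, dependency-free sketch (the function name `rmse` is illustrative, not taken from the Uni-Mol code):

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Lower values are better, and the result is in the same units as the regression target.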
## Source
- **Original data**: [Uni-Mol GitHub](https://github.com/deepmodeling/Uni-Mol)
- **License**: MIT (DP Technology)
- **Paper**: [Uni-Mol: A Universal 3D Molecular Representation Learning Framework](https://openreview.net/forum?id=6K2RM6wVqKu)
Provided by:
leon14159



