ZINC_22
Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-16.
Download link: https://modelscope.cn/datasets/chandar-lab/ZINC_22
# ZINC_22 Pretraining Dataset
## Dataset Description
This dataset is derived from the **ZINC-22** database (~70B synthesizable compounds as of September 2024) and was prepared for large-scale pretraining of molecular language models. We randomly sampled **1.5 billion molecules**, **stratified by heavy-atom count** (4–49 atoms), to cover a wide range of molecular sizes.
All molecules were **deduplicated**, **canonicalized** as SMILES, and **converted** into four string representations: SMILES, SELFIES, SAFE, and DeepSMILES.
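
To make the sampling idea concrete, here is a minimal sketch of stratified sampling by heavy-atom count. It is not the authors' pipeline: the function name `stratified_sample`, the equal per-stratum quota, and the assumption that heavy-atom counts are precomputed are all illustrative choices.

```python
import random
from collections import defaultdict


def stratified_sample(molecules, per_stratum, min_atoms=4, max_atoms=49, seed=0):
    """Sample up to `per_stratum` molecules from each heavy-atom-count bucket.

    `molecules` is an iterable of (smiles, heavy_atom_count) pairs. The counts
    are assumed to be precomputed, e.g. with a cheminformatics toolkit.
    """
    rng = random.Random(seed)
    buckets = defaultdict(set)  # a set per bucket also drops exact duplicates
    for smiles, n_heavy in molecules:
        if min_atoms <= n_heavy <= max_atoms:
            buckets[n_heavy].add(smiles)

    sample = []
    for n_heavy in sorted(buckets):
        pool = sorted(buckets[n_heavy])  # sort for reproducible draws
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample


# Toy input: "CC" has only 2 heavy atoms and falls outside the 4-49 range.
toy = [("CCCC", 4), ("CCCCO", 5), ("CCCCN", 5), ("CC", 2), ("CCCCCC", 6)]
picked = stratified_sample(toy, per_stratum=1)
```

Drawing a fixed quota per bucket keeps rare molecule sizes represented; a proportional quota would instead mirror the size distribution of the source database.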
---
## Precomputed Statistics
This repository includes precomputed reference statistics (`*_stats.pkl`) for evaluating generated molecules against validation and test sets.
These statistics are used to compute the following metrics:
- **FCD** – Fréchet ChemNet Distance
- **SNN** – Similarity to Nearest Neighbor
- **Frag** – Fragment similarity (BRICS decomposition)
- **Scaf** – Scaffold similarity (Bemis–Murcko scaffolds)
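
As a rough illustration of what the last three metrics measure, the sketch below follows the definitions commonly used in the MOSES benchmark: SNN as the mean nearest-neighbor Tanimoto similarity, and Frag/Scaf as the cosine similarity between occurrence counts. Whether this repository uses exactly these definitions is an assumption, and the fingerprints and fragment labels here are toy placeholders rather than real chemistry.

```python
import math
from collections import Counter


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def snn(generated_fps, reference_fps):
    """Mean, over generated molecules, of the best similarity to any reference molecule."""
    best = [max(tanimoto(g, r) for r in reference_fps) for g in generated_fps]
    return sum(best) / len(best)


def count_cosine(generated_items, reference_items):
    """Cosine similarity between occurrence-count vectors (fragments or scaffolds)."""
    g, r = Counter(generated_items), Counter(reference_items)
    dot = sum(g[k] * r[k] for k in g.keys() & r.keys())
    norm = math.sqrt(sum(v * v for v in g.values())) * math.sqrt(sum(v * v for v in r.values()))
    return dot / norm if norm else 0.0


# Toy fingerprints (sets of on-bit indices) and toy fragment labels.
snn_value = snn([{1, 2, 3}, {4, 5}], [{1, 2, 3}, {4, 6}])
frag_value = count_cosine(["a", "a", "b"], ["a", "a", "b"])
```

For all three metrics, higher values mean the generated set is closer to the reference set.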
### File Naming Convention
Files are provided for multiple reference set sizes:
- `_175k` → 175,000 molecules
- `_500k` → 500,000 molecules
- `_1M` → 1 million molecules
- `_3M` → 3 million molecules
- *(no suffix)* → full set
By convention:
- `valid_stats_*` → computed from the **random validation split**
- `test_stats_*` → computed from the **scaffold-based split**
These statistics enable **consistent and reproducible** evaluation across experiments.
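
The naming convention can be captured in a small helper. This is a sketch: `stats_filename` is a hypothetical function, and the unsuffixed full-set name (for example `valid_stats.pkl`) is inferred from the convention above rather than confirmed against the repository listing.

```python
SIZE_SUFFIXES = ("175k", "500k", "1M", "3M")


def stats_filename(split, size=None):
    """Build a stats filename such as 'valid_stats_175k.pkl'.

    `split` is 'valid' (random split) or 'test' (scaffold-based split).
    `size=None` selects the full reference set; its exact filename is an assumption.
    """
    if split not in ("valid", "test"):
        raise ValueError(f"unknown split: {split!r}")
    if size is None:
        return f"{split}_stats.pkl"
    if size not in SIZE_SUFFIXES:
        raise ValueError(f"unknown size suffix: {size!r}")
    return f"{split}_stats_{size}.pkl"


name = stats_filename("valid", "175k")
```

Check the result against the repository's file list before downloading, since only `valid_stats_175k.pkl` is named explicitly in this card.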
---
## How to Use
Before running the example below, make sure you have these packages installed:
```bash
pip install rdkit fcd-torch huggingface_hub
```
### Example: Download stats from the Hub and compute FCD
```python
import pickle

from fcd_torch import FCD as FCDMetric
from huggingface_hub import hf_hub_download

# 1. Download the precomputed stats file from the Hugging Face Hub
stats_path = hf_hub_download(
    repo_id="chandar-lab/ZINC_22",
    repo_type="dataset",
    filename="valid_stats_175k.pkl",  # change to the desired file
)

# 2. Load the reference stats (only unpickle files from sources you trust)
with open(stats_path, "rb") as f:
    reference_stats = pickle.load(f)

# 3. Compute FCD for your generated molecules
generated_smiles = ["CCO", "CCN", "CCCN", "CCCN"]  # replace with your generated set
fcd_calculator = FCDMetric(batch_size=4)
fcd_value = fcd_calculator(gen=generated_smiles, pref=reference_stats["FCD"])
print(f"FCD score: {fcd_value:.4f}")
```
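
For reference, the Fréchet distance underlying FCD compares two Gaussians fitted to ChemNet activations, one for the generated set $G$ and one for the reference set $R$. Assuming the entry stored under `"FCD"` holds the reference mean $\mu_R$ and covariance $\Sigma_R$ (an assumption about the file layout, not something this card states), the standard form is:

$$
\mathrm{FCD}(G, R) = \lVert \mu_G - \mu_R \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_G + \Sigma_R - 2\,(\Sigma_G \Sigma_R)^{1/2}\right)
$$

Lower values indicate that the generated molecules are distributed more like the reference set. A four-molecule toy list like the one in the example gives a very noisy covariance estimate, so meaningful scores need a much larger generated set.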
## Citation
```bibtex
@misc{chitsaz2025novomolgenrethinkingmolecularlanguage,
title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
author={Kamran Chitsaz and Roshan Balaji and Quentin Fournier and Nirav Pravinbhai Bhatt and Sarath Chandar},
year={2025},
eprint={2508.13408},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2508.13408},
}
```
Provider: maas
Created: 2025-08-09



