ZINC_22

Name: ZINC_22
Creator: maas
Published: 2025-12-05 12:08:38
License: not specified

Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-16.

Download link:

https://modelscope.cn/datasets/chandar-lab/ZINC_22

Description:

# ZINC_22 Pretraining Dataset

## Dataset Description

This dataset is derived from the **ZINC-22** database (approximately 70 billion synthesizable compounds as of September 2024) and was prepared for large-scale pretraining of molecular language models.

We randomly sampled **1.5 billion molecules** using a **stratified heavy-atom count split** (4–49 atoms) to ensure coverage of diverse molecular sizes. All molecules were **deduplicated**, **canonicalized** in SMILES format, and **converted** into multiple string representations: SMILES, SELFIES, SAFE, and DeepSMILES.

---

## Precomputed Statistics

This repository includes precomputed reference statistics (`*_stats.pkl`) for evaluating generated molecules against the validation and test sets. These statistics are used to compute the following metrics:

- **FCD**: Fréchet ChemNet Distance
- **SNN**: Similarity to Nearest Neighbor
- **Frag**: Fragment similarity (BRICS decomposition)
- **Scaf**: Scaffold similarity (Bemis–Murcko scaffolds)

### File Naming Convention

Files are provided for multiple reference set sizes:

- `_175k` → 175,000 molecules
- `_500k` → 500,000 molecules
- `_1M` → 1 million molecules
- `_3M` → 3 million molecules
- *(no suffix)* → full set

By convention:

- `valid_stats_*` → computed from the **random validation split**
- `test_stats_*` → computed from the **scaffold-based split**

These statistics enable **consistent and reproducible** evaluation across experiments.

---

## How to Use

Before running the example below, make sure the following packages are installed (`huggingface_hub` is needed for the download step):

```bash
pip install rdkit fcd-torch huggingface_hub
```

### Example: Download stats from the Hub and compute FCD

```python
import pickle

from fcd_torch import FCD as FCDMetric
from huggingface_hub import hf_hub_download

# 1. Download the precomputed stats file from the Hugging Face Hub
stats_path = hf_hub_download(
    repo_id="chandar-lab/ZINC_22",
    repo_type="dataset",
    filename="valid_stats_175k.pkl",  # change to the desired file
)

# 2. Load the reference stats
with open(stats_path, "rb") as f:
    reference_stats = pickle.load(f)

# 3. Compute FCD for your generated molecules
generated_smiles = ["CCO", "CCN", "CCCN", "CCCN"]  # replace with your generated set
fcd_calculator = FCDMetric(batch_size=4)
fcd_value = fcd_calculator(gen=generated_smiles, pref=reference_stats["FCD"])

print(f"FCD score: {fcd_value:.4f}")
```

## Citation

```bibtex
@misc{chitsaz2025novomolgenrethinkingmolecularlanguage,
  title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
  author={Kamran Chitsaz and Roshan Balaji and Quentin Fournier and Nirav Pravinbhai Bhatt and Sarath Chandar},
  year={2025},
  eprint={2508.13408},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.13408},
}
```
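The file naming convention described under "Precomputed Statistics" can be captured in a small helper, which may reduce typos when switching between reference sets. This is only a sketch: the function name `stats_filename` is hypothetical, and the full-set name (for example `valid_stats.pkl`) is an assumption inferred from the "no suffix" rule. The card confirms only `valid_stats_175k.pkl` by example, so check the repository file list before relying on other combinations.

```python
from typing import Optional

# Size suffixes listed in the dataset card; None stands for the full set.
KNOWN_SIZES = ("175k", "500k", "1M", "3M")
# "valid" = random validation split, "test" = scaffold-based split.
KNOWN_SPLITS = ("valid", "test")


def stats_filename(split: str, size: Optional[str] = None) -> str:
    """Build a reference-statistics file name such as 'valid_stats_175k.pkl'."""
    if split not in KNOWN_SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {KNOWN_SPLITS}")
    if size is None:
        # Assumption: the full-set file simply drops the size suffix.
        return f"{split}_stats.pkl"
    if size not in KNOWN_SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {KNOWN_SIZES}")
    return f"{split}_stats_{size}.pkl"


print(stats_filename("valid", "175k"))  # valid_stats_175k.pkl
print(stats_filename("test", "3M"))     # test_stats_3M.pkl
```

The returned string can be passed as the `filename` argument of `hf_hub_download` in the example above.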


Provider:

maas

Created:

2025-08-09
