
ZINC_22

ModelScope community. Updated 2025-12-05; indexed 2025-08-16.
Download link:
https://modelscope.cn/datasets/chandar-lab/ZINC_22
Resource description:
# ZINC_22 Pretraining Dataset

## Dataset Description

This dataset is derived from the **ZINC-22** database (~70B synthesizable compounds as of Sept 2024) and was prepared for large-scale pretraining of molecular language models.

We randomly sampled **1.5 billion molecules** using a **stratified heavy-atom count split** (4–49 atoms) to ensure coverage of diverse molecular sizes.

All molecules were **deduplicated**, **canonicalized** in SMILES format, and **converted** into multiple string representations: SMILES, SELFIES, SAFE, and DeepSMILES.

---

## Precomputed Statistics

This repository includes precomputed reference statistics (`*_stats.pkl`) for evaluating generated molecules against the validation and test sets. These statistics are used to compute the following metrics:

- **FCD**: Fréchet ChemNet Distance
- **SNN**: Similarity to Nearest Neighbor
- **Frag**: Fragment similarity (BRICS decomposition)
- **Scaf**: Scaffold similarity (Bemis–Murcko scaffolds)

### File Naming Convention

Files are provided for multiple reference set sizes:

- `_175k` → 175,000 molecules
- `_500k` → 500,000 molecules
- `_1M` → 1 million molecules
- `_3M` → 3 million molecules
- *(no suffix)* → full set

By convention:

- `valid_stats_*` → computed from the **random validation split**
- `test_stats_*` → computed from the **scaffold-based split**

These statistics enable **consistent and reproducible** evaluation across experiments.

---

## How to Use

Before running the example below, make sure the following packages are installed:

```bash
pip install rdkit fcd-torch huggingface_hub
```

### Example: Download stats from the Hub and compute FCD

```python
import pickle

from fcd_torch import FCD as FCDMetric
from huggingface_hub import hf_hub_download

# 1. Download the precomputed stats file from the Hugging Face Hub
stats_path = hf_hub_download(
    repo_id="chandar-lab/ZINC_22",
    repo_type="dataset",
    filename="valid_stats_175k.pkl",  # change to the desired file
)

# 2. Load the reference stats
# Note: only unpickle files from a source you trust.
with open(stats_path, "rb") as f:
    reference_stats = pickle.load(f)

# 3. Compute FCD for your generated molecules
generated_smiles = ["CCO", "CCN", "CCCN", "CCCN"]  # replace with your generated set
fcd_calculator = FCDMetric(batch_size=4)
fcd_value = fcd_calculator(gen=generated_smiles, pref=reference_stats["FCD"])
print(f"FCD score: {fcd_value:.4f}")
```

## Citation

```bibtex
@misc{chitsaz2025novomolgenrethinkingmolecularlanguage,
  title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
  author={Kamran Chitsaz and Roshan Balaji and Quentin Fournier and Nirav Pravinbhai Bhatt and Sarath Chandar},
  year={2025},
  eprint={2508.13408},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.13408},
}
```
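
The file naming convention described above can be captured in a small helper, which may be convenient when looping over several reference set sizes. This is a minimal sketch: `stats_filename` and `KNOWN_SIZES` are hypothetical names, not part of the repository, and the assumption that the full-set file is named `<split>_stats.pkl` is inferred from the "no suffix" rule rather than confirmed by the source.

```python
from typing import Optional

# Size suffixes listed in the file naming convention; None stands for the full set.
KNOWN_SIZES = ("175k", "500k", "1M", "3M")


def stats_filename(split: str, size: Optional[str] = None) -> str:
    """Build a stats file name such as 'valid_stats_175k.pkl'.

    Hypothetical helper for illustration; it only formats a string and
    does not check that the file exists in the repository.
    """
    if split not in ("valid", "test"):
        raise ValueError(f"split must be 'valid' or 'test', got {split!r}")
    if size is None:
        # Assumption: the full-set file simply drops the size suffix.
        return f"{split}_stats.pkl"
    if size not in KNOWN_SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {KNOWN_SIZES}")
    return f"{split}_stats_{size}.pkl"


print(stats_filename("valid", "175k"))  # valid_stats_175k.pkl
print(stats_filename("test", "1M"))     # test_stats_1M.pkl
```

The returned string can be passed as the `filename` argument of `hf_hub_download` in the example above.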

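The dataset description mentions deduplication and stratified sampling by heavy-atom count (4–49 atoms). The sketch below illustrates one way such a step could look on a toy scale. It is not the authors' pipeline: the function name `stratified_sample`, the equal per-stratum quota, and the input format of precomputed `(canonical_smiles, heavy_atom_count)` pairs are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def stratified_sample(
    molecules: Iterable[Tuple[str, int]],
    per_stratum: int,
    min_atoms: int = 4,
    max_atoms: int = 49,
    seed: int = 0,
) -> List[str]:
    """Sample up to `per_stratum` unique SMILES from each heavy-atom-count bin.

    `molecules` yields (canonical_smiles, heavy_atom_count) pairs; computing
    those values (for example with RDKit) is outside this sketch.
    """
    strata: Dict[int, Set[str]] = defaultdict(set)
    for smiles, n_heavy in molecules:
        if min_atoms <= n_heavy <= max_atoms:
            strata[n_heavy].add(smiles)  # the set drops duplicate strings

    rng = random.Random(seed)
    sampled: List[str] = []
    for n_heavy in sorted(strata):
        pool = sorted(strata[n_heavy])  # sorted so a fixed seed is reproducible
        k = min(per_stratum, len(pool))
        sampled.extend(rng.sample(pool, k))
    return sampled


# Toy input: one duplicate, and one molecule below the 4-atom lower bound.
toy = [("CCCC", 4), ("CCCC", 4), ("CCCCO", 5), ("CCCCN", 5), ("CCO", 3)]
picked = stratified_sample(toy, per_stratum=1)
print(picked)
```

Deduplicating on the string only works if every SMILES has already been canonicalized, which is why the sketch assumes canonical input.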
Provider: maas
Created: 2025-08-09