datamol-io/safe-gpt

Name: datamol-io/safe-gpt
Creator: datamol-io
Published: 2026-01-12 13:04:26
License: No description available

Hugging Face, updated 2026-01-12, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/datamol-io/safe-gpt

Resource description:

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- chemistry
- molecules
- smiles
- safe
- drug-discovery
size_categories:
- 1B<n<10B
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train/*.parquet
  - split: validation
    path: data/validation/*.parquet
  - split: test
    path: data/test/*.parquet
---

# SAFE Molecules Dataset (v2)

A large-scale molecular dataset containing approximately **1.17 billion unique molecules**, each represented with both **canonical SMILES** and **SAFE (Sequential Attachment-based Fragment Embedding)** strings.

This dataset is intended to support **large-scale pretraining and evaluation of chemical language models**, including generative, conditional, and structure-aware modeling tasks.

> **Note**
> This is **version 2** of the SAFE dataset. The original v1 release contained invalid SAFE strings and is archived for reproducibility at
> [https://huggingface.co/datasets/datamol-io/safe-gpt/tree/b83175cd7394](https://huggingface.co/datasets/datamol-io/safe-gpt/tree/b83175cd7394)

## SAFE Representation

SAFE (Sequential Attachment-based Fragment Embedding) is a **fragment-based molecular string representation** that encodes molecules as **sequences of chemically meaningful fragments together with their attachment structure**.

In SAFE, molecules are decomposed into fragments using rule-based fragmentation, and the resulting fragments are arranged into a **deterministic sequence** that explicitly represents how fragments are connected. The representation is **fully reversible**, allowing exact reconstruction of the original molecular graph.

By operating at the **fragment level** rather than the atom level (as in SMILES), SAFE reduces syntactic fragility and naturally supports both **unconstrained molecular generation** and **structure-constrained tasks** (e.g., scaffold or fragment conditioning) using standard sequence models.

Additional resources:

* **SAFE GitHub repository**: [https://github.com/datamol-io/safe](https://github.com/datamol-io/safe)
* **SAFE-based models on Hugging Face**:
  * [SAFE-GPT 87M](https://huggingface.co/datamol-io/safe-gpt)
  * [NovoMolGen 32M-BPE](https://huggingface.co/bisectgroup/NovoMolGen_32M_SAFE_BPE)
  * [NVIDIA's GenMol 89M](https://huggingface.co/nvidia/NV-GenMol-89M-v2)

## Dataset Description

The dataset aggregates molecules from two major public chemical resources:

* **ZINC20**: ~1.0 billion commercially available, purchasable compounds
* **UniChem**: ~188 million compounds aggregated from multiple public databases

After standardization and deduplication, the dataset contains **~1.17 billion unique molecules**.

Each molecule is provided with:

* `mol_id`: Source-specific molecule identifier
* `smiles`: Canonical SMILES string
* `safe`: Canonical SAFE string representation (BRICS-based fragmentation)
* `source`: Origin of the molecule (`zinc20` or `unichem`)

Due to the scale of the dataset, **streaming access is recommended** for most use cases.

## Dataset Splits

| Split      | Molecules | Proportion |
| ---------- | --------- | ---------- |
| Train      | ~933M     | 80%        |
| Validation | ~117M     | 10%        |
| Test       | ~117M     | 10%        |

## Usage Example

```python
from datasets import load_dataset

# Load dataset (streaming recommended)
dataset = load_dataset("datamol-io/safe-gpt", streaming=True)

train = dataset["train"]
val = dataset["validation"]
test = dataset["test"]
```

---

## Citation

If you use this dataset or the SAFE representation, please cite the SAFE paper:

```bibtex
@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
```
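Because the card recommends streaming, it can help to consume records lazily rather than materialize a split. The following is a minimal sketch, not part of the original card: `take_from_source` is a hypothetical helper, and the toy records are placeholders that only mimic the documented fields (`mol_id`, `smiles`, `safe`, `source`). It assumes each streamed row arrives as a dictionary keyed by those field names.

```python
from itertools import islice
from typing import Dict, Iterable, List


def take_from_source(
    records: Iterable[Dict[str, str]], source: str, n: int
) -> List[Dict[str, str]]:
    """Return the first `n` records whose `source` field equals `source`.

    The input is consumed lazily, so iteration stops as soon as `n`
    matches have been found.
    """
    matching = (record for record in records if record["source"] == source)
    return list(islice(matching, n))


# Placeholder rows (hypothetical values, not taken from the dataset).
toy_records = [
    {"mol_id": "toy-1", "smiles": "CCO", "safe": "<placeholder>", "source": "zinc20"},
    {"mol_id": "toy-2", "smiles": "CCN", "safe": "<placeholder>", "source": "unichem"},
    {"mol_id": "toy-3", "smiles": "CCC", "safe": "<placeholder>", "source": "zinc20"},
]

first_zinc = take_from_source(toy_records, "zinc20", n=1)
print([record["mol_id"] for record in first_zinc])
```

With the real dataset, the same helper could presumably be given a streamed split such as `dataset["train"]` from the usage example in place of `toy_records`.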

Provider:

datamol-io

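Original information summary

Dataset overview

License

This dataset is released under the CC BY 4.0 license.

Configuration

The default configuration (default) includes the following data files:
- Training set (train): path data/train-*
- Test set (test): path data/test-*
- Validation set (validation): path data/validation-*

Dataset information

Features

input: string
mc_labels: sequence of floats (float64)

Data splits

Training set (train):
- Bytes: 203,939,038,678
- Samples: 945,455,307
Test set (test):
- Bytes: 25,523,244,912
- Samples: 118,890,444
Validation set (validation):
- Bytes: 24,920,275,439
- Samples: 118,451,032

Dataset size

Download size: 270,730,145 bytes
Dataset size: 254,382,559,029 bytes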
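
The split figures in this summary can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers listed above; the dictionary layout and variable names are illustrative choices, not part of the listing. It sums the per-split byte counts, compares the total with the stated dataset size, and derives each split's share of the samples.

```python
# Per-split figures copied from the summary above.
splits = {
    "train": {"num_bytes": 203_939_038_678, "num_examples": 945_455_307},
    "test": {"num_bytes": 25_523_244_912, "num_examples": 118_890_444},
    "validation": {"num_bytes": 24_920_275_439, "num_examples": 118_451_032},
}
listed_dataset_size = 254_382_559_029  # bytes, as stated in the summary

total_bytes = sum(info["num_bytes"] for info in splits.values())
total_examples = sum(info["num_examples"] for info in splits.values())

# Fraction of all samples that falls into each split.
shares = {
    name: info["num_examples"] / total_examples for name, info in splits.items()
}

print(f"total bytes matches listed size: {total_bytes == listed_dataset_size}")
print(f"total samples: {total_examples:,}")
for name, share in shares.items():
    print(f"{name:<10} {share:.2%}")
```

The byte counts add up exactly to the listed dataset size, and the sample counts correspond to roughly an 80/10/10 split.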
