
datamol-io/safe-gpt

Hugging Face · Updated 2026-01-12 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/datamol-io/safe-gpt
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- chemistry
- molecules
- smiles
- safe
- drug-discovery
size_categories:
- 1B<n<10B
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train/*.parquet
  - split: validation
    path: data/validation/*.parquet
  - split: test
    path: data/test/*.parquet
---

# SAFE Molecules Dataset (v2)

A large-scale molecular dataset containing approximately **1.17 billion unique molecules**, each represented with both **canonical SMILES** and **SAFE (Sequential Attachment-based Fragment Embedding)** strings.

This dataset is intended to support **large-scale pretraining and evaluation of chemical language models**, including generative, conditional, and structure-aware modeling tasks.

> **Note**
> This is **version 2** of the SAFE dataset. The original v1 release contained invalid SAFE strings and is archived for reproducibility at
> [https://huggingface.co/datasets/datamol-io/safe-gpt/tree/b83175cd7394](https://huggingface.co/datasets/datamol-io/safe-gpt/tree/b83175cd7394)

## SAFE Representation

SAFE (Sequential Attachment-based Fragment Embedding) is a **fragment-based molecular string representation** that encodes molecules as **sequences of chemically meaningful fragments together with their attachment structure**.

In SAFE, molecules are decomposed into fragments using rule-based fragmentation, and the resulting fragments are arranged into a **deterministic sequence** that explicitly represents how fragments are connected. The representation is **fully reversible**, allowing exact reconstruction of the original molecular graph.

By operating at the **fragment level** rather than the atom level (as in SMILES), SAFE reduces syntactic fragility and naturally supports both **unconstrained molecular generation** and **structure-constrained tasks** (e.g., scaffold or fragment conditioning) using standard sequence models.

Additional resources:

* **SAFE GitHub repository**: [https://github.com/datamol-io/safe](https://github.com/datamol-io/safe)
* **SAFE-based models on Hugging Face**:
  * [SAFE-GPT 87M](https://huggingface.co/datamol-io/safe-gpt)
  * [NovoMolGen 32M-BPE](https://huggingface.co/bisectgroup/NovoMolGen_32M_SAFE_BPE)
  * [NVIDIA's GenMol 89M](https://huggingface.co/nvidia/NV-GenMol-89M-v2)

## Dataset Description

The dataset aggregates molecules from two major public chemical resources:

* **ZINC20**: ~1.0 billion commercially available, purchasable compounds
* **UniChem**: ~188 million compounds aggregated from multiple public databases

After standardization and deduplication, the dataset contains **~1.17 billion unique molecules**.

Each molecule is provided with:

* `mol_id`: Source-specific molecule identifier
* `smiles`: Canonical SMILES string
* `safe`: Canonical SAFE string representation (BRICS-based fragmentation)
* `source`: Origin of the molecule (`zinc20` or `unichem`)

Due to the scale of the dataset, **streaming access is recommended** for most use cases.

## Dataset Splits

| Split      | Molecules | Proportion |
| ---------- | --------- | ---------- |
| Train      | ~933M     | 80%        |
| Validation | ~117M     | 10%        |
| Test       | ~117M     | 10%        |

## Usage Example

```python
from datasets import load_dataset

# Load dataset (streaming recommended)
dataset = load_dataset("datamol-io/safe-gpt", streaming=True)

train = dataset["train"]
val = dataset["validation"]
test = dataset["test"]
```

---

## Citation

If you use this dataset or the SAFE representation, please cite the SAFE paper:

```bibtex
@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
```
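Because streaming access is recommended, downstream code usually works on a lazy stream of records rather than on an in-memory table. The sketch below shows one way to filter such a stream by the documented `source` field and take a bounded sample. It assumes each record is a plain dictionary with the four fields listed under Dataset Description; the helper names `filter_by_source` and `take` are hypothetical, and the rows in `mock_stream` are illustrative placeholders, not real dataset entries.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]


def filter_by_source(records: Iterable[Record], source: str) -> Iterator[Record]:
    """Lazily yield only the records whose `source` field equals `source`."""
    for record in records:
        if record["source"] == source:
            yield record


def take(records: Iterable[Record], n: int) -> List[Record]:
    """Materialize at most `n` records; the rest of the stream is not consumed."""
    return list(islice(records, n))


# Hypothetical stand-in rows that only mimic the documented schema.
# The `safe` values are placeholders, not valid SAFE strings.
mock_stream = [
    {"mol_id": "example-1", "smiles": "CCO", "safe": "<safe-string-1>", "source": "zinc20"},
    {"mol_id": "example-2", "smiles": "CCN", "safe": "<safe-string-2>", "source": "unichem"},
    {"mol_id": "example-3", "smiles": "CCC", "safe": "<safe-string-3>", "source": "zinc20"},
    {"mol_id": "example-4", "smiles": "CCCl", "safe": "<safe-string-4>", "source": "zinc20"},
]

# Keep the first two ZINC20 records and print their identifiers and SMILES.
zinc_sample = take(filter_by_source(mock_stream, "zinc20"), 2)
for row in zinc_sample:
    print(row["mol_id"], row["smiles"])
```

The same two helpers should accept a streamed split such as `train` from the usage example in place of `mock_stream`, since they only rely on iteration and dictionary-style field access.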
Provider:
datamol-io
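Original Information Summary

Dataset Overview

License

Configuration

  • The default configuration (default) contains the following data files:
    • Training set (train): path data/train-*
    • Test set (test): path data/test-*
    • Validation set (validation): path data/validation-*

Dataset Information

Features

  • input: data type string
  • mc_labels: data type sequence of floats (float64)

Data Splits

  • Training set (train):
    • Bytes: 203,939,038,678
    • Samples: 945,455,307
  • Test set (test):
    • Bytes: 25,523,244,912
    • Samples: 118,890,444
  • Validation set (validation):
    • Bytes: 24,920,275,439
    • Samples: 118,451,032

Dataset Size

  • Download size: 270,730,145 bytes
  • Dataset size: 254,382,559,029 bytes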
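As a quick sanity check on the split figures in this summary, the proportions can be recomputed from the listed sample counts. The snippet below is a minimal sketch that uses only the three counts above; the variable names are illustrative. Note that these counts are the ones reported in this summary, which are not identical to the rounded figures (~933M / ~117M / ~117M) given in the dataset card.

```python
# Sample counts copied from the summary above.
split_counts = {
    "train": 945_455_307,
    "test": 118_890_444,
    "validation": 118_451_032,
}

# Total number of samples across all three splits.
total = sum(split_counts.values())

# Fraction of the total that each split represents.
fractions = {name: count / total for name, count in split_counts.items()}

print(f"total samples: {total:,}")
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2%}")
```

Under these counts the splits come out close to the nominal 80% / 10% / 10% proportions, each within about a tenth of a percentage point.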