ejy/ZINC_goldilocks_SMILES

Hugging Face, updated 2025-11-30, indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/ejy/ZINC_goldilocks_SMILES

Resource description:

---
dataset_info:
  features:
  - name: smiles
    dtype: string
  splits:
  - name: train
    num_bytes: 8364039
    num_examples: 185896
  download_size: 4293848
  dataset_size: 8364039
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# From ZINC20 ['In-stock, Goldilocks'](https://zinc20.docking.org/tranches/home/) tranche

Steps to prepare the database:

1) Select the appropriate tranche from ZINC20
- Select 'Purch' -> 'In-stock' + 'Exclusive'
- Select 'React' -> 'Standard' + 'Exclusive'
- Select 'Predefined Subsets' -> 'Goldilocks'
- Select 'Download Format' -> 'SMILES (*.smi)'
- Select 'Download Method' -> 'Raw URLs'

2) Download and concatenate the SMILES
```bash
# Download all ZINC20 tranches from 'in-stock, goldilocks' subset
mkdir zinc
wget -i ZINC-downloader-2D-smi.uri -P zinc

# Remove first line of every file and save into txt file
for i in zinc/*; do tail -n +2 "$i" > "$i".txt; done

# Concatenate all created files into one (contains 185896 ligands)
cat zinc/*.txt > zinc_all.txt
```

3) Parse the concatenated text file into a Hugging Face dataset
```python
from datasets import load_dataset

dataset = load_dataset('text', data_files='zinc_all.txt')

# Split SMILES from ZINC_id and store only SMILES
def split_text(example):
    split_item = example["text"].split()
    return {"smiles": split_item[0]}

dataset = dataset.map(split_text)
dataset = dataset.remove_columns("text")
```

4) Compute ligand embeddings
```python
from sentence_transformers import SentenceTransformer
import numpy as np

def calculate_embeddings(smiles, model_name, max_length):
    model = SentenceTransformer(model_name, trust_remote_code=True)
    model.max_seq_length = max_length
    embeddings = model.encode(smiles, show_progress_bar=True)

    # Store each SMILES string next to its embedding
    keys = list(smiles)
    # Single quotes inside the f-string keep this valid before Python 3.12
    np.savez(f"{model_name.replace('/', '_')}_embeddings.npz",
             keys=np.array(keys),
             embs=np.asarray(embeddings, dtype=np.float32))

calculate_embeddings(dataset['train'].unique("smiles"),
                     "ibm-research/MoLFormer-XL-both-10pct", 118)
```
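Steps 2 and 3 can also be sketched in plain Python, which may help when checking the result against the 185896 ligands quoted above. This is a minimal sketch, not the author's method: the helper name `read_smiles` is hypothetical, and it assumes the downloaded files end in `.smi`, start with one header line, and hold one whitespace-separated `SMILES ZINC_id` pair per row.

```python
import tempfile
from pathlib import Path


def read_smiles(smi_dir):
    """Collect the SMILES column from every *.smi file in smi_dir.

    Hypothetical helper. Assumes each file starts with one header line and
    each following row is 'SMILES ZINC_id' separated by whitespace.
    """
    smiles = []
    for path in sorted(Path(smi_dir).glob("*.smi")):
        lines = path.read_text().splitlines()
        for line in lines[1:]:      # skip the header, like `tail -n +2`
            fields = line.split()
            if fields:              # ignore blank lines
                smiles.append(fields[0])
    return smiles


if __name__ == "__main__":
    # Toy file with placeholder IDs, only to show the expected layout.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "toy.smi").write_text(
            "smiles zinc_id\nCCO PLACEHOLDER_ID_1\nc1ccccc1 PLACEHOLDER_ID_2\n"
        )
        print(len(read_smiles(tmp)))
```

Calling `len(read_smiles("zinc"))` on the real download directory should then give the ligand count before any deduplication.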

Provider:

ejy
