ejy/ZINC_goldilocks_SMILES
Source: Hugging Face. Updated 2025-11-30; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/ejy/ZINC_goldilocks_SMILES
Resource description:
---
dataset_info:
  features:
  - name: smiles
    dtype: string
  splits:
  - name: train
    num_bytes: 8364039
    num_examples: 185896
  download_size: 4293848
  dataset_size: 8364039
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# From ZINC20 ['In-stock, Goldilocks'](https://zinc20.docking.org/tranches/home/) tranche
Steps to prepare the database:
1) Select the appropriate tranche from ZINC20
- Select 'Purch' -> 'In-stock' + 'Exclusive'
- Select 'React' -> 'Standard' + 'Exclusive'
- Select 'Predefined Subsets' -> 'Goldilocks'
- Select 'Download Format' -> 'SMILES (*.smi)'
- Select 'Download Method' -> 'Raw URLs'
2) Download and concatenate the SMILES
```bash
# Download all ZINC20 tranches from 'in-stock, goldilocks' subset
mkdir zinc
wget -i ZINC-downloader-2D-smi.uri -P zinc
# Remove first line of every file and save into txt file
for i in zinc/*; do tail -n +2 "$i" > "$i".txt; done
# Concatenate all created files into one (contains 185896 ligands)
cat zinc/*.txt > zinc_all.txt
```
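For environments without `tail` and `cat`, the same header stripping and concatenation can be sketched in Python. This is a minimal sketch, assuming each downloaded file begins with exactly one header line, as the shell loop above implies. The function name `concat_smiles_files` and the default `*.smi` glob pattern are illustrative assumptions, not part of the original workflow.

```python
from pathlib import Path


def concat_smiles_files(src_dir, out_path, pattern="*.smi"):
    """Concatenate tranche files, skipping the first (header) line of each.

    Returns the number of data lines written. The "*.smi" pattern is an
    assumption about how the downloaded files are named.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).glob(pattern)):
            with open(path, encoding="utf-8") as handle:
                next(handle, None)  # drop the header line, if the file has one
                for line in handle:
                    if line.strip():  # ignore blank lines
                        out.write(line.rstrip("\n") + "\n")
                        count += 1
    return count
```

Calling `concat_smiles_files("zinc", "zinc_all.txt")` returns the number of ligand lines written, which can be compared against the 185896 ligands reported above.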
3) Parse the concatenated text file into a Hugging Face dataset
```python
from datasets import load_dataset

# Each line of zinc_all.txt becomes one record with a single "text" field
dataset = load_dataset('text', data_files='zinc_all.txt')

# Split SMILES from ZINC_id and store only SMILES
def split_text(example):
    split_item = example["text"].split()
    return {"smiles": split_item[0]}

dataset = dataset.map(split_text)
dataset = dataset.remove_columns("text")
```
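To illustrate what the mapping step keeps, here is a minimal sketch of the same split applied to a single line, assuming each line has the form `<SMILES> <ZINC_id>` separated by whitespace. The helper name `extract_smiles` and the sample identifier are hypothetical.

```python
def extract_smiles(line):
    """Return the first whitespace-separated token of a line (the SMILES)."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line: no SMILES to extract")
    return tokens[0]


# Hypothetical example line; the identifier is made up for illustration.
sample = "CCO ZINC000000000001"
print(extract_smiles(sample))  # prints: CCO
```

Unlike the mapping function above, this sketch raises an explicit `ValueError` on an empty line instead of failing with an `IndexError`.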
4) Compute ligand embeddings
```python
from sentence_transformers import SentenceTransformer
import numpy as np

def calculate_embeddings(smiles, model_name, max_length):
    # trust_remote_code=True runs the model code shipped with the checkpoint
    model = SentenceTransformer(model_name, trust_remote_code=True)
    model.max_seq_length = max_length
    embeddings = model.encode(smiles, show_progress_bar=True)
    keys = list(smiles)
    # Build the file name first: nested double quotes in an f-string need Python 3.12+
    safe_name = model_name.replace("/", "_")
    np.savez(f"{safe_name}_embeddings.npz", keys=np.array(keys), embs=np.asarray(embeddings, dtype=np.float32))

# Embed each unique SMILES once
calculate_embeddings(dataset['train'].unique("smiles"), "ibm-research/MoLFormer-XL-both-10pct", 118)
```
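The saved archive can be read back for lookups. The following is a minimal sketch, assuming the `.npz` file contains the `keys` and `embs` arrays written in step 4; the helper name `load_embeddings` is hypothetical.

```python
import numpy as np


def load_embeddings(path):
    """Load an .npz file with `keys` and `embs` arrays into a dict."""
    with np.load(path) as data:
        keys = data["keys"]
        embs = data["embs"]
    if len(keys) != len(embs):
        raise ValueError("keys and embs have different lengths")
    # Map each SMILES string to its embedding row
    return {str(k): v for k, v in zip(keys, embs)}
```

With the model name used in step 4, the file would be `ibm-research_MoLFormer-XL-both-10pct_embeddings.npz`, and `load_embeddings(...)[smiles]` returns the `float32` vector stored for that SMILES string.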
Provider:
ejy



