introvoyz041/PubChem10M_SMILES_SELFIES
Source: Hugging Face. Updated 2026-01-05; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/introvoyz041/PubChem10M_SMILES_SELFIES
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: selfies
    dtype: string
  splits:
  - name: train
    num_bytes: 2484550079
    num_examples: 9999225
  download_size: 847657891
  dataset_size: 2484550079
---
# A Dataset of ~10M molecules, converted from SMILES to SELFIES
- PubChem10M subset available from [DeepChemData](https://deepchemdata.s3-us-west-1.amazonaws.com/index.html)
- Using the [Self-Referencing Embedded Strings (SELFIES)](https://github.com/aspuru-guzik-group/selfies) molecular representation
Converted using the following code snippet:
```python
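from datasets import load_dataset
import selfies

# Load the PubChem10M SMILES file, one molecule per line, into a "text" column.
dataset = load_dataset("text", data_files="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip")

def smiles_to_selfies(example):
    """Encode one SMILES string as SELFIES; return None if encoding fails."""
    try:
        return {"selfies": selfies.encoder(example["text"])}
    except selfies.EncoderError:
        return {"selfies": None}

# Add the "selfies" column, then drop molecules that could not be encoded.
dataset = dataset.map(smiles_to_selfies)
dataset = dataset.filter(lambda example: example["selfies"] is not None)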
```
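Each entry in the `selfies` column is a sequence of bracketed symbols, so a common first step before modeling is to split it into tokens. The following is a minimal sketch of that step using only the standard library. The function name `split_selfies_tokens` and the input strings are illustrative assumptions, not part of this dataset or of the `selfies` package; the regex is a simplified stand-in for the library's own `selfies.split_selfies` helper, which is the safer choice in practice.

```python
import re

# Assumption: a SELFIES string is a run of bracketed symbols such as "[C]"
# or "[=C]", optionally separated by "." between disconnected fragments.
_SELFIES_TOKEN = re.compile(r"\[[^\[\]]+\]|\.")


def split_selfies_tokens(selfies_string):
    """Split a SELFIES string into its symbols (hypothetical helper)."""
    tokens = _SELFIES_TOKEN.findall(selfies_string)
    # If the tokens do not rebuild the input exactly, some characters
    # fell outside the assumed grammar, so report the string as malformed.
    if "".join(tokens) != selfies_string:
        raise ValueError(f"not a well-formed SELFIES string: {selfies_string!r}")
    return tokens


# Hand-written illustrative input, not a row taken from the dataset.
tokens = split_selfies_tokens("[C][=C][F]")
print(tokens)
```

The same function could be passed to `dataset.map` to add a token-list column, in the same way the conversion snippet adds the `selfies` column.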
Provider:
introvoyz041



