sagawa/ZINC-canonicalized
Hugging Face | Updated 2022-09-04 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/sagawa/ZINC-canonicalized
Resource description:
---
annotations_creators: []
language: []
language_creators:
- expert-generated
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: canonicalized ZINC
size_categories:
- 10M<n<100M
source_datasets:
- original
tags:
- ZINC
- chemical
- SMILES
task_categories: []
task_ids: []
---
### Dataset description
We downloaded the ZINC dataset from [ZINC15](https://zinc15.docking.org/) and canonicalized its SMILES strings.
Canonicalization used the following function; SMILES that RDKit could not read were removed.
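```python
from rdkit import Chem

def canonicalize(mol):
    # `mol` is a SMILES string; MolFromSmiles returns None if RDKit cannot parse it.
    # The second argument (isomericSmiles=True) keeps stereochemistry in the output.
    mol = Chem.MolToSmiles(Chem.MolFromSmiles(mol), True)
    return mol
```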
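The card does not show how unreadable SMILES were removed. The following is a minimal sketch of one way to do it, assuming that "cannot be read" means `Chem.MolFromSmiles` returned `None`. The names `rdkit_canonicalize` and `canonicalize_and_filter` are hypothetical and not part of the original preprocessing code. The canonicalizer is passed in as a parameter so the filtering logic can be exercised without RDKit installed.

```python
from typing import Callable, Iterable, List, Optional


def rdkit_canonicalize(smiles: str) -> Optional[str]:
    """Return the canonical SMILES, or None if RDKit cannot parse the input."""
    # Imported lazily so the filtering helper below does not itself require RDKit.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)  # None for unparsable SMILES
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, True)


def canonicalize_and_filter(
    smiles_list: Iterable[str],
    canonicalize_fn: Callable[[str], Optional[str]] = rdkit_canonicalize,
) -> List[str]:
    """Canonicalize each SMILES and drop the ones that could not be read."""
    cleaned = []
    for smiles in smiles_list:
        canonical = canonicalize_fn(smiles)
        if canonical is not None:  # keep only successfully parsed entries
            cleaned.append(canonical)
    return cleaned
```

With RDKit installed, `canonicalize_and_filter(["OCC", "not_a_smiles"])` would keep only the first entry, in its canonical form.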
We randomly split the preprocessed data into training and validation sets at a ratio of 9:1.
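The card does not give the split procedure or the random seed. A 9:1 random split could be sketched as below, using only the standard library. The function name `random_split`, its parameters, and the seed value are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def random_split(
    items: Sequence[str], train_parts: int = 9, valid_parts: int = 1, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of `items` and split it into (train, validation)."""
    shuffled = list(items)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumed seed)
    # Integer arithmetic avoids floating-point rounding in the split size.
    n_valid = len(shuffled) * valid_parts // (train_parts + valid_parts)
    return shuffled[n_valid:], shuffled[:n_valid]
```

For 100 input strings, this yields 90 training and 10 validation entries, and the same seed reproduces the same split.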
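Provider:
sagawa
Summary of original information
Dataset overview
Basic information
- Name: canonicalized ZINC
- Language: monolingual (no specific language tag)
- Source data: original
- License: Apache-2.0
- Size: 10M<n<100M
- Tags: ZINC, chemical, SMILES
Data processing
- Download source: ZINC dataset
- Processing method: SMILES were canonicalized with RDKit, and SMILES that RDKit could not read were removed.
Data split
- Split method: random split
- Split ratio: training to validation, 9:1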



