alessandronascimento/zinc20_chembl36
Source: Hugging Face. Updated 2026-02-18; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/alessandronascimento/zinc20_chembl36
Description:
---
tags:
- SMILES
- chemistry
- SELFIES
- ZINC
- CHEMBL
size_categories:
- 10M<n<100M
---
# Dataset Card for ZINC20_Chembl36
This dataset contains about 13.6M molecules from ZINC20 and Chembl36. Each molecule is provided as a SMILES string, as a SELFIES string, and as tokens produced by the MolGen tokenizer.
### ZINC20
The *In Stock* molecules from the ZINC20 2D tranches were selected and downloaded. The tranches were concatenated and converted to a dataset, and the selfies library was then used to convert the SMILES to SELFIES.
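The SMILES-to-SELFIES step described above can be sketched as follows. This is a minimal illustration, not the curator's actual script: the function name `smiles_to_selfies`, the injectable `encoder` parameter, and the choice to record `None` for molecules that fail to convert are all assumptions. The only library call used is `selfies.encoder`, imported lazily so the function can also be exercised with a stand-in encoder.

```python
from typing import Callable, Iterable, List, Optional


def smiles_to_selfies(
    smiles: Iterable[str],
    encoder: Optional[Callable[[str], str]] = None,
) -> List[Optional[str]]:
    """Convert SMILES strings to SELFIES, keeping None for failures.

    `encoder` is a hypothetical hook; by default the selfies library is used.
    """
    if encoder is None:
        import selfies as sf  # third-party package: pip install selfies

        encoder = sf.encoder

    converted: List[Optional[str]] = []
    for smi in smiles:
        try:
            converted.append(encoder(smi))
        except Exception:
            # selfies raises EncoderError for SMILES it cannot represent;
            # this sketch keeps a None placeholder instead of stopping.
            converted.append(None)
    return converted
```

With the selfies package installed, `smiles_to_selfies(["CCO", "c1ccccc1"])` returns one SELFIES string per input. Whether failed molecules were dropped or kept in the published dataset is not stated in this card.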
### Chembl36
Chembl36 molecules were downloaded as an SDF file. RDKit was used to generate SMILES from the SDF file, and the selfies library was used to convert the molecules to SELFIES.
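Reading an SDF file and writing out SMILES with RDKit might look like the sketch below. The function name `sdf_to_smiles` and the decision to skip unparsable records are assumptions, and the card does not say which SMILES options (canonical, isomeric) were used, so the RDKit defaults are shown.

```python
from typing import Iterator


def sdf_to_smiles(sdf_path: str) -> Iterator[str]:
    """Yield one SMILES string per readable molecule in an SDF file."""
    from rdkit import Chem  # third-party package: pip install rdkit

    supplier = Chem.SDMolSupplier(sdf_path)
    for mol in supplier:
        if mol is None:
            # RDKit returns None for records it cannot parse; skip them here.
            continue
        yield Chem.MolToSmiles(mol)
```

Because the function is a generator, a large SDF file can be streamed record by record instead of being loaded into memory at once.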
## Dataset Details
- **Curated by:** Alessandro S. Nascimento
- **License:** MIT
### Dataset Sources
- **Web pages:** [ZINC20](https://zinc.docking.org/) and [Chembl](https://www.ebi.ac.uk/chembl/)
- **Tokenizer used:** zjunlp/MolGen-large
## Dataset Split Strategy
The train/test/validation splits were generated randomly using an 80/10/10 ratio.
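A random 80/10/10 split of this kind can be sketched with the standard library alone. The helper name `random_split`, the `seed` value, and the order in which the parts are cut are assumptions; the card does not state which tool or seed produced the published splits.

```python
import random
from typing import Dict, Iterable, List


def random_split(
    items: Iterable,
    train_frac: float = 0.8,
    test_frac: float = 0.1,
    seed: int = 0,
) -> Dict[str, List]:
    """Shuffle a copy of `items` and cut it into train/test/validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "train": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        # Validation takes whatever remains, so no item is lost to rounding.
        "validation": shuffled[n_train + n_test:],
    }


splits = random_split(range(100))
print({name: len(part) for name, part in splits.items()})
```

Every item lands in exactly one split, and fixing the seed makes the assignment repeatable across runs.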
Provider:
alessandronascimento



