
alessandronascimento/zinc20_chembl36

Hugging Face | Updated 2026-02-18 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/alessandronascimento/zinc20_chembl36
Resource description:
---
tags:
- SMILES
- chemistry
- SELFIES
- ZINC
- CHEMBL
size_categories:
- 10M<n<100M
---

# Dataset Card for ZINC20_Chembl36

This dataset contains about 13.6M molecules from ZINC20 and Chembl36. The molecules are provided as SMILES and SELFIES strings, as well as token sequences produced by the MolGen tokenizer.

### ZINC20

The *In Stock* molecules from the ZINC20 2D tranches were selected and downloaded. The tranches were concatenated and converted to a dataset. Afterwards, the selfies library was used to convert SMILES to SELFIES.

### Chembl36

Chembl36 molecules were downloaded as an SDF file. RDKit was used to generate SMILES from the SDF file, and the selfies library was used to convert the molecules to SELFIES.

## Dataset Details

- **Curated by:** Alessandro S. Nascimento
- **License:** MIT

### Dataset Sources

- **Web pages:** [ZINC20](https://zinc.docking.org/) and [Chembl](https://www.ebi.ac.uk/chembl/)
- **Tokenizer used:** zjunlp/MolGen-large

## Dataset Split Strategy

The train/test/validation splits were generated randomly using an 80/10/10 ratio.
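The random 80/10/10 split described above can be illustrated with a minimal sketch. This is not the curator's actual script: the `random_split` helper, the seed, and the toy SMILES list are hypothetical and use only the Python standard library.

```python
import random


def random_split(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and cut them into train/test/validation lists.

    Hypothetical helper for illustration; the dataset card does not say
    which tool or seed was used for the real split.
    """
    shuffled = list(items)
    # A seeded generator keeps the shuffle reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_test = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    # Whatever remains after the first two cuts becomes the validation set.
    validation = shuffled[n_train + n_test:]
    return train, test, validation


# Toy SMILES strings standing in for the ~13.6M molecules.
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "C1CCCCC1",
          "CO", "CC", "CCC", "C=C", "C#N"]
train, test, validation = random_split(smiles)
print(len(train), len(test), len(validation))
```

Assigning the remainder to the validation set means that no molecule is dropped when the total count is not divisible by ten.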

Provider:
alessandronascimento