Molecules used to train or generated by chemical language models

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://zenodo.org/record/8321734

Description:

This upload contains training datasets or generated molecules from the paper “Invalid SMILES are helpful, not harmful, for chemical language models.” The contents of the directories are as follows:

  training_sets: sets of molecules from ChEMBL or GDB-13 used to train chemical language models, represented either as SMILES or SELFIES
  sampled-*: unprocessed samples of 10 million molecules from each model trained on ChEMBL or GDB-13
  prior_inputs: sets of molecules from LOTUS, COCONUT, FooDB and NORMAN, split into ten folds and used to train chemical language models
  priors-*: samples of 100 million molecules from chemical language models trained on each cross-validation fold, with unique molecules represented as canonical SMILES and sorted in descending order by their sampling frequency
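The priors-* files are described as unique canonical SMILES sorted in descending order by sampling frequency. A minimal sketch of that kind of tabulation follows, for readers who want to reproduce it from raw samples. The function name `tabulate_samples` and the `canonicalize` hook are hypothetical and not taken from the upload. The default hook is an identity function; in practice a cheminformatics toolkit would be plugged in to produce canonical SMILES and to return `None` for strings that cannot be parsed. The exact processing used by the authors is not stated here, so this is an illustration only.

```python
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple


def tabulate_samples(
    smiles: Iterable[str],
    canonicalize: Callable[[str], Optional[str]] = lambda s: s,
) -> List[Tuple[str, int]]:
    """Count unique molecules and sort them by descending sampling frequency.

    `canonicalize` is a hypothetical hook: it should map a SMILES string to
    its canonical form, or return None if the string is invalid. The default
    is the identity function, which assumes inputs are already canonical.
    """
    counts: Counter = Counter()
    for raw in smiles:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        canonical = canonicalize(raw)
        if canonical is None:
            continue  # drop strings the hook could not parse
        counts[canonical] += 1
    # Sort by descending count; break ties alphabetically for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy input standing in for a file of sampled molecules (illustrative only).
samples = ["CCO", "c1ccccc1", "CCO", "CC(=O)O", "CCO", "c1ccccc1"]
table = tabulate_samples(samples)
for molecule, frequency in table:
    print(molecule, frequency)
```

For samples on the scale of 100 million molecules, the same logic would likely need to stream from disk rather than hold a list in memory; the `Iterable` input type allows passing an open file handle directly.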

Created:

2024-02-19
