Data and Codes for: Cost-Efficient Repurposing of a Monolingual SMILES-Based Chemical Transformer to SELFIES
Source: DataCite Commons. Updated: 2025-05-06. Indexed: 2025-05-17.
Download link:
https://data.mendeley.com/datasets/27j2zg6f5x/2
Description:
This repository supports the manuscript “Cost-Efficient Repurposing of a Monolingual SMILES-Based Chemical Transformer to SELFIES,” providing all necessary data, models, and code for reproducing the reported experiments and figures. It includes two core datasets (SMILES_to_SELFIES.csv and Filtered_QM9.csv) for SELFIES-based finetuning and QM9 regression, along with a zip archive (selfies_finetuned_model.zip) containing the final ChemBERTa model finetuned on SELFIES.
Five Jupyter notebooks are provided: Finetuning and Figures.ipynb, QM9 regression: SELFIES FT model.ipynb, QM9 regression: ChemBERTa-77M-MLM model.ipynb, QM9 Regression: ChemBERTa-zinc-base-v1 model.ipynb, and Finetuning on Benchmark Datasets.ipynb. The last of these implements additional finetuning and performance evaluation of the SELFIES-repurposed model on three standard benchmark datasets: ESOL, FreeSolv, and Lipophilicity. Each notebook illustrates the full methodology, from data preparation through model training and evaluation.
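Before running the notebooks, it can help to confirm that the two CSV files downloaded intact. The following is a minimal sketch using only the Python standard library, not code from the repository. The helper name `summarize_csv` and the `data` directory are hypothetical, and the column layout of the files is not stated in the description, so the sketch only reports whatever header it finds.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no lines
        row_count = sum(1 for _ in reader)  # rows remaining after the header
    return header, row_count


if __name__ == "__main__":
    # File names come from the description; the "data" directory is an assumption.
    for name in ("SMILES_to_SELFIES.csv", "Filtered_QM9.csv"):
        csv_path = Path("data") / name
        if csv_path.exists():
            columns, rows = summarize_csv(csv_path)
            print(f"{name}: {rows} rows, columns = {columns}")
        else:
            print(f"{name}: not found at {csv_path}")
```

Reading the header rather than hard-coding column names keeps the check valid whichever column layout the deposited files use.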
Provider:
Mendeley Data
Created:
2025-05-02



