Data and Codes for: Cost-Efficient Repurposing of a Monolingual SMILES-Based Chemical Transformer to SELFIES

Name: Data and Codes for: Cost-Efficient Repurposing of a Monolingual SMILES-Based Chemical Transformer to SELFIES
Creator: Mendeley Data
Published: 2025-05-06 05:13:08
License: No description available

DataCite Commons. Updated: 2025-05-06. Indexed: 2025-05-17.

Download link:

https://data.mendeley.com/datasets/27j2zg6f5x/2

Description:

This repository supports the manuscript “Cost-Efficient Repurposing of a Monolingual SMILES-Based Chemical Transformer to SELFIES,” providing all necessary data, models, and code for reproducing the reported experiments and figures. It includes two core datasets (SMILES_to_SELFIES.csv and Filtered_QM9.csv) for SELFIES-based finetuning and QM9 regression, along with a zip archive (selfies_finetuned_model.zip) containing the final ChemBERTa model finetuned on SELFIES. Five Jupyter notebooks are provided: Finetuning and Figures.ipynb, QM9 regression: SELFIES FT model.ipynb, QM9 regression: ChemBERTa-77M-MLM model.ipynb, QM9 Regression: ChemBERTa-zinc-base-v1 model.ipynb, and Finetuning on Benchmark Datasets.ipynb. The last of these implements additional finetuning and performance evaluation of the SELFIES-repurposed model on three standard benchmark datasets: ESOL, FreeSolv, and Lipophilicity. Each notebook illustrates the full methodology, from data preparation through model training and evaluation.
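Before opening the notebooks, it can help to check what the downloaded files actually contain. The following is a minimal sketch using only the Python standard library, assuming the files named in the description have been downloaded into the current working directory. The helper names `peek_csv` and `extract_archive` and the output folder `selfies_finetuned_model` are hypothetical, chosen for illustration. No column names are assumed, since the description does not list them.

```python
import csv
import zipfile
from pathlib import Path


def peek_csv(path, n_rows=5):
    """Return (header, first n_rows data rows) of a CSV file.

    Hypothetical helper: it reads the header as-is and makes no
    assumption about which columns the file contains.
    """
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader, [])
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


def extract_archive(zip_path, dest_dir):
    """Extract a zip archive into dest_dir and return its sorted member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())


if __name__ == "__main__":
    # File names are taken from the dataset description; their location
    # in the current directory is an assumption.
    for name in ("SMILES_to_SELFIES.csv", "Filtered_QM9.csv"):
        if Path(name).exists():
            header, rows = peek_csv(name)
            print(name, "columns:", header)
            for row in rows:
                print("  ", row)

    archive = Path("selfies_finetuned_model.zip")
    if archive.exists():
        # The destination folder name is an illustrative choice.
        members = extract_archive(archive, "selfies_finetuned_model")
        print("Extracted", len(members), "entries")
```

Printing the header first shows which columns hold the SMILES strings, the SELFIES strings, and the regression targets, so those names can be matched against what the notebooks expect.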

Provider:

Mendeley Data

Created:

2025-05-02
