Data and codes for RoBERTa-base repurposing to SELFIES chemical notation
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/3c27p5pzts
Description:
This repository contains four Jupyter/Colab notebooks that reproduce the analyses reported in the manuscript. The Stage 1 notebook (Domain Adaptation) uses masked language modeling to adapt RoBERTa-base to a corpus of SELFIES strings, aligning the encoder with SELFIES grammar without introducing a new vocabulary. The Stage 2 notebook (Finetuning) applies compact supervision on seven QM9 quantum-chemical properties, shaping the embedding space toward structure–property relations. The REFPROP notebook generates thermophysical and transport property grids for 88 fluids from NIST REFPROP, producing the 108k vapor-phase state points used for further evaluation. The fourth notebook (Other Codes) contains supporting scripts for embedding extraction and mean-pooling, chemotype clustering, silhouette analysis, Mantel correlation testing, and multi-output regression with (T, P), RDKit descriptors, MoLFormer-XL-10pct, and SELFIES–QM9 embeddings. Together, the notebooks enable reproduction of the training pipeline, property extraction, and evaluation protocols described in the study.
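The mean-pooling step mentioned above can be illustrated with a minimal sketch. It is not taken from the notebooks: it assumes that pooling averages the per-token encoder outputs over non-padding positions only, and the names `mean_pool`, `tokens`, and `mask` are hypothetical. Plain Python lists are used to keep the example dependency-free; a real pipeline would more likely do this on tensors.

```python
from typing import List, Sequence


def mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average token vectors over positions where attention_mask is nonzero.

    Assumption: padding positions carry mask value 0 and are excluded,
    so that sequence length does not bias the pooled vector.
    """
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    # Keep only the vectors of real (non-padding) tokens.
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    # Component-wise mean across the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: three real tokens and one padding position, hidden size 2.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [3.0, 4.0]
```

The padding vector `[9.0, 9.0]` does not affect the result, which is the point of masking before averaging: one fixed-size vector per molecule, independent of how much padding the batch added.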
Created:
2025-10-07



