Data and codes for RoBERTa-base repurposing to SELFIES chemical notation

NIAID Data Ecosystem, indexed 2026-05-10
Download link:
https://data.mendeley.com/datasets/3c27p5pzts
Description:
This repository contains four Jupyter/Colab notebooks that reproduce the analyses reported in the manuscript. The Stage 1 notebook (Domain Adaptation) uses masked language modeling to adapt RoBERTa-base to a corpus of SELFIES strings, aligning the encoder with SELFIES grammar without introducing a new vocabulary. The Stage 2 notebook (Finetuning) applies compact supervision on seven QM9 quantum-chemical properties, shaping the embedding space toward structure–property relations. The REFPROP notebook generates thermophysical and transport property grids for 88 fluids from NIST REFPROP, producing the 108k vapor-phase state points used for further evaluation. The fourth notebook (Other Codes) contains supporting scripts for embedding extraction and mean-pooling, chemotype clustering, silhouette analysis, Mantel correlation testing, and multi-output regression with (T,P), RDKit descriptors, MoLFormer-XL-10pct, and SELFIES–QM9 embeddings. Together, the notebooks enable reproduction of the training pipeline, property extraction, and evaluation protocols described in the study.
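Two of the supporting steps named above, mean-pooling of token embeddings and Mantel correlation testing, can be illustrated with a minimal NumPy sketch. This is not the repository's code: the function names (`masked_mean_pool`, `pairwise_euclidean`, `mantel_test`), the Euclidean distance, the one-sided permutation scheme, and the default of 999 permutations are all assumptions chosen for illustration, and the notebooks may differ on each point.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = np.asarray(attention_mask)[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for sequences that are entirely padding.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


def pairwise_euclidean(x):
    """Dense Euclidean distance matrix for the rows of x (assumed metric)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def mantel_test(d1, d2, n_permutations=999, seed=0):
    """One-sided permutation Mantel test between two square distance matrices.

    Returns the Pearson correlation of the upper triangles and a p-value
    estimated by jointly permuting the rows and columns of d1.
    """
    rng = np.random.default_rng(seed)
    n = d1.shape[0]
    upper = np.triu_indices(n, k=1)
    r_obs = np.corrcoef(d1[upper], d2[upper])[0, 1]
    hits = 0
    for _ in range(n_permutations):
        p = rng.permutation(n)
        permuted = d1[p][:, p]  # relabel the objects in d1
        if np.corrcoef(permuted[upper], d2[upper])[0, 1] >= r_obs:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return r_obs, (hits + 1) / (n_permutations + 1)
```

In a setting like the one described, the pooled vectors would stand in for molecule-level embeddings, and the Mantel test would compare their distance matrix against a distance matrix built from another representation or from property values. Which pairs of matrices the study actually compares is not stated here.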
Created:
2025-10-07