Large Language Model Predicting the Corrosion Inhibition Efficiency Based on Text Embedding for Small Dataset
Source: Zenodo. Updated 2026-05-15; indexed 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.20193233
Description:
Update v1.1: This version represents a major update to the experimental framework, ensuring full reproducibility of the manuscript's results. Key updates include the integration of pre-computed LLM embeddings, hybrid GNN model evaluation, and comprehensive performance comparison scripts.
LLM-Corrosion-Inhibition
Official repository for the paper: "Large Language Model Predicting the Corrosion Inhibition Efficiency Based on Text Embedding for Small Dataset"
This project provides a framework for predicting corrosion inhibition efficiency using text embeddings from Large Language Models (LLMs) combined with gradient boosting and graph-based models.
Performance Comparison
Our proposed LLM-based approach (Group E) achieves the highest R² and Pearson correlation and the lowest MAE and RMSE among the traditional machine learning and deep learning models compared:
| Model | R² | MAE | RMSE | Pearson Rho |
| --- | --- | --- | --- | --- |
| Proposed (LLM-based E) | 0.5341 | 49.0238 | 62.0296 | 0.7625 |
| GNN-based (Hybrid) | 0.41 | 55.0438 | 66.0307 | 0.6999 |
| XGBoost (Baseline A) | 0.3229 | 65.4353 | 74.7848 | 0.5725 |
| RF (Optimized) | 0.3153 | 64.0658 | 75.1997 | 0.5642 |
| SVM (Optimized) | 0.2027 | 68.5637 | 81.1471 | 0.4940 |
| MLP (Optimized) | 0.2703 | 68.9599 | 77.6349 | 0.5425 |
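For readers who want to recompute the four reported metrics on their own predictions, the following is a minimal sketch using only the Python standard library. The function name `regression_metrics` and the sample values are illustrative assumptions; the repository's scripts may compute these quantities differently (for example with scikit-learn or SciPy).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, RMSE and the Pearson correlation for two sequences.

    Hypothetical helper, not taken from the repository. It assumes that
    neither sequence is constant (otherwise a denominator is zero).
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n

    # Residual and total sums of squares
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot

    # Pearson correlation: covariance over the product of the spreads
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    pearson = cov / math.sqrt(ss_tot * var_pred)

    return {"r2": r2, "mae": mae, "rmse": rmse, "pearson": pearson}


# Illustrative values only; these are not data from the paper.
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
print(metrics)
```

Note that R² and the Pearson correlation measure different things: R² penalizes systematic offsets in the predictions, while the correlation only reflects how well they track the targets linearly.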
Getting Started
1. Environment Setup
We recommend using a virtual environment (Python >= 3.8.2):
```bash
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate
# or for Windows: venv\Scripts\activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
```
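Before installing dependencies, it can help to confirm that the interpreter meets the stated minimum version. The check below is a small sketch; `MIN_PYTHON` and `python_is_supported` are hypothetical names and are not part of the repository.

```python
import sys

# Minimum version stated in this README (Python >= 3.8.2)
MIN_PYTHON = (3, 8, 2)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor, micro) version meets the minimum."""
    return tuple(version_info[:3]) >= minimum


if not python_is_supported():
    print("Python %d.%d.%d or newer is recommended" % MIN_PYTHON)
```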
2. Usage
To run the proposed LLM-based model (Group E):
```bash
python E.py
```
3. File Structure
Data Files:
ze41_combined.csv: Raw experimental dataset.
E-embedding.csv: Text-embeddings (1024-dim) generated via LLM API.
Code Experiments:
A.py, B.py, C.py, D.py, E.py: Progressive experimental scripts.
gnn.py: Hybrid GNN (GCN-like + MLP) evaluation.
RFE.py: Feature importance analysis using 10-fold cross-validation.
tml.py: Comparison with traditional ML models (SVM, RF, MLP).
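RFE.py is described as using 10-fold cross-validation. As a rough illustration of what such a split looks like, here is a standard-library sketch that partitions sample indices into shuffled folds. The function name `k_fold_indices`, the seed, and the shuffling are assumptions; the actual script may rely on a library splitter such as scikit-learn's `KFold` with different settings.

```python
import random


def k_fold_indices(n_samples, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation.

    Hypothetical helper for illustration; not taken from RFE.py.
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # The first (n_samples % n_splits) folds receive one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Every sample appears in exactly one test fold across the ten splits.
for train_idx, test_idx in k_fold_indices(25, n_splits=10):
    print(len(train_idx), len(test_idx))
```

On a small dataset, each test fold holds only a handful of samples, so per-fold scores can vary widely; averaging over all folds gives the more stable estimate.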
Documentation
For detailed logs and parameters, please refer to the included files:
save.md: Full terminal outputs and logs for all scripts.
Xgboost.md: Detailed record of all XGBoost hyperparameters.
reame.md: This file.
Important Notes
API Access: In E.py, the API_KEY is set to "YOUR_API_KEY". The script automatically loads pre-computed vectors from E-embedding.csv, so the results can be reproduced without an active API key.
Hardware: gnn.py utilizes CUDA by default and falls back to CPU if no compatible GPU is detected.
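The two notes above describe fallback behavior that can be sketched as follows. This is not the code from E.py or gnn.py: the helper names, the headerless all-numeric CSV layout, and the rule of falling back whenever the key is the placeholder are all assumptions made for illustration. The device check uses `torch.cuda.is_available()` and returns "cpu" if PyTorch is not installed.

```python
import csv

PLACEHOLDER_KEY = "YOUR_API_KEY"


def load_precomputed_embeddings(path, expected_dim=1024):
    """Read one embedding per row from a headerless CSV of floats.

    The file layout is an assumption; E-embedding.csv may contain a header
    or extra columns, in which case this reader would need adjusting.
    """
    vectors = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            vector = [float(value) for value in row]
            if len(vector) != expected_dim:
                raise ValueError(
                    f"expected {expected_dim} values, got {len(vector)}"
                )
            vectors.append(vector)
    return vectors


def get_embeddings(api_key, cache_path, expected_dim=1024):
    """Use cached vectors when no real API key is configured (assumed rule)."""
    if not api_key or api_key == PLACEHOLDER_KEY:
        return load_precomputed_embeddings(cache_path, expected_dim)
    # A live embedding request is intentionally left out of this sketch.
    raise NotImplementedError("live API call is outside this sketch")


def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Validating the vector length at load time catches a mismatched or truncated cache file early, before it surfaces as a shape error inside model training.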
Citations and Acknowledgements
Data Source
If you use the raw dataset (ze41_combined.csv), please cite:
[1] https://doi.org/10.1038/s41529-023-00391-0
This Research
If you find this code helpful, please cite our publication:
(Insert your publication citation here)
Provider:
Zenodo
Created:
2026-05-15



