Dataset for the machine learning-assisted discovery of defect-passivating additives in perovskite solar cells
Source: DataCite Commons | Updated: 2026-01-07 | Indexed: 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=3ff43c7e618d4432a7c95fa7d42a0c80
Description:
This dataset was built by systematically collecting and integrating experimental data from recent, publicly available literature on Lewis base additives used for defect passivation in perovskite solar cells. It comprises 146 experimental entries covering 44 distinct small‑molecule Lewis bases; each entry records the device performance results for one molecule at one concentration. The variables are the numbers of C, H, O, N, and S atoms, molecular weight, topological polar surface area (TPSA), number of hydrogen‑bond donors, number of hydrogen‑bond acceptors, heavy atom count, molecular complexity, exact mass, number of rings, number of rotatable bonds, HOMO energy level, LUMO energy level, HOMO‑LUMO gap, dipole moment, octanol‑water partition coefficient (MLogP), additive concentration (in mg/mL), and the device performance metric, the power conversion efficiency enhancement (ΔPCE). Samples are divided into two classes according to whether ΔPCE reaches 2% (Class I: <2%; Class II: ≥2%), and this class label is the prediction target for the machine‑learning classification model.

During data preprocessing and feature engineering, molecular structure information was first obtained from the PubChem database using SMILES notations, and physicochemical descriptors were calculated with the RDKit toolkit. Electronic structure parameters were derived from density functional theory (DFT) calculations. Highly redundant features were eliminated through Pearson correlation analysis (e.g., exact mass was perfectly correlated with molecular weight, and heavy atom count was highly correlated with carbon atom count), leaving a final set of 17 features. To address the class imbalance in the dataset (Class I to Class II ratio of approximately 2.95:1), the Synthetic Minority Over‑sampling Technique (SMOTE) was applied, bringing the ratio to about 1.20:1.
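The class labelling and correlation-based pruning described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the rows are made-up values rather than dataset entries, the column names are hypothetical, and the 0.95 correlation cutoff is an assumption, since the description does not state the threshold that was used.

```python
from math import sqrt

# Illustrative rows only; these values are NOT taken from the dataset.
rows = [
    {"mol_weight": 60.0,  "exact_mass": 59.94,   "tpsa": 69.0, "delta_pce": 1.1},
    {"mol_weight": 76.0,  "exact_mass": 75.924,  "tpsa": 20.0, "delta_pce": 2.0},
    {"mol_weight": 122.0, "exact_mass": 121.878, "tpsa": 63.0, "delta_pce": 0.4},
    {"mol_weight": 180.0, "exact_mass": 179.82,  "tpsa": 40.0, "delta_pce": 3.2},
    {"mol_weight": 94.0,  "exact_mass": 93.906,  "tpsa": 90.0, "delta_pce": 1.9},
]


def label(delta_pce, threshold=2.0):
    """Return 'II' if the PCE enhancement reaches the threshold, else 'I'."""
    return "II" if delta_pce >= threshold else "I"


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def prune_correlated(rows, features, cutoff=0.95):
    """Keep a feature only if |r| with every already-kept feature is below
    the cutoff. The cutoff value is an assumption made for this sketch."""
    kept = []
    for name in features:
        column = [row[name] for row in rows]
        if all(abs(pearson(column, [row[k] for row in rows])) < cutoff for k in kept):
            kept.append(name)
    return kept


labels = [label(row["delta_pce"]) for row in rows]
kept = prune_correlated(rows, ["mol_weight", "exact_mass", "tpsa"])
print(labels)
print(kept)
```

In this toy data the exact mass is an exact multiple of the molecular weight, so it is dropped as redundant, which mirrors the example given in the description. The order in which features are visited decides which member of a correlated pair survives.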
All numerical features were normalized to the [0,1] interval via min‑max scaling so that differing units do not affect model training. The dataset contains no significant missing values; every feature was populated from either literature data or computational results.

All data processing and analysis were carried out in Python using common libraries such as pandas, RDKit, scikit‑learn, and SHAP. The associated code and workflow are reproducible and suited to research that applies machine‑learning methods in materials science.
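The min‑max scaling mentioned above maps each feature value x to (x − min) / (max − min). A minimal sketch in plain Python follows; the concentration values are illustrative and not taken from the dataset, and the handling of a constant column is an assumption of this sketch. scikit‑learn's `MinMaxScaler` provides the same transformation, though the description does not say which routine was used.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: there is no spread to rescale, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative additive concentrations in mg/mL (made-up values).
concentration_mg_ml = [0.5, 1.0, 2.0, 5.0]
scaled = min_max_scale(concentration_mg_ml)
print(scaled)
```

When a model is evaluated on held‑out data, the minimum and maximum are usually taken from the training split only and then reused for the test split, so that no information leaks from the test set.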
Provider:
Science Data Bank
Created:
2026-01-07



