An explainable Raman spectral analysis pipeline via nonnegative matrix factorization, machine learning, and Shapley additive explanations
Source: DataCite Commons | Updated: 2025-09-04 | Indexed: 2026-05-04
Download link:
http://doi.nrct.go.th/?page=resolve_doi&resolve_doi=10.14457/TU.the.2024.521
Description:
Ink analysis plays a crucial role in validating the authenticity of questioned documents in forensic document examination (FDE), a field at risk due to its shrinking number of personnel. Raman spectroscopy is a non-destructive and cost-effective method for capturing the unique spectral fingerprints of pen ink samples, yet conventional analysis methods require expert reading of the acquired spectra, which is labor-intensive. To address these gaps, this thesis proposes an explainable pipeline for Raman spectral classification that distinguishes pen ink colors, evaluated in two experiments. Our 1,305 Raman spectra are collected from 53 blue pens, 42 black pens, and 50 red pens, each a unique pen model. To improve the signal-to-noise ratio of the Raman spectra in both experiments, preprocessing steps are applied, including cosmic ray removal, smoothing, baseline correction, despiking, and max normalization. Data augmentation is conducted to increase the training data. In our first experiment, the weights of 48 non-negative matrix factorization (NMF) components are extracted from the preprocessed Raman spectra and used as inputs to a Random Forest classifier, achieving a test accuracy of 0.8456. NMF allows the underlying linear compositions of spectra associated with the ink color classes to be observed, and SHapley Additive exPlanations (SHAP) then quantifies the average impact of each NMF component on the classification. The scalar multiplication between NMF components, their weights, and their SHAP values enables a human-intuitive explanation of the model prediction results, capturing the unique Raman spectral patterns expected for each ink color class. In our second experiment, 30 common peak locations in the preprocessed spectra are detected through peak detection and hierarchical clustering. Peak binning is conducted on each preprocessed spectrum to extract the aggregated value at each of the 30 common peak locations. K-means clustering is conducted on the peak-binned spectra to investigate the inter-class and intra-class spectral patterns. The weights of 21 NMF components extracted from the peak-binned spectra are used as inputs to a Random Forest classifier, achieving a test accuracy of 0.8380. SHAP quantifies the feature attribution and importance of each NMF component for the model prediction, capturing the meaningful Raman spectral patterns expected for each ink color class at the level of the common peak locations. Our thesis contributes an explainable spectral classification pipeline using NMF and SHAP, addressing the literature gaps in NMF applications for pen ink spectral analysis and laying the groundwork for future research on more complex ink analysis tasks to mitigate the personnel shortage in the FDE field.
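As a rough illustration of the max normalization and peak binning steps described above, the following is a minimal sketch and not the thesis implementation. The record does not say how values are aggregated at each peak location or how wide each bin is, so the window half-width (`half_width`) and the use of the maximum as the aggregate are assumptions. The function names and the toy spectrum are hypothetical.

```python
from bisect import bisect_left, bisect_right


def max_normalize(intensities):
    """Scale a spectrum so that its largest intensity equals 1."""
    peak = max(intensities)
    if peak <= 0:
        raise ValueError("spectrum has no positive intensity")
    return [value / peak for value in intensities]


def bin_peaks(wavenumbers, intensities, peak_centers, half_width=5.0):
    """Return one aggregated value per common peak location.

    `wavenumbers` must be sorted in ascending order. The aggregate is the
    maximum intensity within +/- half_width of each center (an assumption),
    and 0.0 when no sample falls inside the window.
    """
    binned = []
    for center in peak_centers:
        lo = bisect_left(wavenumbers, center - half_width)
        hi = bisect_right(wavenumbers, center + half_width)
        window = intensities[lo:hi]
        binned.append(max(window) if window else 0.0)
    return binned


# Toy spectrum with made-up values, for illustration only.
wavenumbers = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]
intensities = [2.0, 8.0, 3.0, 1.0, 4.0, 2.0]
peak_centers = [110.0, 140.0, 200.0]  # hypothetical common peak locations

normalized = max_normalize(intensities)
features = bin_peaks(wavenumbers, normalized, peak_centers, half_width=5.0)
print(features)
```

Each spectrum is reduced to a fixed-length vector with one entry per common peak location, which is the kind of input that could then be passed to NMF and a classifier. A sum or mean over the window would be an equally plausible aggregate.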
Provider:
Thammasat University
Created:
2025-09-04



