Open format raw data for MALDI Imaging Mass Spectrometry Differentiates Basal Cell Carcinoma from Trichoblastoma and Trichoepithelioma: A Proof of Principle Study

Name: Open format raw data for MALDI Imaging Mass Spectrometry Differentiates Basal Cell Carcinoma from Trichoblastoma and Trichoepithelioma: A Proof of Principle Study
Creator: Zenodo
Published: 2025-03-30 06:18:33
License: not specified

Source: Zenodo. Updated: 2025-03-30. Indexed: 2026-05-26.

Download link:

https://zenodo.org/doi/10.5281/zenodo.15064821

Description:

Raw data in HDF5 format accompanying "MALDI Imaging Mass Spectrometry Differentiates Basal Cell Carcinoma from Trichoblastoma and Trichoepithelioma: A Proof of Principle Study".

Abstract

Background

Basal cell carcinoma (BCC) comprises a large portion of dermatopathology specimens; however, benign mimics such as trichoblastoma/trichoepithelioma (TB/TE) place accurate diagnosis at risk and consequently lead to inappropriate clinical management and overuse of healthcare resources. This study aims to address the challenges of traditional histopathological evaluation by utilizing matrix-assisted laser desorption ionization imaging mass spectrometry (MALDI IMS).

Methods and Findings

Formalin-fixed paraffin-embedded BCC and TB/TE tissue blocks were taken from archival tissue. A cohort of 69 BCC and TB/TE specimens was identified, each having three concordant diagnoses given by dermatopathologists after a blinded analysis. H&E-stained sections of each specimen were imaged for pathological analysis and uploaded to digital annotation software with the following classifications: BCC, TB, TE, BCC stroma, TB stroma, and TE stroma. Mass spectra were collected from unstained serial sections guided by the areas annotated by the dermatopathologists on the H&E-stained serial sections. Before informatics, the data from the cohort were divided randomly into a training set (n=55) and a validation set (n=14). Prediction models were developed using a support vector machine (SVM) classification model from the training set data. The platform predicted BCC and TB/TE in model 2 (tumor nests alone) with a sensitivity of 98.9% (95% CI 98.3-99.4%) and specificity of 88.4% (95% CI 78.4-94.5%) at the spectral level in the validation set. Model 1 (stroma alone) had a sensitivity of 46.1% (95% CI 43.0-49.1%) and specificity of 99.2% (95% CI 97.1-99.9%). A combined model 3 (tumor nests and stroma) had a sensitivity of 90.26% (95% CI 89.1-91.3%) and a specificity of 97.1% (95% CI 94.6-98.7%).
The limitations of this study included a small sample set, which included easily identifiable cases obtained from a single tissue source.

Conclusions

Our study proves that BCC and TB/TE exhibit different proteomic profiles that one can use to enable accurate differential diagnosis. While our findings are not yet validated for clinical use, this merits further research to support IMS as an ancillary diagnostic tool for adequately and efficiently identifying the most common cutaneous malignancy in the United States. We recommend that future studies obtain a more extensive set of histologically challenging cases from multiple institutions and adequate clinical follow-up to confirm diagnostic accuracy.

Documentation: Reading MSI Spot Data H5 Files

Overview

This document describes the structure of the H5 files created by the Mass Spectrometry Imaging (MSI) data processing pipeline and provides examples of how to read these files using both Python and R.

H5 File Structure

Each H5 file contains the following datasets:

intensity: A 2D array (matrix) containing intensity values for each spot and m/z value
  Dimensions: [number_of_spots × number_of_m/z_values]
  Data type: 32-bit floating point (float32)
  Compression: GZIP (level 1)

mz_values: A 1D array containing the m/z values
  Dimensions: [number_of_m/z_values]
  Data type: 32-bit floating point (float32)

spot_names: A 1D array containing the spot names/identifiers
  Dimensions: [number_of_spots]
  Data type: ASCII string (bytes)

Reading the H5 Files in Python

Using h5py

    import h5py
    import numpy as np
    import matplotlib.pyplot as plt

    # Open the H5 file
    file_path = "sample-MSI-spot-data.h5"
    with h5py.File(file_path, 'r') as f:
        # Read the datasets
        intensity_array = f['intensity'][:]  # Read the full intensity array
        mz_values = f['mz_values'][:]        # Read the m/z values
        spot_names = f['spot_names'][:]      # Read the spot names

    # Convert byte strings to regular strings if needed
    spot_names = [name.decode('utf-8') if isinstance(name, bytes) else str(name)
                  for name in spot_names]

    # Print basic information
    print(f"Number of spots: {intensity_array.shape[0]}")
    print(f"Number of m/z values: {intensity_array.shape[1]}")
    print(f"First few m/z values: {mz_values[:5]}")
    print(f"First few spot names: {spot_names[:5]}")

    # Example: Extract spectrum for first spot
    plt.figure(figsize=(10, 6))
    plt.plot(mz_values, intensity_array[0, :])
    plt.xlabel('m/z')
    plt.ylabel('Intensity')
    plt.title(f'Mass Spectrum for Spot: {spot_names[0]}')
    plt.show()

    # Example: Extract intensity for a specific m/z value across all spots
    target_mz = 500.0  # Replace with your m/z of interest
    closest_mz_idx = np.abs(mz_values - target_mz).argmin()
    actual_mz = mz_values[closest_mz_idx]
    print(f"Closest m/z to {target_mz} is {actual_mz}")
    intensity_at_mz = intensity_array[:, closest_mz_idx]
    # Now intensity_at_mz contains the intensity for that m/z across all spots

Reading the H5 Files in R

Using rhdf5

    # Install required packages if not already installed
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    if (!requireNamespace("rhdf5", quietly = TRUE))
        BiocManager::install("rhdf5")
    if (!requireNamespace("ggplot2", quietly = TRUE))
        install.packages("ggplot2")

    library(rhdf5)
    library(ggplot2)

    # Set the file path
    file_path <- "sample-MSI-spot-data.h5"

    # Read the datasets
    intensity_array <- h5read(file_path, "intensity")
    mz_values <- as.vector(h5read(file_path, "mz_values"))
    spot_names <- h5read(file_path, "spot_names")

    # rhdf5 normally returns string datasets as a character vector;
    # convert from raw bytes only if that is what was returned
    if (!is.character(spot_names)) spot_names <- sapply(spot_names, rawToChar)
    spot_names <- as.character(spot_names)

    # rhdf5 reports dimensions in the reverse order of h5py, so the matrix
    # may arrive as [m/z x spots]; transpose it to [spots x m/z] if so
    if (nrow(intensity_array) != length(spot_names))
        intensity_array <- t(intensity_array)

    # Print basic information
    cat("Number of spots:", dim(intensity_array)[1], "\n")
    cat("Number of m/z values:", dim(intensity_array)[2], "\n")
    cat("First few m/z values:", head(mz_values, 5), "\n")
    cat("First few spot names:", head(spot_names, 5), "\n")

    # Example: Extract spectrum for first spot
    first_spot_spectrum <- intensity_array[1, ]
    spectrum_data <- data.frame(mz = mz_values, intensity = first_spot_spectrum)

    # Plot the spectrum
    ggplot(spectrum_data, aes(x = mz, y = intensity)) +
        geom_line() +
        labs(title = paste("Mass Spectrum for Spot:", spot_names[1]),
             x = "m/z", y = "Intensity") +
        theme_minimal()

    # Example: Find intensity for a specific m/z value across all spots
    target_mz <- 500.0
    closest_mz_idx <- which.min(abs(mz_values - target_mz))
    actual_mz <- mz_values[closest_mz_idx]
    cat("Closest m/z to", target_mz, "is", actual_mz, "\n")
    intensity_at_mz <- intensity_array[, closest_mz_idx]
    # Now intensity_at_mz contains the intensity for that m/z across all spots

Additional Information

The intensity values represent the intensity of the mass spectrometry signal for each m/z value at each spot.
Spot names typically include information about the location of the spot on the sample and can be cross-referenced with the spot information in the accompanying training and test .csv files.
The m/z values represent the mass-to-charge ratio of the detected ions.
This H5 file format allows for efficient storage and retrieval of large MSI datasets.

Troubleshooting

If you encounter issues reading the file:

Ensure the H5 file exists at the specified path.
Verify that you have a compatible version of h5py (Python) or rhdf5 (R) installed.
For large files, ensure you have sufficient memory available.
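
The memory note under Troubleshooting can be made concrete. The following is a minimal sketch, not part of the original documentation, of reading the data without loading the full intensity matrix. It assumes the dataset names described above (intensity, mz_values) and a [spots × m/z] layout; the helper names mean_spectrum, nearest_mz_index and read_ion_intensities and the default block_size are hypothetical, chosen only for illustration.

```python
import numpy as np


def mean_spectrum(intensity, block_size=256):
    """Average spectrum over all spots, reading block_size rows at a time.

    `intensity` is any 2D array-like of shape (n_spots, n_mz) that supports
    NumPy-style slicing, such as an open h5py dataset or a NumPy array.
    """
    n_spots, n_mz = intensity.shape
    if n_spots == 0:
        raise ValueError("intensity has no spots")
    total = np.zeros(n_mz, dtype=np.float64)  # accumulate in float64
    for start in range(0, n_spots, block_size):
        # Only this block of rows is held in memory at once.
        block = np.asarray(intensity[start:start + block_size, :])
        total += block.sum(axis=0, dtype=np.float64)
    return total / n_spots


def nearest_mz_index(mz_values, target_mz):
    """Index of the m/z value closest to target_mz."""
    return int(np.abs(np.asarray(mz_values) - target_mz).argmin())


def read_ion_intensities(file_path, target_mz):
    """Read a single m/z column across all spots from an H5 file."""
    import h5py  # imported here so the helpers above work without h5py

    with h5py.File(file_path, "r") as f:
        idx = nearest_mz_index(f["mz_values"][:], target_mz)
        # Slicing the dataset returns only the requested column, so the
        # full matrix is never held in memory.
        return f["intensity"][:, idx]
```

To use mean_spectrum on a file, pass the open dataset rather than a loaded array, for example mean_spectrum(f["intensity"]) inside a with h5py.File(...) block. Because the file is GZIP-compressed, each read still decompresses the chunks it touches, so a larger block_size trades memory for fewer reads.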

Provider:

Zenodo

Created:

2025-03-30