Open format raw data for MALDI Imaging Mass Spectrometry Differentiates Basal Cell Carcinoma from Trichoblastoma and Trichoepithelioma: A Proof of Principle Study
Source: Zenodo. Updated 2025-03-30; indexed 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.15064821
Description:
Raw data in HDF5 format accompanying "MALDI Imaging Mass Spectrometry Differentiates Basal Cell Carcinoma from Trichoblastoma and Trichoepithelioma: A Proof of Principle Study".
Abstract
Background
Basal cell carcinoma (BCC) comprises a large portion of dermatopathology specimens; however, benign mimics such as trichoblastoma/trichoepithelioma (TB/TE) place accurate diagnosis at risk and consequently lead to inappropriate clinical management and overuse of healthcare resources. This study aims to address the challenges of traditional histopathological evaluation by utilizing matrix-assisted laser desorption ionization imaging mass spectrometry (MALDI IMS).
Methods and Findings
Formalin-fixed paraffin-embedded BCC and TB/TE tissue blocks were taken from archival tissue. A cohort of 69 BCC and TB/TE specimens was identified, each with three concordant diagnoses given by dermatopathologists after a blinded analysis. H&E-stained sections of each specimen were imaged for pathological analysis and uploaded to digital annotation software with the following classifications: BCC, TB, TE, BCC stroma, TB stroma, and TE stroma. Mass spectra were collected from unstained serial sections, guided by the areas the dermatopathologists had annotated on the H&E-stained serial sections. Before informatics analysis, the data from the cohort were divided randomly into a training set (n=55) and a validation set (n=14). Prediction models were developed using a support vector machine (SVM) classification model from the training set data.
The platform predicted BCC and TB/TE in model 2 (tumor nests alone) with a sensitivity of 98.9% (95% CI 98.3-99.4%) and specificity of 88.4% (95% CI 78.4-94.5%) at the spectral level in the validation set. Model 1 (stroma alone) had a sensitivity of 46.1% (95% CI 43.0-49.1%) and specificity of 99.2% (95% CI 97.1-99.9%). A combined model 3 (tumor nests and stroma) had a sensitivity of 90.26% (95% CI 89.1-91.3%) and a specificity of 97.1% (95% CI 94.6-98.7%). Limitations of this study include a small sample set consisting of easily identifiable cases obtained from a single tissue source.
Conclusions
Our study proves that BCC and TB/TE exhibit different proteomic profiles that one can use to enable accurate differential diagnosis. While our findings are not yet validated for clinical use, this merits further research to support IMS as an ancillary diagnostic tool for adequately and efficiently identifying the most common cutaneous malignancy in the United States. We recommend that future studies obtain a more extensive set of histologically challenging cases from multiple institutions and adequate clinical follow-up to confirm diagnostic accuracy.
Documentation: Reading MSI Spot Data H5 Files
Overview
This document describes the structure of the H5 files created by the Mass Spectrometry Imaging (MSI) data processing pipeline and provides examples of how to read these files using both Python and R.
H5 File Structure
Each H5 file contains the following datasets:
intensity: A 2D array (matrix) containing intensity values for each spot and m/z value
  Dimensions: [number_of_spots × number_of_m/z_values]
  Data type: 32-bit floating point (float32)
  Compression: GZIP (level 1)
mz_values: A 1D array containing the m/z values
  Dimensions: [number_of_m/z_values]
  Data type: 32-bit floating point (float32)
spot_names: A 1D array containing the spot names/identifiers
  Dimensions: [number_of_spots]
  Data type: ASCII string (bytes)
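The layout above implies a few consistency checks that can be run before any analysis. The following is a minimal sketch, not part of the original pipeline: the function names check_layout and check_file are illustrative, and the checks only cover the shapes and the float32 type stated above.

```python
import numpy as np


def check_layout(intensity, mz_values, spot_names):
    """Return a list of mismatches against the documented layout (empty if none)."""
    problems = []
    if len(intensity.shape) != 2:
        problems.append(f"intensity should be 2D, got shape {intensity.shape}")
        return problems
    n_spots, n_mz = intensity.shape
    if mz_values.shape != (n_mz,):
        problems.append(f"mz_values has shape {mz_values.shape}, expected ({n_mz},)")
    if spot_names.shape != (n_spots,):
        problems.append(f"spot_names has shape {spot_names.shape}, expected ({n_spots},)")
    if intensity.dtype != np.float32:
        problems.append(f"intensity has dtype {intensity.dtype}, expected float32")
    return problems


def check_file(path):
    # h5py is imported lazily so check_layout also works on plain NumPy arrays
    import h5py

    with h5py.File(path, "r") as f:
        # h5py datasets expose .shape and .dtype without loading the data
        return check_layout(f["intensity"], f["mz_values"], f["spot_names"])
```

An empty list from check_file("sample-MSI-spot-data.h5") would indicate that the file matches the documented dimensions. A swapped row/column orientation shows up as mismatches on both mz_values and spot_names.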
Reading the H5 Files in Python
Using h5py
import h5py
import numpy as np
import matplotlib.pyplot as plt
# Open the H5 file
file_path = "sample-MSI-spot-data.h5"
with h5py.File(file_path, 'r') as f:
    # Read the datasets into memory so they stay usable after the file is closed
    intensity_array = f['intensity'][:]  # Read the full intensity array
    mz_values = f['mz_values'][:]  # Read the m/z values
    spot_names = f['spot_names'][:]  # Read the spot names
# Convert byte strings to regular strings if needed
spot_names = [name.decode('utf-8') if isinstance(name, bytes) else str(name) for name in spot_names]
# Print basic information
print(f"Number of spots: {intensity_array.shape[0]}")
print(f"Number of m/z values: {intensity_array.shape[1]}")
print(f"First few m/z values: {mz_values[:5]}")
print(f"First few spot names: {spot_names[:5]}")
# Example: Extract spectrum for first spot
plt.figure(figsize=(10, 6))
plt.plot(mz_values, intensity_array[0, :])
plt.xlabel('m/z')
plt.ylabel('Intensity')
plt.title(f'Mass Spectrum for Spot: {spot_names[0]}')
plt.show()
# Example: Extract intensity for a specific m/z value across all spots
target_mz = 500.0 # Replace with your m/z of interest
closest_mz_idx = np.abs(mz_values - target_mz).argmin()
actual_mz = mz_values[closest_mz_idx]
print(f"Closest m/z to {target_mz} is {actual_mz}")
intensity_at_mz = intensity_array[:, closest_mz_idx]
# Now intensity_at_mz contains the intensity for that m/z across all spots
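The nearest-value lookup above returns a single m/z bin. When a peak spans several bins, summing over a small window around the target can be more robust. The sketch below is one possible way to do this. The function name intensity_in_window and the tolerance parameter tol are illustrative and not part of the dataset documentation.

```python
import numpy as np


def intensity_in_window(intensity, mz_values, target_mz, tol):
    """Sum intensity over all m/z bins within +/- tol of target_mz, per spot."""
    mz_values = np.asarray(mz_values)
    # Boolean mask over the m/z axis; no assumption that mz_values is sorted
    mask = np.abs(mz_values - target_mz) <= tol
    if not mask.any():
        raise ValueError(f"no m/z value within {tol} of {target_mz}")
    # Select the matching columns and sum across them for every spot
    return np.asarray(intensity)[:, mask].sum(axis=1)
```

With the arrays loaded above, intensity_in_window(intensity_array, mz_values, 500.0, 0.2) would return one summed value per spot. A suitable tolerance depends on the m/z spacing of the data, which is not stated here.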
Reading the H5 Files in R
Using rhdf5
# Install required packages if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!requireNamespace("rhdf5", quietly = TRUE))
  BiocManager::install("rhdf5")
library(rhdf5)
library(ggplot2)
# Set the file path
file_path <- "sample-MSI-spot-data.h5"
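# Read the datasets
intensity_array <- h5read(file_path, "intensity")
mz_values <- h5read(file_path, "mz_values")
spot_names <- h5read(file_path, "spot_names")
# By default rhdf5 reports dimensions in reverse order relative to h5py,
# so transpose if the columns do not line up with the m/z axis
if (ncol(intensity_array) != length(mz_values)) {
  intensity_array <- t(intensity_array)
}
# rhdf5 returns HDF5 strings as character vectors; coerce to a plain vector
spot_names <- as.character(spot_names)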
# Print basic information
cat("Number of spots:", dim(intensity_array)[1], "\n")
cat("Number of m/z values:", dim(intensity_array)[2], "\n")
cat("First few m/z values:", head(mz_values, 5), "\n")
cat("First few spot names:", head(spot_names, 5), "\n")
# Example: Extract spectrum for first spot
first_spot_spectrum <- intensity_array[1,]
spectrum_data <- data.frame(mz = mz_values, intensity = first_spot_spectrum)
# Plot the spectrum
ggplot(spectrum_data, aes(x = mz, y = intensity)) +
  geom_line() +
  labs(title = paste("Mass Spectrum for Spot:", spot_names[1]),
       x = "m/z",
       y = "Intensity") +
  theme_minimal()
# Example: Find intensity for a specific m/z value across all spots
target_mz <- 500.0
closest_mz_idx <- which.min(abs(mz_values - target_mz))
actual_mz <- mz_values[closest_mz_idx]
cat("Closest m/z to", target_mz, "is", actual_mz, "\n")
intensity_at_mz <- intensity_array[, closest_mz_idx]
# Now intensity_at_mz contains the intensity for that m/z across all spots
Additional Information
The intensity values represent the intensity of the mass spectrometry signal for each m/z value at each spot.
Spot names typically include information about the location of the spot on the sample and can be cross-referenced with the spot information in the accompanying training and test .csv files.
The m/z values represent the mass-to-charge ratio of the detected ions.
This H5 file format allows for efficient storage and retrieval of large MSI datasets.
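The cross-referencing between spot names and the .csv files could be sketched as follows. The column layout of the training and test files is not described here, so the header name "spot" is a hypothetical placeholder, as is the function name rows_for_csv_spots. Adjust both to the actual files.

```python
import csv

import numpy as np


def rows_for_csv_spots(spot_names, csv_lines, column="spot"):
    """Return (row_indices, missing) for the spots listed in one CSV column.

    `column` is a hypothetical header name; replace it with the one
    actually used in the training/test .csv files.
    """
    # Map each spot name to its row in the intensity matrix
    index_of = {name: i for i, name in enumerate(spot_names)}
    rows, missing = [], []
    # csv.DictReader accepts an open file or any iterable of text lines
    for record in csv.DictReader(csv_lines):
        name = record[column]
        if name in index_of:
            rows.append(index_of[name])
        else:
            missing.append(name)
    return np.array(rows, dtype=int), missing
```

Given the returned indices, intensity_array[rows, :] selects the spectra for the listed spots in CSV order. Any names reported in missing point to a mismatch between the CSV and the H5 file that should be checked by hand.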
Troubleshooting
If you encounter issues reading the file:
Ensure the H5 file exists at the specified path.
Verify that you have the correct version of h5py (Python) or rhdf5 (R) installed.
For large files, ensure you have sufficient memory available.
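When memory is the limiting factor, the intensity matrix does not have to be loaded in one piece. The sketch below computes the mean spectrum by reading a block of spots at a time. The function name mean_spectrum and the default block size are illustrative choices, not part of the original documentation.

```python
import numpy as np


def mean_spectrum(dataset, block_size=1024):
    """Average spectrum over all spots, reading block_size rows at a time."""
    n_spots, n_mz = dataset.shape
    # Accumulate in float64 to limit rounding error when summing float32 data
    total = np.zeros(n_mz, dtype=np.float64)
    for start in range(0, n_spots, block_size):
        # Slicing an h5py dataset reads only the requested rows
        block = dataset[start:start + block_size, :]
        total += np.asarray(block, dtype=np.float64).sum(axis=0)
    return total / n_spots
```

The function accepts a NumPy array or an open h5py dataset, for example mean_spectrum(f['intensity']) inside a with h5py.File(file_path, 'r') as f: block, so only one block of rows is held in memory at any time.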
Provider: Zenodo
Created: 2025-03-30



