food-ai-nexus/salmonella-serovar-hyperspectral-spectra
Source: Hugging Face · Updated 2026-04-01 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/food-ai-nexus/salmonella-serovar-hyperspectral-spectra
Resource description:
---
license: cc-by-4.0
language:
- en
pretty_name: Salmonella Serovar Hyperspectral Spectra (Foods 2025)
tags:
- food-safety
- salmonella
- hyperspectral
- tabular-classification
- spectroscopy
task_categories:
- tabular-classification
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.parquet
  - split: test
    path: data/test.parquet
---
# Salmonella Serovar Hyperspectral Spectra (Foods 2025)
Salmonella Serovar Hyperspectral Spectra is a tabular dataset of single-cell spectral features for foodborne bacterial classification. It was created to support research in rapid pathogen identification, enabling models to classify *Salmonella* serovars using hyperspectral signatures extracted from individual bacterial cells.
> **Companion image dataset:** The RGB composite microscopy images from which these spectra were extracted are available at [`food-ai-nexus/salmonella-serovar-hyperspectral`](https://huggingface.co/datasets/food-ai-nexus/salmonella-serovar-hyperspectral).

This dataset accompanies the publication: Papa, M., Bhattacharya, S., Park, B., & Yi, J. (2025). Rapid *Salmonella* Serovar Classification Using AI-Enabled Hyperspectral Microscopy with Enhanced Data Preprocessing and Multimodal Fusion. *Foods*, 14(15), 2737. doi: [10.3390/foods14152737](https://doi.org/10.3390/foods14152737)
## Dataset Description
Each row represents one bacterial cell segmented from a hyperspectral data cube (hypercube). Each spectrum is the mean spectrum of a single cell, normalized with Standard Normal Variate (SNV), across 303 wavebands (399–1000 nm, 2 nm bandwidth). Spectra were extracted using an attention-gated residual U-Net (ARG2U-Net).
```python
from datasets import load_dataset
ds = load_dataset("food-ai-nexus/salmonella-serovar-hyperspectral-spectra")
# ds['train'] → 18,180 rows | ds['test'] → 7,792 rows
```
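The SNV normalization mentioned above centers each spectrum on its own mean and scales it by its own standard deviation. The following is a minimal, dependency-free sketch of that transform together with a sanity check. The function names, the toy values, and the use of the population standard deviation are assumptions made for illustration; the card does not state which variant the authors used.

```python
from statistics import fmean, pstdev

def snv(spectrum):
    """Standard Normal Variate: center one spectrum on its own mean
    and scale it by its own standard deviation."""
    mu = fmean(spectrum)
    sigma = pstdev(spectrum, mu)  # population std; this choice is an assumption
    if sigma == 0:
        raise ValueError("SNV is undefined for a constant spectrum")
    return [(value - mu) / sigma for value in spectrum]

def looks_snv_normalized(spectrum, tol=1e-2):
    """Return True if the spectrum has mean ~0 and std ~1 (within tol)."""
    return abs(fmean(spectrum)) < tol and abs(pstdev(spectrum) - 1.0) < tol

raw = [0.12, 0.18, 0.35, 0.41, 0.29]  # made-up reflectance values
normalized = snv(raw)
print(looks_snv_normalized(raw), looks_snv_normalized(normalized))  # False True
```

The tolerance is deliberately loose so that a spectrum normalized with the sample standard deviation would still pass. Rows containing NaN values will fail the check.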
## Splits
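The 70/30 train/test split was made at the row (cell) level, stratified by `Serovar` (seed=42), mirroring the methodology reported in the paper. Because the split is by row rather than by hypercube, cells from the same source hypercube may appear in both splits.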
| Split | Rows | Notes |
| :--- | ---: | :--- |
| `train` | 18,180 | 70% stratified by serovar |
| `test` | 7,792 | 30% stratified by serovar |
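For readers who want to see what a row-level stratified split looks like, here is a minimal sketch using only the standard library. It is an illustration of the technique, not the authors' procedure: the function name, the per-class rounding rule, and the toy labels are assumptions, and it is not expected to reproduce the published row assignment or the exact per-class counts.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.7, seed=42):
    """Return (train_idx, test_idx), splitting each class separately
    so that every class keeps roughly the same train/test ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):  # sorted so the result is deterministic
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * train_frac)  # rounding rule is an assumption
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)

labels = ["Kentucky"] * 10 + ["Infantis"] * 20  # toy labels, not real data
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 21 9
```

In practice, the published `train` and `test` splits should be used as-is for comparability with the paper.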
## Schema
| Column | Type | Description |
| :--- | :--- | :--- |
| `InImage_ID` | int | Per-serovar cell index identifying the source hypercube |
| `Band_2_W_401.00` … `Band_303_W_1000.90` | float | SNV-normalized mean spectral reflectance at each waveband (nm) |
| `Serovar` | string | Target label: one of `Enteritidis`, `I4`, `Infantis`, `Johannesburg`, `Kentucky` |
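The spectral column names encode both a band index and a wavelength. A small sketch for selecting and ordering those columns follows. The regular expression is inferred from the two column names shown in the table above, so it is an assumption that every intermediate column follows the same pattern; the helper names are hypothetical.

```python
import re

# Pattern inferred from `Band_2_W_401.00` and `Band_303_W_1000.90` (assumption).
BAND_COLUMN = re.compile(r"^Band_(\d+)_W_(\d+(?:\.\d+)?)$")

def parse_band_column(name):
    """Return (band_index, wavelength_nm) for a spectral column, else None."""
    match = BAND_COLUMN.match(name)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

def spectral_columns(column_names):
    """Keep only spectral columns, ordered by band index."""
    parsed = [(parse_band_column(name), name) for name in column_names]
    bands = [(key, name) for key, name in parsed if key is not None]
    return [name for _, name in sorted(bands)]

columns = ["InImage_ID", "Band_303_W_1000.90", "Band_2_W_401.00", "Serovar"]
print(spectral_columns(columns))             # ['Band_2_W_401.00', 'Band_303_W_1000.90']
print(parse_band_column("Band_2_W_401.00"))  # (2, 401.0)
```

With the `datasets` library, the list of names can be taken from `ds["train"].column_names`.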
> **Important note on `InImage_ID`:** This index identifies the source hypercube **within each serovar group**, not globally. It cannot be used as a direct foreign key to join rows to specific files in the companion image dataset.
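If, as the note above states, `InImage_ID` is only meaningful within a serovar, then the pair (`Serovar`, `InImage_ID`) is the natural key for grouping rows by source hypercube. The sketch below shows that grouping on toy rows; the function name and the example rows are hypothetical.

```python
from collections import defaultdict

def group_by_hypercube(rows):
    """Map (Serovar, InImage_ID) -> list of row positions sharing that key."""
    groups = defaultdict(list)
    for position, row in enumerate(rows):
        groups[(row["Serovar"], row["InImage_ID"])].append(position)
    return dict(groups)

rows = [  # toy rows, not real data
    {"Serovar": "Kentucky", "InImage_ID": 1},
    {"Serovar": "Kentucky", "InImage_ID": 1},
    {"Serovar": "Infantis", "InImage_ID": 1},
]
groups = group_by_hypercube(rows)
print(len(groups))  # 2
```

Such groups could serve as the unit for group-aware cross-validation, so that cells sharing a key stay in the same fold.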
## Classes
Five *Salmonella* serovars selected based on their prevalence in foodborne illness outbreaks:
| Label | Serovar | Train Rows | Test Rows |
| :--- | :--- | ---: | ---: |
| `Enteritidis` | *S.* Enteritidis | 3,731 | 1,600 |
| `I4` | *S.* 4,[5],12:i:- | 3,638 | 1,559 |
| `Infantis` | *S.* Infantis | 5,201 | 2,229 |
| `Johannesburg` | *S.* Johannesburg | 3,265 | 1,399 |
| `Kentucky` | *S.* Kentucky | 2,345 | 1,005 |
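One common way to account for unequal class sizes during training is inverse-frequency class weighting. The sketch below applies the usual "balanced" heuristic, total / (n_classes × class count), to the train counts from the table above. Whether the paper used any class weighting is not stated here, so this is offered only as an option; the function name is hypothetical.

```python
# Train-split row counts copied from the class table above.
TRAIN_COUNTS = {
    "Enteritidis": 3731,
    "I4": 3638,
    "Infantis": 5201,
    "Johannesburg": 3265,
    "Kentucky": 2345,
}

def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count),
    so rarer classes receive proportionally larger weights."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

weights = balanced_class_weights(TRAIN_COUNTS)
for label, weight in weights.items():
    print(f"{label:<13} {weight:.3f}")
```

The smallest class (`Kentucky`) receives the largest weight and the largest class (`Infantis`) the smallest.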
> **Note on class imbalance:** The spectra are inherently imbalanced because different serovars yield different numbers of segmentable cells per hypercube. This reflects biological variation in cell density and morphology, not a sampling artifact.

> **Known data issue:** One cell record (`InImage_ID=73`, Enteritidis) has NaN values for bands 148–303 (wavelengths 692–1001 nm) in the original Zenodo source CSV. This row is preserved as-is to maintain source fidelity. Users should apply appropriate imputation or filtering before training.
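Filtering is the simpler of the two options mentioned above. Here is a minimal sketch that drops any row with a missing spectral value, treating both `NaN` and `None` as missing. The helper names and the toy column names `b1` and `b2` are placeholders for illustration, not real column names from the dataset.

```python
import math

def is_complete(row, band_columns):
    """True if every spectral value in the row is present and not NaN."""
    return all(
        row[col] is not None and not math.isnan(row[col])
        for col in band_columns
    )

def drop_incomplete(rows, band_columns):
    """Return only the rows whose spectral values are all present."""
    return [row for row in rows if is_complete(row, band_columns)]

band_columns = ["b1", "b2"]  # placeholder names for illustration
rows = [  # toy rows, not real data
    {"InImage_ID": 1, "b1": 0.4, "b2": -0.1, "Serovar": "Enteritidis"},
    {"InImage_ID": 2, "b1": 0.2, "b2": float("nan"), "Serovar": "Enteritidis"},
]
clean = drop_incomplete(rows, band_columns)
print(len(clean))  # 1
```

The same predicate can be passed to `Dataset.filter` in the `datasets` library, and with pandas `DataFrame.dropna(subset=band_columns)` has the same effect.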
## Baseline Performance
| Model | Modality | Test Accuracy |
| :--- | :--- | ---: |
| PCA-MLP | Spectral only | 81.1% |
| PCA-MLP + EfficientNetV2 | Multimodal fusion (Image + Spectra) | **82.4%** |
## License
This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.
## Citation
```bibtex
@article{papa2025salmonella,
title = {Rapid Salmonella Serovar Classification Using AI-Enabled Hyperspectral Microscopy with Enhanced Data Preprocessing and Multimodal Fusion},
author = {Papa, MeiLi and Bhattacharya, Siddhartha and Park, Bosoon and Yi, Jiyoon},
journal = {Foods},
volume = {14},
number = {15},
pages = {2737},
year = {2025},
doi = {10.3390/foods14152737}
}
```
## Source
Original dataset: [Zenodo 10.5281/zenodo.16740800](https://zenodo.org/records/16740800)

Code repository: [GitHub food-ai-engineering-lab/salmonella-serovar-classification-foods](https://github.com/food-ai-engineering-lab/salmonella-serovar-classification-foods)
Provider:
food-ai-nexus



