Integrating urinary metabolomics and clinical datasets for multi-cancer detection

Name: Integrating urinary metabolomics and clinical datasets for multi-cancer detection
Creator: figshare
Published: 2025-11-26 01:59:00
License: No description available

DataCite Commons: updated 2025-11-26, indexed 2026-04-25

Download link:

https://figshare.com/articles/dataset/Integrating_urinary_metabolomics_and_clinical_datasets_for_multi-cancer_detection/30716096

Resource description:

## Background

This dataset contains raw urinary surface-enhanced Raman scattering (SERS) spectra acquired from participants with cardiometabolic conditions and solid cancers, as well as non-disease controls. The data are intended for method development and benchmarking of machine-learning based diagnostic models.

## Study design and groups

- **Sample type:** spot urine
- **Measurement:** surface-enhanced Raman scattering (SERS), [instrument model / laser wavelength / objective / integration time / SERS substrate: to be filled by data owner]
- **Technical replicates:** 5 SERS acquisitions per subject on the same specimen
- **Groups and sample sizes (subjects × replicates):**
  - Normal controls: 100 × 5 = 500 spectra
  - Hypertension (HTN): 100 × 5 = 500 spectra
  - Diabetes mellitus (DM): 100 × 5 = 500 spectra
  - Hypertension + Diabetes (HTN+DM): 100 × 5 = 500 spectra
  - Colorectal cancer (CRC): 300 × 5 = 1,500 spectra
  - Lung cancer: 200 × 5 = 1,000 spectra
  - Pancreatic cancer: 53 × 5 = 265 spectra
- **Total:** 953 subjects, 4,765 spectra

## File organization

The dataset is organized into seven zip archives, each corresponding to one clinical group, plus a metadata file:

- `normal_SERS.zip`
  - Contains 500 CSV files under the folder `normal_SERS/`
  - File naming pattern: `NOR _.CSV`
- `HTN_SERS.zip`
  - Contains 500 CSV files under the folder `HTN_SERS/`
  - File naming pattern: `HBP _.CSV`
- `DM_SERS.zip`
  - Contains 500 CSV files under the folder `DM_SERS/`
  - File naming pattern: `DIA _.CSV`
- `HTN+DM_SERS.zip`
  - Contains 500 CSV files under the folder `HTN+DM_SERS/`
  - File naming pattern: `H.D. _.CSV`
- `colorectal+cancer_SERS.zip`
  - Contains 1,500 CSV files under the folder `colorectal+cancer_SERS/`
  - File naming pattern: `CRC _.CSV`
- `lung+cancer_SERS.zip`
  - Contains 1,000 CSV files under the folder `lung+cancer_SERS/`
  - File naming pattern: `LUN _.CSV`
- `pancreatic+cancer_SERS.zip`
  - Contains 265 CSV files under the folder `pancreatic+cancer_SERS/`
  - File naming pattern: `SPAN _.CSV`
- `sample_metadata.csv`
  - Sample-level metadata linking each spectrum file to its clinical group, subject, and replicate index.

## `sample_metadata.csv` columns

The `sample_metadata.csv` file has one row per SERS spectrum (4,765 rows in total) and the following columns:

- `group`: descriptive group label
  - e.g., `Normal control`, `Hypertension`, `Diabetes mellitus`, `Hypertension + Diabetes`, `Colorectal cancer`, `Lung cancer`, `Pancreatic cancer`.
- `group_code`: short group code
  - e.g., `Normal`, `HTN`, `DM`, `HTN+DM`, `CRC`, `LungCA`, `PancreasCA`.
- `original_prefix`: prefix as it appears in the original file names
  - `NOR`, `HBP`, `DIA`, `H.D.`, `CRC`, `LUN`, `SPAN`.
- `canonical_prefix`: cleaned/standardized prefix used for constructing `sample_id`
  - `NOR`, `HBP`, `DIA`, `HD`, `CRC`, `LUN`, `SPAN`.
  - For example, `H.D.` → `HD`.
- `subject_id`: integer subject identifier within each prefix (1–100, 1–300, 1–200, or 1–53 depending on group).
- `sample_id`: standardized subject identifier combining `canonical_prefix` and zero-padded `subject_id`
  - e.g., `NOR_001`, `HBP_093`, `DIA_048`, `HD_027`, `CRC_077`, `LUN_151`, `SPAN_022`.
- `replicate_index`: technical replicate index (1–5).
- `filename`: original CSV file name (e.g., `HBP 93_5.CSV`).
- `filepath_in_zip`: relative path to the CSV file inside the corresponding zip archive (e.g., `HTN_SERS/HBP 93_5.CSV`).
- `zip_file`: name of the zip archive that contains this file (e.g., `HTN_SERS.zip`).

## Data format

- Each CSV file contains **two columns** without a header:
  1. Raman shift (cm⁻¹), typically spanning ~50–3300 cm⁻¹
  2. SERS intensity (arbitrary units)
- All spectra have a uniform number of data points (rows) per file.
- No baseline correction, smoothing, normalization, or other signal processing has been applied.
  - These spectra should be considered **raw** measurements.

## Recommended usage

This dataset is suitable for:

- Development and benchmarking of:
  - Preprocessing algorithms (baseline correction, denoising, normalization).
  - Feature extraction and dimensionality reduction methods for SERS.
  - Diagnostic and multi-disease classification models based on SERS spectra.
- Methodological studies on:
  - Handling of technical replicates.
  - Cross-disease model generalization and domain adaptation.

Users are encouraged to:

- Implement and clearly describe their own preprocessing and validation strategies.
- Report details such as train/validation splits, cross-validation schemes, and performance metrics when publishing work based on this dataset.
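
## Example: loading spectra and splitting by subject

Because each subject contributes five technical replicates, a random split over spectra can place replicates of the same subject in both training and test sets. The following is a minimal sketch of one way to read the files and build a subject-level split using only the Python standard library. The function names (`read_spectrum`, `load_metadata`, `split_by_subject`) are hypothetical helpers, not part of the dataset. The sketch assumes the CSV files are comma-delimited and UTF-8 encoded, which the description does not state, and the 20% test fraction is an arbitrary illustrative choice rather than a recommendation from the data owner.

```python
import csv
import io
import random
import zipfile
from collections import defaultdict


def read_spectrum(zip_path, filepath_in_zip):
    """Read one headerless two-column CSV (Raman shift, intensity) from a zip.

    Assumes comma-delimited, UTF-8 text; adjust if the files differ.
    """
    shifts, intensities = [], []
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(filepath_in_zip) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            for row in csv.reader(text):
                if len(row) < 2:
                    continue  # skip blank or malformed lines
                shifts.append(float(row[0]))
                intensities.append(float(row[1]))
    return shifts, intensities


def load_metadata(metadata_file):
    """Load sample_metadata.csv rows as dictionaries keyed by column name."""
    with open(metadata_file, newline="", encoding="utf-8-sig") as fh:
        return list(csv.DictReader(fh))


def split_by_subject(rows, test_fraction=0.2, seed=0):
    """Split metadata rows so all replicates of a subject stay on one side.

    Subjects are sampled separately within each group_code, so every
    group is represented in the test set.
    """
    subjects_by_group = defaultdict(set)
    for row in rows:
        subjects_by_group[row["group_code"]].add(row["sample_id"])

    rng = random.Random(seed)
    test_subjects = set()
    for group in sorted(subjects_by_group):
        subjects = sorted(subjects_by_group[group])
        rng.shuffle(subjects)
        n_test = max(1, round(len(subjects) * test_fraction))
        test_subjects.update(subjects[:n_test])

    train = [r for r in rows if r["sample_id"] not in test_subjects]
    test = [r for r in rows if r["sample_id"] in test_subjects]
    return train, test
```

A typical call would pass a metadata row's `zip_file` and `filepath_in_zip` values to `read_spectrum`, for example `read_spectrum("HTN_SERS.zip", "HTN_SERS/HBP 93_5.CSV")`, with the archive path resolved relative to wherever the archives were downloaded. Splitting on `sample_id` rather than on individual files keeps the five replicates of a subject together, which is one common way to avoid optimistic performance estimates.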

Provided by: figshare

Created: 2025-11-26
