
Integrating urinary metabolomics and clinical datasets for multi-cancer detection

Source: DataCite Commons. Updated 2025-11-26; indexed 2026-04-25.
Download link:
https://figshare.com/articles/dataset/Integrating_urinary_metabolomics_and_clinical_datasets_for_multi-cancer_detection/30716096
Resource description:
## Background

This dataset contains raw urinary surface-enhanced Raman scattering (SERS) spectra acquired from participants with cardiometabolic conditions and solid cancers, as well as non-disease controls. The data are intended for method development and benchmarking of machine-learning based diagnostic models.

## Study design and groups

- **Sample type:** spot urine
- **Measurement:** surface-enhanced Raman scattering (SERS), [instrument model / laser wavelength / objective / integration time / SERS substrate: to be filled by data owner]
- **Technical replicates:** 5 SERS acquisitions per subject on the same specimen
- **Groups and sample sizes (subjects × replicates):**
  - Normal controls: 100 × 5 = 500 spectra
  - Hypertension (HTN): 100 × 5 = 500 spectra
  - Diabetes mellitus (DM): 100 × 5 = 500 spectra
  - Hypertension + Diabetes (HTN+DM): 100 × 5 = 500 spectra
  - Colorectal cancer (CRC): 300 × 5 = 1,500 spectra
  - Lung cancer: 200 × 5 = 1,000 spectra
  - Pancreatic cancer: 53 × 5 = 265 spectra
- **Total:** 953 subjects, 4,765 spectra

## File organization

The dataset is organized into seven zip archives, each corresponding to one clinical group, plus a metadata file:

- `normal_SERS.zip`
  - Contains 500 CSV files under the folder `normal_SERS/`
  - File naming pattern: `NOR _.CSV`
- `HTN_SERS.zip`
  - Contains 500 CSV files under the folder `HTN_SERS/`
  - File naming pattern: `HBP _.CSV`
- `DM_SERS.zip`
  - Contains 500 CSV files under the folder `DM_SERS/`
  - File naming pattern: `DIA _.CSV`
- `HTN+DM_SERS.zip`
  - Contains 500 CSV files under the folder `HTN+DM_SERS/`
  - File naming pattern: `H.D. _.CSV`
- `colorectal+cancer_SERS.zip`
  - Contains 1,500 CSV files under the folder `colorectal+cancer_SERS/`
  - File naming pattern: `CRC _.CSV`
- `lung+cancer_SERS.zip`
  - Contains 1,000 CSV files under the folder `lung+cancer_SERS/`
  - File naming pattern: `LUN _.CSV`
- `pancreatic+cancer_SERS.zip`
  - Contains 265 CSV files under the folder `pancreatic+cancer_SERS/`
  - File naming pattern: `SPAN _.CSV`
- `sample_metadata.csv`
  - Sample-level metadata linking each spectrum file to its clinical group, subject, and replicate index.

## `sample_metadata.csv` columns

The `sample_metadata.csv` file has one row per SERS spectrum (4,765 rows in total) and the following columns:

- `group`: descriptive group label
  - e.g., `Normal control`, `Hypertension`, `Diabetes mellitus`, `Hypertension + Diabetes`, `Colorectal cancer`, `Lung cancer`, `Pancreatic cancer`.
- `group_code`: short group code
  - e.g., `Normal`, `HTN`, `DM`, `HTN+DM`, `CRC`, `LungCA`, `PancreasCA`.
- `original_prefix`: prefix as it appears in the original file names
  - `NOR`, `HBP`, `DIA`, `H.D.`, `CRC`, `LUN`, `SPAN`.
- `canonical_prefix`: cleaned/standardized prefix used for constructing `sample_id`
  - `NOR`, `HBP`, `DIA`, `HD`, `CRC`, `LUN`, `SPAN`.
  - For example, `H.D.` → `HD`.
- `subject_id`: integer subject identifier within each prefix (1–100, 1–300, 1–200, or 1–53 depending on group).
- `sample_id`: standardized subject identifier combining `canonical_prefix` and zero-padded `subject_id`
  - e.g., `NOR_001`, `HBP_093`, `DIA_048`, `HD_027`, `CRC_077`, `LUN_151`, `SPAN_022`.
- `replicate_index`: technical replicate index (1–5).
- `filename`: original CSV file name (e.g., `HBP 93_5.CSV`).
- `filepath_in_zip`: relative path to the CSV file inside the corresponding zip archive (e.g., `HTN_SERS/HBP 93_5.CSV`).
- `zip_file`: name of the zip archive that contains this file (e.g., `HTN_SERS.zip`).

## Data format

- Each CSV file contains **two columns** without a header:
  1. Raman shift (cm⁻¹), typically spanning ~50–3300 cm⁻¹
  2. SERS intensity (arbitrary units)
- All spectra have a uniform number of data points (rows) per file.
- No baseline correction, smoothing, normalization, or other signal processing has been applied.
  - These spectra should be considered **raw** measurements.

## Recommended usage

This dataset is suitable for:

- Development and benchmarking of:
  - Preprocessing algorithms (baseline correction, denoising, normalization).
  - Feature extraction and dimensionality reduction methods for SERS.
  - Diagnostic and multi-disease classification models based on SERS spectra.
- Methodological studies on:
  - Handling of technical replicates.
  - Cross-disease model generalization and domain adaptation.

Users are encouraged to:

- Implement and clearly describe their own preprocessing and validation strategies.
- Report details such as train/validation splits, cross-validation schemes, and performance metrics when publishing work based on this dataset.
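
The file-name and `sample_id` conventions described above can be illustrated with a minimal Python sketch. It assumes the file names follow a `<prefix> <subject>_<replicate>.CSV` layout, which is inferred only from the example `HBP 93_5.CSV`, and that the spectrum files are comma-delimited. The helper names `parse_filename` and `read_spectrum` are hypothetical and not part of the dataset.

```python
import csv
import io
import re

# Original prefix -> canonical prefix, as listed in the column description.
CANONICAL_PREFIX = {
    "NOR": "NOR", "HBP": "HBP", "DIA": "DIA", "H.D.": "HD",
    "CRC": "CRC", "LUN": "LUN", "SPAN": "SPAN",
}

# Assumed layout "<prefix> <subject>_<replicate>.CSV" (inferred from "HBP 93_5.CSV").
FILENAME_RE = re.compile(
    r"^(?P<prefix>\S+)\s+(?P<subject>\d+)_(?P<replicate>\d+)\.CSV$",
    re.IGNORECASE,
)


def parse_filename(filename):
    """Split a spectrum file name into the fields used in sample_metadata.csv."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename!r}")
    prefix = match.group("prefix")
    canonical = CANONICAL_PREFIX[prefix]
    subject_id = int(match.group("subject"))
    return {
        "original_prefix": prefix,
        "canonical_prefix": canonical,
        "subject_id": subject_id,
        # Three-digit zero padding matches examples such as HBP_093.
        "sample_id": f"{canonical}_{subject_id:03d}",
        "replicate_index": int(match.group("replicate")),
    }


def read_spectrum(text):
    """Parse headerless two-column CSV text into (raman_shift, intensity) lists."""
    shifts, intensities = [], []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        shifts.append(float(row[0]))
        intensities.append(float(row[1]))
    return shifts, intensities


if __name__ == "__main__":
    print(parse_filename("HBP 93_5.CSV")["sample_id"])  # HBP_093
```

To read a spectrum straight from an archive, the text passed to `read_spectrum` could come from `zipfile.ZipFile("HTN_SERS.zip").read("HTN_SERS/HBP 93_5.CSV").decode()`, using the `zip_file` and `filepath_in_zip` columns; the file encoding is not stated in the description and may need adjusting.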
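
Because every subject contributes five replicate spectra, a random split over individual spectra can place replicates of the same subject in both the training and the test set. One possible way to avoid this, sketched below, is to split on `sample_id` so that all replicates of a subject stay together. This is an illustration only, not a procedure prescribed by the data owner; the function name `split_by_subject` and its parameters are hypothetical.

```python
import random


def split_by_subject(sample_ids, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with no subject on both sides.

    sample_ids holds one entry per spectrum, e.g. the sample_id column
    of sample_metadata.csv, so replicates share the same value.
    """
    subjects = sorted(set(sample_ids))
    rng = random.Random(seed)  # seeded for a reproducible split
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(sample_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(sample_ids) if s in test_subjects]
    return train_idx, test_idx


if __name__ == "__main__":
    # Three toy subjects with five replicates each.
    ids = ["NOR_001"] * 5 + ["NOR_002"] * 5 + ["CRC_001"] * 5
    train, test = split_by_subject(ids, test_fraction=0.34)
    print(len(train), len(test))  # 10 5
```

The sketch does not stratify by clinical group; with unequal group sizes (for example 53 pancreatic cancer subjects versus 300 colorectal cancer subjects), applying the same split within each group separately may be preferable.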
Provider:
figshare
Created:
2025-11-26