Integrating urinary metabolomics and clinical datasets for multi-cancer detection
Source: DataCite Commons | Updated: 2025-11-26 | Indexed: 2026-05-03
Download link:
https://figshare.com/articles/dataset/Integrating_urinary_metabolomics_and_clinical_datasets_for_multi-cancer_detection/30716096/1
Description:
## Background

This dataset contains raw urinary surface-enhanced Raman scattering (SERS) spectra acquired from participants with cardiometabolic conditions and solid cancers, as well as non-disease controls. The data are intended for method development and benchmarking of machine-learning based diagnostic models.

## Study design and groups

- **Sample type:** spot urine
- **Measurement:** surface-enhanced Raman scattering (SERS), [instrument model / laser wavelength / objective / integration time / SERS substrate: to be filled by data owner]
- **Technical replicates:** 5 SERS acquisitions per subject on the same specimen
- **Groups and sample sizes (subjects × replicates):**
  - Normal controls: 100 × 5 = 500 spectra
  - Hypertension (HTN): 100 × 5 = 500 spectra
  - Diabetes mellitus (DM): 100 × 5 = 500 spectra
  - Hypertension + Diabetes (HTN+DM): 100 × 5 = 500 spectra
  - Colorectal cancer (CRC): 300 × 5 = 1,500 spectra
  - Lung cancer: 200 × 5 = 1,000 spectra
  - Pancreatic cancer: 53 × 5 = 265 spectra
- **Total:** 953 subjects, 4,765 spectra

## File organization

The dataset is organized into seven zip archives, each corresponding to one clinical group, plus a metadata file:

- `normal_SERS.zip`
  - Contains 500 CSV files under the folder `normal_SERS/`
  - File naming pattern: `NOR _.CSV`
- `HTN_SERS.zip`
  - Contains 500 CSV files under the folder `HTN_SERS/`
  - File naming pattern: `HBP _.CSV`
- `DM_SERS.zip`
  - Contains 500 CSV files under the folder `DM_SERS/`
  - File naming pattern: `DIA _.CSV`
- `HTN+DM_SERS.zip`
  - Contains 500 CSV files under the folder `HTN+DM_SERS/`
  - File naming pattern: `H.D. _.CSV`
- `colorectal+cancer_SERS.zip`
  - Contains 1,500 CSV files under the folder `colorectal+cancer_SERS/`
  - File naming pattern: `CRC _.CSV`
- `lung+cancer_SERS.zip`
  - Contains 1,000 CSV files under the folder `lung+cancer_SERS/`
  - File naming pattern: `LUN _.CSV`
- `pancreatic+cancer_SERS.zip`
  - Contains 265 CSV files under the folder `pancreatic+cancer_SERS/`
  - File naming pattern: `SPAN _.CSV`
- `sample_metadata.csv`
  - Sample-level metadata linking each spectrum file to its clinical group, subject, and replicate index.

## `sample_metadata.csv` columns

The `sample_metadata.csv` file has one row per SERS spectrum (4,765 rows in total) and the following columns:

- `group`: descriptive group label
  - e.g., `Normal control`, `Hypertension`, `Diabetes mellitus`, `Hypertension + Diabetes`, `Colorectal cancer`, `Lung cancer`, `Pancreatic cancer`.
- `group_code`: short group code
  - e.g., `Normal`, `HTN`, `DM`, `HTN+DM`, `CRC`, `LungCA`, `PancreasCA`.
- `original_prefix`: prefix as it appears in the original file names
  - `NOR`, `HBP`, `DIA`, `H.D.`, `CRC`, `LUN`, `SPAN`.
- `canonical_prefix`: cleaned/standardized prefix used for constructing `sample_id`
  - `NOR`, `HBP`, `DIA`, `HD`, `CRC`, `LUN`, `SPAN`.
  - For example, `H.D.` → `HD`.
- `subject_id`: integer subject identifier within each prefix (1–100, 1–300, 1–200, or 1–53 depending on group).
- `sample_id`: standardized subject identifier combining `canonical_prefix` and zero-padded `subject_id`
  - e.g., `NOR_001`, `HBP_093`, `DIA_048`, `HD_027`, `CRC_077`, `LUN_151`, `SPAN_022`.
- `replicate_index`: technical replicate index (1–5).
- `filename`: original CSV file name (e.g., `HBP 93_5.CSV`).
- `filepath_in_zip`: relative path to the CSV file inside the corresponding zip archive (e.g., `HTN_SERS/HBP 93_5.CSV`).
- `zip_file`: name of the zip archive that contains this file (e.g., `HTN_SERS.zip`).

## Data format

- Each CSV file contains **two columns** without a header:
  1. Raman shift (cm⁻¹), typically spanning ~50–3300 cm⁻¹
  2. SERS intensity (arbitrary units)
- All spectra have a uniform number of data points (rows) per file.
- No baseline correction, smoothing, normalization, or other signal processing has been applied.
  - These spectra should be considered **raw** measurements.

## Recommended usage

This dataset is suitable for:

- Development and benchmarking of:
  - Preprocessing algorithms (baseline correction, denoising, normalization).
  - Feature extraction and dimensionality reduction methods for SERS.
  - Diagnostic and multi-disease classification models based on SERS spectra.
- Methodological studies on:
  - Handling of technical replicates.
  - Cross-disease model generalization and domain adaptation.

Users are encouraged to:

- Implement and clearly describe their own preprocessing and validation strategies.
- Report details such as train/validation splits, cross-validation schemes, and performance metrics when publishing work based on this dataset.
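The naming conventions described for `sample_metadata.csv` can be illustrated with a short sketch that derives `canonical_prefix`, `subject_id`, `sample_id`, and `replicate_index` from a file name, and re-checks the stated totals. This is not the data owner's script. The file-name layout (`<prefix> <subject>_<replicate>.CSV`) is an assumption inferred from the single example `HBP 93_5.CSV`, the three-digit zero padding is inferred from the example identifiers, and the names `parse_filename`, `FILENAME_RE`, and `SUBJECTS_PER_GROUP` are hypothetical.

```python
import re

# original_prefix -> canonical_prefix, as listed in the column description.
CANONICAL_PREFIX = {
    "NOR": "NOR", "HBP": "HBP", "DIA": "DIA", "H.D.": "HD",
    "CRC": "CRC", "LUN": "LUN", "SPAN": "SPAN",
}

# Assumed layout "<prefix> <subject>_<replicate>.CSV" (inferred from "HBP 93_5.CSV").
FILENAME_RE = re.compile(
    r"^(?P<prefix>[A-Z.]+)\s+(?P<subject>\d+)_(?P<replicate>\d+)\.CSV$",
    re.IGNORECASE,
)


def parse_filename(filename):
    """Split a spectrum file name into the fields used in sample_metadata.csv."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    prefix = match.group("prefix").upper()
    if prefix not in CANONICAL_PREFIX:
        raise ValueError(f"unknown prefix: {prefix!r}")
    canonical = CANONICAL_PREFIX[prefix]
    subject_id = int(match.group("subject"))
    return {
        "original_prefix": prefix,
        "canonical_prefix": canonical,
        "subject_id": subject_id,
        # Zero-padded to three digits, matching examples such as HBP_093.
        "sample_id": f"{canonical}_{subject_id:03d}",
        "replicate_index": int(match.group("replicate")),
    }


# Subjects per group as stated in the study design; 5 replicates each.
SUBJECTS_PER_GROUP = {
    "Normal": 100, "HTN": 100, "DM": 100, "HTN+DM": 100,
    "CRC": 300, "LungCA": 200, "PancreasCA": 53,
}
REPLICATES_PER_SUBJECT = 5

total_subjects = sum(SUBJECTS_PER_GROUP.values())
total_spectra = total_subjects * REPLICATES_PER_SUBJECT

if __name__ == "__main__":
    print(parse_filename("HBP 93_5.CSV"))
    print(total_subjects, total_spectra)
```

Under these assumptions, `HBP 93_5.CSV` maps to `sample_id` `HBP_093` with `replicate_index` 5, and the group sizes sum to 953 subjects and 4,765 spectra, in agreement with the totals stated above. Comparing the parsed fields against the shipped `sample_metadata.csv` would be a reasonable sanity check before relying on the pattern.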
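Because each subject contributes five technical replicates, a random split over individual spectra would place replicates of the same specimen on both sides of a train/validation boundary. The sketch below shows one possible way to read a headerless two-column spectrum and to split by `sample_id` instead. It is an illustration only: it assumes the CSV files are comma-delimited (the description says only "CSV"), uses the standard library rather than any particular ML toolkit, and the function names `read_spectrum` and `split_by_subject` are hypothetical.

```python
import csv
import random
from collections import defaultdict


def read_spectrum(fileobj):
    """Read a headerless two-column CSV into (raman_shift, intensity) lists.

    Assumes comma-separated values; blank lines are skipped.
    """
    shifts, intensities = [], []
    for row in csv.reader(fileobj):
        if not row:
            continue
        shifts.append(float(row[0]))
        intensities.append(float(row[1]))
    return shifts, intensities


def split_by_subject(sample_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that all replicates of a subject
    (rows sharing one sample_id) fall on the same side of the split."""
    rows_by_subject = defaultdict(list)
    for row_index, sample_id in enumerate(sample_ids):
        rows_by_subject[sample_id].append(row_index)

    subjects = sorted(rows_by_subject)
    random.Random(seed).shuffle(subjects)  # reproducible shuffle
    n_test = max(1, round(len(subjects) * test_fraction))

    test_idx = sorted(i for s in subjects[:n_test] for i in rows_by_subject[s])
    train_idx = sorted(i for s in subjects[n_test:] for i in rows_by_subject[s])
    return train_idx, test_idx


if __name__ == "__main__":
    # Path taken from the filepath_in_zip example, after extracting HTN_SERS.zip.
    with open("HTN_SERS/HBP 93_5.CSV", newline="") as handle:
        shifts, intensities = read_spectrum(handle)
    print(len(shifts), "data points")
```

In practice the `sample_ids` argument would be the `sample_id` column of `sample_metadata.csv`, in row order. The same grouping idea extends to cross-validation: folds are formed over subjects, not over spectra.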
Provider:
figshare
Created:
2025-11-26



