FeatureHunter
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/bhtfp3sk4g
Description:
This dataset supports the study “Deciphering the Carcinogenic and Prognostic Implications of Food Colorants through Network Toxicology, Machine Learning, and Deep Learning”.
The research hypothesis of this work is that synthetic food colorants—though widely used and considered safe at dietary levels—may interact with key oncogenic pathways through molecular binding and transcriptional modulation, contributing to cancer risk and prognosis heterogeneity. To test this hypothesis, we integrated toxicogenomic, molecular, and transcriptomic evidence into a reproducible computational workflow for biomarker discovery and risk modeling.
The data provided here include processed transcriptomic matrices, curated label files, model outputs, and reproducible R scripts used to identify diagnostic and prognostic gene signatures. Specifically, the DatasetA–D.txt files correspond to TCGA training, validation, and two independent GEO test cohorts. The scripts (Testflight.R, Analysis-COAD.R, and Analysis-LUSC.R) implement the full FeatureHunter framework—an open-source R package designed for interpretable multi-model benchmarking and feature-importance fusion. Among them, Testflight.R serves as a universal demonstration pipeline, while Analysis-COAD.R and Analysis-LUSC.R are optimized for colorectal adenocarcinoma (COAD) and lung squamous cell carcinoma (LUSC), respectively.
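The description does not say how FeatureHunter fuses feature importances across models, so the following is only a minimal Python sketch of one common approach, not the package's R implementation. It assumes each model yields a feature-to-importance mapping, scales the absolute values to [0, 1] per model, and averages them. The function names `minmax_scale` and `fuse_importances`, the gene names, and the scores are all hypothetical.

```python
def minmax_scale(scores):
    """Scale a {feature: importance} dict to [0, 1]; constant inputs map to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {f: (v - lo) / span if span else 0.0 for f, v in scores.items()}


def fuse_importances(per_model):
    """Average min-max-scaled absolute importances across models.

    Assumption: a feature missing from a model contributes 0 for that model.
    Returns (feature, fused_score) pairs sorted from most to least important.
    """
    scaled = [
        minmax_scale({f: abs(v) for f, v in scores.items()})
        for scores in per_model.values()
    ]
    features = set().union(*scaled)
    fused = {f: sum(s.get(f, 0.0) for s in scaled) / len(scaled) for f in features}
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical importances: LASSO coefficients and random-forest scores.
per_model = {
    "lasso": {"GENE_A": 0.8, "GENE_B": -0.2, "GENE_C": 0.0},
    "rf": {"GENE_A": 0.5, "GENE_B": 0.3, "GENE_C": 0.2},
}
ranking = fuse_importances(per_model)
for feature, score in ranking:
    print(f"{feature}\t{score:.3f}")
```

Scaling per model before averaging keeps a model with large raw values (for example, regression coefficients) from dominating one with small ones. Rank-based fusion would be a reasonable alternative under the same assumptions.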
All results generated by these scripts are organized in the output/ folder, containing feature-importance tables, visualization figures (bar plots, UMAPs, stability heatmaps), and diagnostic model summaries. The SessionInfo.txt file records the R version and package environment to ensure reproducibility.
Notably, FeatureHunter does not introduce new machine-learning models but integrates existing algorithms (LASSO, SVM, RF, MLP, etc.) into a unified, interpretable system. A key innovation is the implementation of adaptive imbalance correction for deep neural networks: when training data exhibit extreme class imbalance, the algorithm automatically adjusts decision thresholds and batch sampling ratios. In addition, an independent validation dataset is used to optimize cutoff thresholds for more robust model generalization.
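The two imbalance-handling ideas mentioned above, choosing a decision cutoff on a validation set and rebalancing how samples are drawn into batches, can be illustrated with a short Python sketch. This is not FeatureHunter's code. It assumes Youden's J statistic as the cutoff criterion and inverse class frequency as the sampling weight; the description names neither, and the labels and probabilities are made up.

```python
def youden_threshold(y_true, y_prob):
    """Return (cutoff, J) maximizing sensitivity + specificity - 1.

    Assumption: Youden's J is the selection criterion; other criteria
    (F1, cost-weighted error) would follow the same loop structure.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("validation set must contain both classes")
    best_t, best_j = 0.5, -1.0
    for t in sorted(set(y_prob)):  # candidate cutoffs: observed probabilities
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def balanced_sample_weights(y):
    """Per-sample sampling probabilities giving each class equal total mass."""
    counts = {c: y.count(c) for c in set(y)}
    return [1.0 / (len(counts) * counts[c]) for c in y]


# Hypothetical validation cohort with an 8:2 class imbalance.
y_val = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p_val = [0.05, 0.10, 0.12, 0.20, 0.22, 0.25, 0.30, 0.41, 0.35, 0.60]

cutoff, j_stat = youden_threshold(y_val, p_val)
weights = balanced_sample_weights(y_val)
print(cutoff, j_stat)
```

With the default cutoff of 0.5 this toy classifier would miss one of the two positives; the validation-tuned cutoff recovers both at the cost of one false positive. The weights can be passed to any weighted sampler, such as `random.choices(..., weights=weights)`, to draw class-balanced mini-batches.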
These data can be used to reproduce the results reported in the manuscript, benchmark alternative feature-selection strategies, or extend the methodology to other biomedical datasets. All code and parameters are fully documented in the accompanying GitHub repository:
👉 https://github.com/ZackLiuzeyu/FeatureHunter
Together, this dataset enables full transparency and reproducibility of all computational analyses, supports validation of colorant–cancer associations, and provides a generalizable framework for interpretable machine-learning–based biomarker discovery in toxicogenomics.
Created:
2025-10-11



