five

Datasets used for model demonstrations.

Source: Figshare. Updated: 2024-05-20. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Datasets_used_for_model_demonstrations_/25863798
Description:
Random forests have emerged as a promising tool in comparative metagenomics because they can predict environmental characteristics from microbial composition in datasets where β-diversity metrics fail to reveal meaningful relationships between samples. Despite this efficacy, their predictions come without accompanying biological insight, which may hinder scientific advancement. To overcome this limitation, we leverage a geometric characterization of random forests to introduce a data-driven phylogenetic β-diversity metric, the adaptive Haar-like distance. This new metric assigns a weight to each internal node (i.e., split or bifurcation) of a reference phylogeny, indicating the relative importance of that node in discerning environmental samples based on their microbial composition. In addition, a weighted nearest-neighbors classifier constructed using the adaptive metric can serve as a proxy for the random forest while maintaining accuracy on par with that of the original forest and of another state-of-the-art classifier, CoDaCoRe. As shown on datasets from diverse microbial environments, and unlike the random forest alone, the new metric and classifier significantly enhance the biological interpretability and visualization of high-dimensional metagenomic samples.
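To make the idea of per-node weights concrete, the following is a minimal sketch, not the authors' implementation. It assumes each sample has already been mapped to one numeric coordinate per internal node of the reference phylogeny, and it assumes the distance takes a weighted Euclidean form; the description above states neither detail. The function names, the toy coordinates, and the hand-picked weights are all hypothetical. In the described method the weights are derived from a random forest, which is not reproduced here.

```python
import math
from collections import defaultdict


def weighted_node_distance(x, y, weights):
    """Weighted Euclidean distance over per-node coordinates (assumed form)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def weighted_knn_predict(query, samples, labels, weights, k=3):
    """Predict a label by inverse-distance-weighted vote among the k nearest samples."""
    ranked = sorted(
        (weighted_node_distance(query, s, weights), lab)
        for s, lab in zip(samples, labels)
    )
    votes = defaultdict(float)
    for dist, lab in ranked[:k]:
        # Closer neighbors get larger votes; the epsilon avoids division by zero.
        votes[lab] += 1.0 / (dist + 1e-12)
    return max(votes, key=votes.get)


# Hypothetical toy data: 4 samples, 3 internal nodes of a reference phylogeny.
samples = [
    [0.8, 0.1, 0.5],
    [0.7, 0.3, 0.1],
    [0.1, 0.2, 0.6],
    [0.2, 0.0, 0.2],
]
labels = ["gut", "gut", "soil", "soil"]

# Hand-picked weights (an assumption): node 0 matters most, node 2 is ignored.
weights = [0.9, 0.1, 0.0]

query = [0.75, 0.2, 0.9]
prediction = weighted_knn_predict(query, samples, labels, weights, k=3)
print(prediction)
```

Because node 2 has weight zero in this toy setup, the query's large difference on that node does not affect the result. This is the sense in which node weights can indicate which splits of the phylogeny drive the separation between environments.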

Created:
2024-05-20