Simulation dataset for Clustering-based loci subsampling helps for robust phylogenomics

Source: DataCite Commons | Updated: 2024-12-07 | Indexed: 2025-01-06
Download link:
https://figshare.com/articles/dataset/Simulation_dataset_for_Clustering-based_loci_subsampling_helps_for_robust_phylogenomics/27873999/3
Description:
Phylogenomic subsampling selects a subset of loci from genome-scale datasets for phylogenetic analysis, either to mitigate data heterogeneity or to reduce computational burden. It typically filters loci by their properties, but there is no consensus on how to do so in practice. We propose a multivariate, clustering-based approach that combines UMAP, a nonlinear manifold-learning method for dimension reduction, with hierarchical clustering to identify clusters of loci and rank them by metrics that carry phylogenetic signal. Using 60 simulated and 16 empirical datasets, we evaluated its performance under varying data conditions, including data source, subsampling size, species tree method, and level of incomplete lineage sorting (ILS). Summary statistics combined from both alignments and gene trees were preferred for clustering analyses, whereas statistics derived solely from gene trees (edge branch lengths and supports) were less reliable, particularly at higher ILS levels. Subsampling strategies based on summary statistics or on discordance between gene trees and the species tree consistently outperformed alternative approaches across all datasets. In empirical datasets, our method improved species tree accuracy even when a substantial portion of loci was filtered out, with minimal adverse effects. The clustering-based approach identifies and retains loci with strong phylogenetic signal while excluding problematic sequences, demonstrating its potential to improve the accuracy of phylogenetic inference.
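The clustering and ranking steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each locus is already represented by a short vector of summary statistics, and it skips the UMAP step entirely: in practice an embedding (for example from the umap-learn package) would be computed first and passed to the clustering. The two statistics, the toy values, the choice of average linkage, and all function names are hypothetical and serve only as illustration.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column so no single statistic dominates the distances."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def average_linkage(points, k):
    """Naive agglomerative clustering: merge the two closest clusters
    (by mean pairwise Euclidean distance) until k clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return mean(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        pairs = [
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters


def subsample(clusters, signal, keep):
    """Rank clusters by mean signal score and return the loci in the top `keep`."""
    ranked = sorted(clusters, key=lambda c: mean(signal[i] for i in c), reverse=True)
    return sorted(i for c in ranked[:keep] for i in c)


# Hypothetical per-locus statistics:
# [proportion of informative sites, mean branch support]
loci = [
    [0.40, 90.0],
    [0.42, 88.0],
    [0.38, 92.0],
    [0.10, 40.0],
    [0.12, 45.0],
    [0.08, 38.0],
]
signal = [row[1] for row in loci]  # assumption: support stands in for signal

clusters = average_linkage(standardize(loci), k=2)
kept = subsample(clusters, signal, keep=1)
print(kept)
```

With these toy values the three high-support loci form one cluster and are the ones retained. The naive merge loop is cubic in the number of loci, so a real analysis would use a library implementation of hierarchical clustering instead.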
Provider:
figshare
Created:
2024-12-07