Data from: In silico phylogenomics using complete genomes: a case study on the evolution of hominoids

Source: DataONE. Updated: 2016-07-14. Indexed: 2024-06-26.

Download link:

https://search.dataone.org/view/null

Resource description:

The increasing availability of complete genome data is facilitating the acquisition of phylogenomic datasets, but the process of obtaining orthologous sequences from other genomes and assembling multiple sequence alignments remains piecemeal and arduous. We designed software that performs these tasks and outputs anonymous loci (AL) or anchor loci (AE/UCE) datasets in ready-to-analyze formats. We demonstrate our program by applying it to the hominoids. Starting with human, chimpanzee, gorilla, and orangutan genomes, our software generated an exhaustive dataset of 292 ALs (~1 kb each) in ~3 hours. Analyses of our AL dataset not only validated the program by yielding a portrait of hominoid evolution in agreement with previous studies, but the accuracy and precision of our estimated ancestral effective population sizes and speciation times represent improvements. We also used our program with a published set of 512 vertebrate-wide AE 'probe' sequences to generate datasets consisting of 171 and 242 independent loci (~1 kb each) in 11 and 13 minutes, respectively. The former dataset consisted of flanking sequences 500 bp from adjacent AEs, while the latter contained sequences bordering AEs. Although our AE datasets produced the expected hominoid species tree, coalescent-based estimates of ancestral population sizes and speciation times based on these data were considerably lower than estimates from our AL dataset and previous studies. Accordingly, we suggest that loci subjected to direct or indirect selection may not be appropriate for coalescent-based methods. Complete in silico approaches, combined with the burgeoning genome databases, will accelerate the pace of phylogenomics.

Created:

2016-07-14
