Data from: Marker development for phylogenomics: the case of Orobanchaceae, a plant family with contrasting nutritional modes

DataONE | Updated 2017-11-22 | Indexed 2024-06-26

Download link:

https://search.dataone.org/view/null

Description:

Phylogenomic approaches, employing next-generation sequencing (NGS) techniques, have revolutionized systematic and evolutionary biology. Target enrichment is an efficient and cost-effective method in phylogenomics and is becoming increasingly popular. Depending on the availability and quality of reference data and on biological features of the study system, (semi-)automated identification of suitable markers will require specific bioinformatic pipelines. Here, we established a highly flexible bioinformatic pipeline, BaitsFinder, to identify putative orthologous single copy genes (SCGs) and to construct bait sequences in a single workflow. Additionally, this pipeline was designed to cope with challenging data sets, such as the nutritionally heterogeneous plant family Orobanchaceae. To this end, we used transcriptome data of differing quality available for four Orobanchaceae species and, as reference, SCG data from monkeyflower (Erythranthe guttata, syn. Mimulus g.; 1,915 genes) and tomato (Solanum lycopersicum; 391 genes). Depending on whether gaps were permitted in initial BLAST searches of the four Orobanchaceae species against the reference, our pipeline identified 1,307 and 981 SCGs with average lengths of 994 bp and 775 bp, respectively. Automated bait sequence construction (using 2× tiling) resulted in 38,170 and 21,856 bait sequences, respectively. In comparison to the recently published MarkerMiner 1.0 pipeline, BaitsFinder identified about 1.6 times as many SCGs (of at least 900 bp length). Skipping steps specific to analyses of Orobanchaceae, BaitsFinder was successfully used in a group of non-parasitic plants (three Asteraceae species and, as reference, SCG data from Arabidopsis thaliana based on previously compiled SCGs). Thus, BaitsFinder is expected to be broadly applicable in groups where only transcriptomes or partial genome data of differing quality are available.
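The 2× tiling mentioned in the description can be illustrated with a minimal Python sketch. This is not BaitsFinder's own code: the function name `tile_baits`, the 120 bp bait length, and the choice to anchor a final bait at the end of the sequence are all assumptions made for illustration, and the sketch does not attempt to reproduce the bait counts reported above. It only shows the general idea that, at 2× tiling, consecutive baits are offset by half a bait length so that most bases are covered by two baits.

```python
def tile_baits(seq, bait_len=120, tiling=2):
    """Cut seq into overlapping baits so most bases are covered `tiling` times.

    bait_len=120 is an assumed value, not taken from the dataset description.
    """
    if bait_len <= 0 or tiling <= 0:
        raise ValueError("bait_len and tiling must be positive")
    if len(seq) < bait_len:
        return []  # too short to yield a single full-length bait

    # At 2x tiling the window advances by half a bait length.
    step = max(1, bait_len // tiling)
    last_start = len(seq) - bait_len
    starts = list(range(0, last_start + 1, step))

    # Assumption: add one extra bait flush with the sequence end so the
    # final bases are not left uncovered.
    if starts[-1] != last_start:
        starts.append(last_start)

    return [seq[s:s + bait_len] for s in starts]


# Toy example: a 310 bp sequence built from a repeated motif.
toy_gene = ("ACGTTGCA" * 39)[:310]
baits = tile_baits(toy_gene)
print(len(baits), "baits of length", len(baits[0]))
```

Flooring the step at one base keeps the function well defined if the tiling depth exceeds the bait length, although such settings would not be used in practice.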

Created:

2017-11-22