Data from: Identification and qualification of 500 nuclear, single-copy, orthologous genes for the Eupulmonata (Gastropoda) using transcriptome sequencing and exon-capture

DataONE · Updated 2016-05-24 · Indexed 2024-06-26

Download link:

https://search.dataone.org/view/null

Description:

The qualification of orthology is a significant challenge when developing large, multi-locus phylogenetic datasets from assembled transcripts. Transcriptome assemblies have various attributes, such as fragmentation, frameshifts, and mis-indexing, which pose problems for automated methods of orthology assessment. Here, we identify a set of orthologous single-copy genes from transcriptome assemblies for the land snails and slugs (Eupulmonata) using a thorough approach to orthology determination involving manual alignment curation, gene tree assessment and sequencing from genomic DNA. We qualified the orthology of 500 nuclear, protein-coding genes from the transcriptome assemblies of 21 eupulmonate species to produce the most complete gene data matrix for a major molluscan lineage to date, in terms of both taxon and character completeness. Exon-capture of the 500 genes for 22 species of Australian Camaenidae successfully captured sequences of 2,825 exons, with only a 3.7% reduction in the data matrix due to the presence of putative paralogs or pseudogenes. The automated pipeline Agalma retrieved the majority of the manually qualified 500 single-copy gene set and identified a further 375 putative single-copy genes, although it failed to account for fragmented transcripts, resulting in lower data matrix completeness. This could potentially explain the minor inconsistencies we observed in the supported topologies for the 21 eupulmonate species between the manually curated and Agalma-equivalent datasets (sharing 458 genes). Overall, our study confirms the utility of the 500 gene set to resolve phylogenetic relationships at a broad range of evolutionary depths, and highlights the importance of addressing fragmentation at the homolog alignment stage.
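The description refers to taxon and character completeness of the gene data matrix without defining them. As a rough illustration only, the Python sketch below computes both for a toy set of alignments. The function name `matrix_completeness`, the input layout (gene name mapped to a dictionary of taxon name to aligned sequence) and the choice of `-` and `?` as missing-data symbols are assumptions made for this example; they are not taken from the dataset or from the authors' pipeline.

```python
from typing import Dict, List, Tuple

# Assumption: only gap ("-") and unknown ("?") symbols count as missing data.
MISSING = {"-", "?"}


def matrix_completeness(
    alignments: Dict[str, Dict[str, str]], taxa: List[str]
) -> Tuple[float, float]:
    """Return (taxon completeness, character completeness).

    Taxon completeness: share of gene-by-taxon cells that hold a sequence
    with at least one non-missing character.
    Character completeness: share of non-missing characters over all
    alignment columns and all taxa; an absent taxon counts as all missing.
    """
    cells_total = cells_filled = 0
    chars_total = chars_filled = 0
    for seqs in alignments.values():
        if not seqs:
            continue  # skip genes with no sequences at all
        # All sequences in one alignment are assumed to have equal length.
        length = len(next(iter(seqs.values())))
        for taxon in taxa:
            seq = seqs.get(taxon, "")
            present = sum(1 for ch in seq if ch not in MISSING)
            cells_total += 1
            chars_total += length
            cells_filled += 1 if present > 0 else 0
            chars_filled += present
    if cells_total == 0 or chars_total == 0:
        return 0.0, 0.0
    return cells_filled / cells_total, chars_filled / chars_total


# Toy data (hypothetical): sp2 is fragmented in geneA and absent from geneB.
toy = {
    "geneA": {"sp1": "ATGC", "sp2": "AT--"},
    "geneB": {"sp1": "GGCC"},
}
taxon_c, char_c = matrix_completeness(toy, ["sp1", "sp2"])
print(f"taxon completeness: {taxon_c:.3f}, character completeness: {char_c:.3f}")
```

In the toy data a fragmented transcript still fills its cell for taxon completeness but lowers character completeness, which is the kind of effect the description attributes to unhandled fragmentation.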

Created:

2016-05-24