Identification and qualification of 500 nuclear, single-copy, orthologous genes for the Eupulmonata (Gastropoda) using transcriptome sequencing and exon capture

Source: NIAID Data Ecosystem (indexed 2026-03-09)

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.fn627

Resource description:

The qualification of orthology is a significant challenge when developing large, multiloci phylogenetic data sets from assembled transcripts. Transcriptome assemblies have various attributes, such as fragmentation, frameshifts and mis-indexing, which pose problems to automated methods of orthology assessment. Here, we identify a set of orthologous single-copy genes from transcriptome assemblies for the land snails and slugs (Eupulmonata) using a thorough approach to orthology determination involving manual alignment curation, gene tree assessment and sequencing from genomic DNA. We qualified the orthology of 500 nuclear, protein-coding genes from the transcriptome assemblies of 21 eupulmonate species to produce the most complete phylogenetic data matrix for a major molluscan lineage to date, both in terms of taxon and character completeness. Exon capture targeting 490 of the 500 genes (those with at least one exon >120 bp) from 22 species of Australian Camaenidae successfully captured sequences of 2825 exons (representing all targeted genes), with only a 3.7% reduction in the data matrix due to the presence of putative paralogs or pseudogenes. The automated pipeline Agalma retrieved the majority of the manually qualified 500 single-copy gene set and identified a further 375 putative single-copy genes, although it failed to account for fragmented transcripts resulting in lower data matrix completeness when considering the original 500 genes. This could potentially explain the minor inconsistencies we observed in the supported topologies for the 21 eupulmonate species between the manually curated and ‘Agalma-equivalent’ data set (sharing 458 genes). Overall, our study confirms the utility of the 500 gene set to resolve phylogenetic relationships at a range of evolutionary depths and highlights the importance of addressing fragmentation at the homolog alignment stage for probe design.
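The exon-capture target criterion mentioned in the description (genes with at least one exon >120 bp) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name, the input layout (a mapping from gene ID to a list of exon lengths in bp), and the toy values are all hypothetical and serve only to show the filter itself.

```python
from typing import Dict, List

# Threshold taken from the description: a gene qualifies if it has
# at least one exon strictly longer than 120 bp.
MIN_EXON_BP = 120


def select_capture_targets(
    exon_lengths: Dict[str, List[int]], min_bp: int = MIN_EXON_BP
) -> List[str]:
    """Return sorted gene IDs that have at least one exon longer than min_bp.

    exon_lengths is a hypothetical input format: gene ID -> exon lengths (bp).
    """
    return sorted(
        gene
        for gene, lengths in exon_lengths.items()
        if any(length > min_bp for length in lengths)
    )


# Made-up toy data, not taken from the dataset.
toy = {"geneA": [90, 150], "geneB": [60, 120], "geneC": [300]}
targets = select_capture_targets(toy)
print(targets)  # ['geneA', 'geneC']
```

In this toy input, geneB is excluded because its longest exon is exactly 120 bp, and the criterion is strictly greater than 120 bp.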

Created:

2016-05-24