Data_Sheet_2_Reads Binning Improves the Assembly of Viral Genome Sequences From Metagenomic Samples.xlsx
Source: frontiersin.figshare.com. Updated 2023-06-01. Indexed 2025-03-23.
Download link:
https://frontiersin.figshare.com/articles/dataset/Data_Sheet_2_Reads_Binning_Improves_the_Assembly_of_Viral_Genome_Sequences_From_Metagenomic_Samples_xlsx/14634405/1
Description:
Metagenomes can be considered mixtures of viral, bacterial, and eukaryotic DNA sequences. Mining viral sequences from metagenomes could shed light on virus–host relationships and expand viral databases. Current alignment-based methods are unsuitable for identifying viral sequences in metagenomic data because most assembled metagenomic contigs are short and contain few or no predicted genes, and most metagenomic viral genes are dissimilar to known viral genes. In this study, I developed a Markov model-based method, VirMC, to identify viral sequences from metagenomic data. VirMC uses Markov chains to model sequence signatures and constructs a scoring model, based on a likelihood test, to distinguish viral from bacterial sequences. Compared with two state-of-the-art viral sequence-prediction methods, VirFinder and PPR-Meta, VirMC outperformed VirFinder and performed similarly to PPR-Meta on short contigs (length less than 400 bp). VirMC outperformed both VirFinder and PPR-Meta at identifying viral sequences in metagenomic samples contaminated with eukaryotic sequences. VirMC also improved the assembly of viral genome sequences from metagenomic data by filtering out potential bacterial reads. Applying VirMC to human gut metagenomes from healthy subjects and patients with type-2 diabetes (T2D) revealed that viral contigs could help distinguish healthy from diseased status. This alignment-free method complements gene-based alignment approaches and will significantly improve the precision of viral sequence identification.
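The description mentions Markov chains over sequence signatures combined with a likelihood-based score. As a rough illustration of that general idea only, and not VirMC's actual implementation, the sketch below trains two fixed-order Markov chains and scores a contig by the difference of their log-likelihoods. The function names, the chain order of 2, the add-one pseudocount, and the toy training sequences are all assumptions made for this example; none of them come from the dataset or the paper.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate log P(next base | previous `order` bases) from training sequences.

    Hypothetical helper for illustration; the smoothing scheme is an assumption.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            ctx, nxt = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous bases such as N.
            if set(ctx + nxt) <= set(ALPHABET):
                counts[ctx][nxt] += 1.0
    model = {}
    for ctx_tuple in product(ALPHABET, repeat=order):
        ctx = "".join(ctx_tuple)
        total = sum(counts[ctx].values()) + pseudocount * len(ALPHABET)
        model[ctx] = {
            base: math.log((counts[ctx][base] + pseudocount) / total)
            for base in ALPHABET
        }
    return model


def log_likelihood(seq, model, order=2):
    """Sum the log transition probabilities along a sequence under one model."""
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        ctx, nxt = seq[i - order:i], seq[i]
        # Windows with ambiguous bases contribute nothing to the score.
        if ctx in model and nxt in model[ctx]:
            total += model[ctx][nxt]
    return total


def llr_score(seq, viral_model, bacterial_model, order=2):
    """Log-likelihood ratio: positive values favour the viral model."""
    return (log_likelihood(seq, viral_model, order)
            - log_likelihood(seq, bacterial_model, order))


if __name__ == "__main__":
    # Toy sequences standing in for real training genomes (illustrative only).
    viral_model = train_markov(["ATATATATATATATATATAT"])
    bacterial_model = train_markov(["GCGCGCGCGCGCGCGCGCGC"])
    for contig in ["ATATATATAT", "GCGCGCGCGC"]:
        print(contig, llr_score(contig, viral_model, bacterial_model))
```

In a sketch like this, a contig would be called viral when its score exceeds a chosen threshold; how VirMC sets its chain order and decision threshold is not stated in the description above.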
Provider:
frontiersin.figshare.com



