Plant virus SNP prediction artificial dataset Performance Study

Source: NIAID Data Ecosystem, indexed 2026-03-14

Download link:

https://zenodo.org/record/7431631


Description:

Recent developments in high-throughput sequencing (HTS) technologies and bioinformatics have drastically changed research on viral pathogens, especially for virus discovery and monitoring. Indeed, proper monitoring of the viral population requires information on the different isolates circulating in the studied area. For this purpose, HTS technologies have greatly facilitated the generation of new genomes of the detected viruses and their comparison. Nevertheless, the bioinformatics analyses that allow the reconstruction of genomes and the detection of single nucleotide polymorphisms (SNPs) can potentially introduce bias, although this has not been widely addressed so far. Therefore, more knowledge is required on the limitations and possibilities of predicting SNPs from HTS-generated sequence datasets. To address this issue, we compared the ability of 14 plant virology laboratories, each employing a different bioinformatics pipeline, to detect 21 variants of pepino mosaic virus (PepMV) through large-scale Performance Testing (PT) using three artificially designed datasets. The bioinformatics analyses were divided into three key steps: read pre-processing (quality trimming, merging …), virus identification (assembly, alignment, mapping …) and variant calling. Each step was evaluated independently through an original, step-by-step PT design with iteration between participants. Overall, this work underlines key parameters in SNP detection and proposes recommendations for reliable variant calling for plant viruses. The identification of the closest reference, the mapping parameters and manual validation of the predictions were the most impactful analysis steps for the success or failure of the predictions. Strategies to improve SNP prediction are also discussed.
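As a rough illustration of what scoring a laboratory's SNP predictions against an artificial dataset with known variants could look like, the sketch below compares a set of predicted SNPs with a truth set and reports precision and recall. It is not the scoring procedure used in the study: the `Snp` record, the `score_predictions` function and all positions and alleles are hypothetical and chosen only for the example.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    """A single-nucleotide variant relative to a reference genome (hypothetical record)."""
    position: int  # 1-based position on the reference
    ref: str       # reference base
    alt: str       # alternative base


def score_predictions(truth: set[Snp], predicted: set[Snp]) -> dict[str, float]:
    """Compare predicted SNPs with the known (designed) SNPs of an artificial dataset."""
    tp = len(truth & predicted)   # SNPs correctly predicted
    fp = len(predicted - truth)   # predicted SNPs that were not designed into the dataset
    fn = len(truth - predicted)   # designed SNPs that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Invented values for illustration only; they are not taken from the dataset.
truth = {Snp(120, "A", "G"), Snp(845, "C", "T"), Snp(2310, "G", "A")}
lab_calls = {Snp(120, "A", "G"), Snp(845, "C", "T"), Snp(3001, "T", "C")}

result = score_predictions(truth, lab_calls)
print(result)
```

Matching on exact position and alleles is the simplest possible criterion; a real comparison would also have to account for which reference each laboratory mapped against, since the description above singles out reference choice as one of the most impactful steps.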


Created:

2022-12-13
