SNP-Associations and Phenotype Predictions from Hundreds of Microbial Genomes without Genome Alignments
Source: NIAID Data Ecosystem (indexed 2026-03-08)
Download link:
https://figshare.com/articles/dataset/_SNP_Associations_and_Phenotype_Predictions_from_Hundreds_of_Microbial_Genomes_without_Genome_Alignments_/948998
Description:
SNP-association studies are a starting point for identifying genes that may be responsible for specific phenotypes, such as disease traits. The vast bulk of tools for SNP-association studies are directed toward SNPs in the human genome, and I am unaware of any tools designed specifically for such studies in bacterial or viral genomes. The PPFS (Predict Phenotypes From SNPs) package described here is an add-on to kSNP, a program that can identify SNPs in a data set of hundreds of microbial genomes. PPFS identifies those SNPs that are non-randomly associated with a phenotype based on the χ² probability, then uses those diagnostic SNPs for two distinct, but related, purposes: (1) to predict the phenotypes of strains whose phenotypes are unknown, and (2) to identify those diagnostic SNPs that are most likely to be causally related to the phenotype. In the example illustrated here, from a set of 68 E. coli genomes, for 67 of which the pathogenicity phenotype was known, there were 418,500 SNPs. Using the phenotypes of 36 of those strains, PPFS identified 207 diagnostic SNPs. The diagnostic SNPs predicted the phenotypes of all of the genomes with 97% accuracy. It then identified 97 SNPs whose probability of being causally related to the pathogenic phenotype was >0.999. In a second example, from a set of 116 E. coli genome sequences, using the phenotypes of 65 strains PPFS identified 101 SNPs that predicted the source host (human or non-human) with 90% accuracy.
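The description says that diagnostic SNPs are chosen by a χ² probability of association with the phenotype. The sketch below illustrates that general idea with a plain Pearson χ² test on an allele-by-phenotype count table. It is not the PPFS code: the function names, the 0.01 cutoff, the input layout (one allele per strain per SNP), and the restriction of the p-value to one degree of freedom (two alleles, two phenotypes) are all assumptions made for illustration.

```python
import math
from collections import Counter


def chi_square_stat(alleles, phenotypes):
    """Pearson chi-square statistic and degrees of freedom for one SNP.

    alleles[i] and phenotypes[i] describe the same strain.
    """
    observed = Counter(zip(alleles, phenotypes))
    allele_totals = Counter(alleles)
    pheno_totals = Counter(phenotypes)
    n = len(alleles)
    stat = 0.0
    for allele, a_count in allele_totals.items():
        for pheno, p_count in pheno_totals.items():
            expected = a_count * p_count / n  # count expected under independence
            stat += (observed[(allele, pheno)] - expected) ** 2 / expected
    dof = (len(allele_totals) - 1) * (len(pheno_totals) - 1)
    return stat, dof


def p_value_1dof(stat):
    """Upper-tail chi-square probability, valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(stat / 2.0))


def diagnostic_snps(snp_alleles, phenotypes, cutoff=0.01):
    """Return names of biallelic SNPs whose association p-value is below cutoff.

    snp_alleles maps a SNP name to its list of alleles, one per strain.
    The cutoff is a hypothetical value, not one taken from PPFS.
    """
    selected = []
    for name, alleles in snp_alleles.items():
        stat, dof = chi_square_stat(alleles, phenotypes)
        if dof == 1 and p_value_1dof(stat) < cutoff:
            selected.append(name)
    return selected


# Toy data: 8 strains, 4 pathogenic and 4 non-pathogenic.
phenotypes = ["path"] * 4 + ["non"] * 4
snps = {
    "snp_linked": ["A"] * 4 + ["G"] * 4,         # allele tracks the phenotype
    "snp_unlinked": ["A", "A", "G", "G"] * 2,    # allele independent of phenotype
}
print(diagnostic_snps(snps, phenotypes))
```

With only eight strains the expected counts are small, so a real analysis would likely want more strains or an exact test; the toy data are there only to show the mechanics.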
Created:
2014-02-28



