When is an SNP not an SNP?
Source: DataCite Commons. Updated 2024-09-12; indexed 2024-11-06.
Download link:
https://tandf.figshare.com/articles/dataset/When_is_an_SNP_not_an_SNP_/27003526
Description:
Genomic duplications are important sources of structural change and gene innovation. In humans, the most recent and highly identical sequences (>90% homology, >1 kb long) are known as segmental duplications (SDs). Single-nucleotide variants or single-nucleotide polymorphisms within SDs have not been systematically assessed due to limitations around mapping short-read sequencing data. Single-nucleotide variant rs62486260 was flagged in a study of familial renal stone disease, but it was unclear whether it was real or an artifact resulting from the presence of an SD. We describe <i>in silico</i> and wet-lab approaches to investigate this, using segment-specific long-PCR assays, followed by short PCR for Sanger sequencing. Our conclusion was that rs62486260 is an artifact. Our approach can be generalized to deal with other such situations.
The method described includes a two-step procedure for determining whether an apparent single-nucleotide polymorphism may be an artifact resulting from the presence of a duplicated genomic region/pseudogene. Step one involves identifying sequence differences between the two duplicated regions and designing a long PCR assay to specifically amplify each region separately. Step two involves amplifying a short PCR product, which flanks the single-nucleotide polymorphism of interest, from the long products generated in step one.
Genomic duplications have long been recognized as important sources of structural alterations and gene innovation. Single-nucleotide variants (SNVs) or single-nucleotide polymorphisms (SNPs) within segmental duplications (SDs) have not been systematically assessed due to limitations around mapping short-read sequencing data. SNV rs62486260 was flagged in a study of familial renal stone disease, but it was unclear whether it was real or an artifact resulting from the presence of an SD.
We describe <i>in silico</i> and wet-lab approaches to investigate this, using segment-specific long-PCR assays, followed by short PCR for Sanger sequencing. The method described includes a two-step procedure (<i>in silico</i> and wet-lab analysis) for determining whether an apparent SNP may be an artifact resulting from the presence of a duplicated genomic region/pseudogene. Step one (<i>in silico</i> analysis) involves identifying sequence differences between the two duplicated regions and designing a long PCR assay to specifically amplify each region separately. Step two (wet-lab analysis) involves amplifying a short PCR product, which flanks the SNP of interest, from the long products generated in step one, followed by Sanger sequencing.
Problems with rs62486260 are noted in both the gnomAD and UCSC databases. The discrepancy between the allele frequency (AF) reported for this variant by UK Biobank (UKBB, ∼0.45) and by gnomAD (∼0.02) hinted at an issue with the variant. The hg19 <i>in silico</i> PCR using the short-product primers predicted that three separate products are amplified, all of identical size. Using the same approach as for hg19, hg38 <i>in silico</i> PCR also revealed three matches to chr7. However, there were only two matches to the ‘fixed’ chromosome chr7_KZ208912v1_fix. Analysis using telomere-to-telomere (T2T) CHM13v2.0/hs1 reveals two hits using the primers. Interestingly, rs62486260 is not reported in either of the two more recent assemblies, T2T CHM13v2.0/hs1 and chr7_KZ208912v1_fix. Long-read sequencing data analysis showed that the ‘SNV’ of interest, when present, lies in the region overlying a pseudogene. Genotyping of the samples confirmed that rs62486260 is an artifact due to the presence of a pseudogene. Based on the latest human genome freeze [Jan. 2022 (T2T CHM13v2.0/hs1)], the centromeric region contains the gene <i>TCAF2</i>, whereas the telomeric region contains the pseudogene.
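The multiple primer hits reported above can be illustrated with a toy exact-match search. This is only a minimal sketch, not the tool or the primers used in the study: the function names, the six-base primers, and the 48-base template are all hypothetical, and real in silico PCR tools also search both strands and tolerate some mismatches.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_all(template: str, query: str) -> list[int]:
    """Return every 0-based start index of query in template, in ascending order."""
    hits, start = [], template.find(query)
    while start != -1:
        hits.append(start)
        start = template.find(query, start + 1)
    return hits


def in_silico_pcr(template: str, forward: str, reverse: str, max_size: int = 4000):
    """List (start, end, size) for each exact-match amplicon on the top strand.

    An amplicon runs from a forward-primer site to the end of the nearest
    downstream site matching the reverse primer's reverse complement.
    """
    template = template.upper()
    rev_sites = find_all(template, reverse_complement(reverse))
    amplicons = []
    for fwd_start in find_all(template, forward.upper()):
        for rev_start in rev_sites:
            end = rev_start + len(reverse)
            size = end - fwd_start
            if rev_start >= fwd_start + len(forward) and size <= max_size:
                amplicons.append((fwd_start, end, size))
                break  # keep only the nearest downstream reverse site
    return amplicons


# Hypothetical toy data: two near-identical copies that differ at one base.
gene_copy = "ACGTAC" + "GGGG" + "CTCAAG"
pseudogene_copy = "ACGTAC" + "GGTG" + "CTCAAG"
toy_template = "TTTT" + gene_copy + "AAAAAAAA" + pseudogene_copy + "TTTT"

products = in_silico_pcr(toy_template, forward="ACGTAC", reverse="CTTGAG")
print(products)  # [(4, 20, 16), (28, 44, 16)]
```

Both copies yield a product of the same size, so a short PCR on genomic DNA cannot tell them apart. This is why the procedure first isolates each copy with a segment-specific long PCR.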
Pseudogenes located in SDs are a hidden peril when determining the likely clinical significance of SNPs reported from genomic sequencing. The observed ‘SNP’ actually lies within a pseudogene and is therefore much less likely to be causally associated with the phenotype of interest.
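Step one of the procedure, identifying sequence differences between the two duplicated regions, can be sketched as a position-by-position comparison. This sketch assumes the two regions are already aligned to equal length. The function names and the 16-base sequences are hypothetical and are not taken from the study.

```python
def paralog_differences(region_a: str, region_b: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, base in A, base in B) for each mismatch.

    Both inputs must be pre-aligned and of equal length; '-' marks a gap.
    """
    if len(region_a) != len(region_b):
        raise ValueError("aligned regions must have the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(region_a.upper(), region_b.upper()))
        if a != b
    ]


def percent_identity(region_a: str, region_b: str) -> float:
    """Return the fraction of aligned columns at which the two regions agree."""
    mismatches = len(paralog_differences(region_a, region_b))
    return 1 - mismatches / len(region_a)


# Hypothetical aligned copies of a duplicated segment.
gene_region = "ACGTACGGGGCTCAAG"
pseudogene_region = "ACGTACGGTGCTCAAG"

print(paralog_differences(gene_region, pseudogene_region))  # [(8, 'G', 'T')]
print(percent_identity(gene_region, pseudogene_region))     # 0.9375
```

Positions returned by `paralog_differences` are candidate sites for anchoring copy-specific long-PCR primers. They are also the positions most likely to be misreported as SNPs when short reads from one copy are mapped to the other.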
Provider:
Taylor & Francis
Created:
2024-09-12



