Lost in translation: the pitfalls of Ensembl gene annotations between human genome assemblies and their impact on diagnostics

Source: DataCite Commons. Updated: 2023-08-29. Indexed: 2024-08-18.
Download link:
https://tandf.figshare.com/articles/dataset/Lost_in_translation_the_pitfalls_of_Ensembl_gene_annotations_between_human_genome_assemblies_and_their_impact_on_diagnostics/23709768
Description:
Gene models based on GRCh37 human genome assembly are preferred by many international projects over other updated assemblies (GRCh38 and T2T). Discrepant genes (DGs), those recognized as protein coding in the new but not the old assembly, are ignored by several genomic resources and discarded by variant prioritization tools relying on information based on GRCh37. We curated a set of Ensembl genes with discrepant annotations between GRCh37 and GRCh38, additionally matching their RefSeq transcripts. Furthermore, we examined their clinical and phenotypic relevance. A total of 337 genes were reclassified as ‘protein-coding’ in GRCh38 but not in GRCh37, with 194 having a discrepant HGNC gene symbol. Many remain missing from the current known RefSeq gene models (N = 73). We found many clinically relevant genes in this group of neglected genes, and we anticipate that many more will be found relevant in the future. Important additional annotations such as evolutionary constraint metrics are also not calculated for these genes, further relegating them into oblivion. For discrepant genes, the inaccurate label of ‘non-protein-coding’ has relevant ramifications on clinical genetics. Accurate collation of these genes allows for manual curation in clinically relevant scenarios.
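
The comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes each annotation has already been loaded into a dictionary keyed by stable gene ID, and that only IDs present in both annotations are compared. The names `GeneRecord` and `find_discrepant_genes`, the toy IDs, and the toy symbols are all hypothetical.

```python
from typing import NamedTuple


class GeneRecord(NamedTuple):
    symbol: str   # gene symbol as annotated in that release
    biotype: str  # e.g. "protein_coding", "lncRNA", "processed_pseudogene"


def find_discrepant_genes(grch37, grch38):
    """Return genes that are protein-coding in GRCh38 but not in GRCh37.

    Both arguments map a stable gene ID to a GeneRecord. Only IDs present
    in both annotations are compared (an assumption of this sketch).
    """
    discrepant = {}
    for gene_id, new in grch38.items():
        old = grch37.get(gene_id)
        if old is None:
            continue  # gene absent from the older annotation: skipped here
        if new.biotype == "protein_coding" and old.biotype != "protein_coding":
            discrepant[gene_id] = {
                "biotype_grch37": old.biotype,
                "symbol_grch37": old.symbol,
                "symbol_grch38": new.symbol,
                "symbol_changed": old.symbol != new.symbol,
            }
    return discrepant


# Toy records with made-up IDs and symbols, not real Ensembl entries.
grch37_toy = {
    "TOY0001": GeneRecord("GENE_A", "lncRNA"),
    "TOY0002": GeneRecord("GENE_B", "protein_coding"),
    "TOY0003": GeneRecord("OLD_C", "processed_pseudogene"),
}
grch38_toy = {
    "TOY0001": GeneRecord("GENE_A", "protein_coding"),
    "TOY0002": GeneRecord("GENE_B", "protein_coding"),
    "TOY0003": GeneRecord("GENE_C", "protein_coding"),
    "TOY0004": GeneRecord("GENE_D", "protein_coding"),
}

discrepant_toy = find_discrepant_genes(grch37_toy, grch38_toy)
for gene_id, info in sorted(discrepant_toy.items()):
    print(gene_id, info["biotype_grch37"], info["symbol_changed"])
```

In this toy input, two genes change biotype to protein-coding and one of them also changes symbol, which mirrors the two kinds of discrepancy the description reports (biotype and HGNC gene symbol).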

Provider:
Taylor & Francis
Created:
2023-07-19