Supplementary information for: NUMT PARSER: Automated identification and removal of nuclear mitochondrial pseudogenes (numts) for accurate mitochondrial genome reconstruction in Panthera

NIAID Data Ecosystem, indexed 2026-03-14

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.6t1g1jx33

Description:

Nuclear mitochondrial pseudogenes (numts) may hinder the reconstruction of mtDNA genomes and affect the reliability of mtDNA datasets for phylogenetic and population genetic comparisons. Here, we present the program Numt Parser, which identifies DNA sequences that likely originate from numt pseudogene DNA. Sequencing reads are classified as originating from either numt or true cytoplasmic mitochondrial (cymt) DNA by direct comparison against cymt and numt reference sequences. Classified reads can then be parsed into cymt or numt datasets. We tested this program using whole genome shotgun-sequenced data from two ancient Cape lions (Panthera leo) because mtDNA is often the marker of choice for ancient DNA studies, and the genus Panthera is known to have numt pseudogenes. Numt Parser decreased sequence disagreements that were likely due to numt pseudogene contamination and equalized read coverage across the mitogenome by removing reads that likely originated from numts. We compared the efficacy of Numt Parser to two other bioinformatic approaches that can be used to account for numt contamination. We found that Numt Parser outperformed approaches that rely only on read alignment or Basic Local Alignment Search Tool (BLAST) properties, and was effective at identifying sequences that likely originated from numts while having minimal impact on the recovery of cymt reads. Numt Parser therefore improves the reconstruction of true mitogenomes, allowing for more accurate and robust biological inferences.

Methods: The dataset contains sequencing reads and alignments generated from ancient DNA of two Cape lion (Panthera leo melanochaitus) samples. Raw reads were aligned to the Panthera leo mitochondrial reference (NCBI Accession KP202262.1) to obtain mitochondrial-specific reads. These mitochondrial reads were then processed using different methods (BLAST, SAMtools, Numt Parser) to identify and filter numt-contaminant reads. See de Flamingh et al. (2022) for additional information on the specific bioinformatic pipeline used and a description of the Numt Parser software.
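
The classification idea described above (compare each read against both a cymt and a numt reference, then sort reads into separate datasets) can be illustrated with a minimal Python sketch. This is not the Numt Parser implementation: the `ReadHit` record, the mismatch-count criterion, the `min_margin` parameter, and the "ambiguous" and "unaligned" bins are all assumptions made for illustration, and the actual decision rule is described in de Flamingh et al. (2022).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class ReadHit:
    """Hypothetical per-read summary of alignments to the two references."""
    read_id: str
    cymt_mismatches: Optional[int]  # None if the read did not align to the cymt reference
    numt_mismatches: Optional[int]  # None if the read did not align to the numt reference


def classify_read(hit: ReadHit, min_margin: int = 1) -> str:
    """Assign a read to the reference it matches better (assumed criterion).

    A read is called "cymt" or "numt" only when one reference beats the
    other by at least `min_margin` mismatches; ties are left "ambiguous".
    """
    if hit.cymt_mismatches is None and hit.numt_mismatches is None:
        return "unaligned"
    if hit.numt_mismatches is None:
        return "cymt"
    if hit.cymt_mismatches is None:
        return "numt"
    # Positive difference: the read is closer to the cymt reference.
    diff = hit.numt_mismatches - hit.cymt_mismatches
    if diff >= min_margin:
        return "cymt"
    if -diff >= min_margin:
        return "numt"
    return "ambiguous"


def partition_reads(hits: Iterable[ReadHit], min_margin: int = 1) -> Dict[str, List[str]]:
    """Split read IDs into cymt, numt, ambiguous, and unaligned bins."""
    bins: Dict[str, List[str]] = {"cymt": [], "numt": [], "ambiguous": [], "unaligned": []}
    for hit in hits:
        bins[classify_read(hit, min_margin)].append(hit.read_id)
    return bins


if __name__ == "__main__":
    example = [
        ReadHit("r1", 0, 3),   # closer to cymt
        ReadHit("r2", 4, 1),   # closer to numt
        ReadHit("r3", 2, 2),   # tie
    ]
    for label, ids in partition_reads(example).items():
        print(label, ids)
```

In practice the mismatch counts would come from alignment or BLAST output against the two references. Keeping tied reads in a separate bin, rather than forcing a call, is one conservative option when the goal is to avoid discarding true cymt reads.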

Created:

2022-12-06