Supporting data for "De Novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads"

Mendeley Data · Updated 2024-06-25 · Indexed 2024-06-29

Download link:

http://gigadb.org/dataset/100311

Description:

Reference-quality genomes provide a resource for studying gene structure, function, and evolution. However, genes of interest are often not completely or accurately assembled, leading to unknown errors in analyses or additional cloning efforts for the correct sequences. A promising solution is long-read sequencing. Here we tested PacBio-based long-read sequencing and diploid assembly for potential improvements to the Sanger-based intermediate-read zebra finch reference and Illumina-based short-read Anna’s hummingbird reference, two vocal learning avian species widely studied in neuroscience and genomics. With DNA of the same individuals used to generate the reference genomes, we generated diploid assemblies with the FALCON-Unzip assembler, resulting in contigs with no gaps in the megabase range, representing 150-fold and 200-fold improvements over the current zebra finch and hummingbird references, respectively. These long-read and phased assemblies corrected and resolved what we discovered to be numerous misassemblies in the references, including missing sequences in gaps, erroneous sequences flanking gaps, base call errors in difficult-to-sequence regions, complex repeat structure errors, and allelic differences between the two haplotypes. These improvements were validated by single long genome and transcriptome reads, and resulted for the first time in completely resolved protein-coding genes widely studied in neuroscience and specialized in vocal learning species. These findings demonstrate the impact of long reads, sequencing of previously difficult-to-sequence regions, and phasing of haplotypes on generating high-quality assemblies necessary for understanding gene structure, function, and evolution.

Created:

2023-06-28