Improving Illumina assemblies with Hi-C and long reads: an example with the North African dromedary

NIAID Data Ecosystem, indexed 2026-03-11

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.6rp36b6

Description:

Researchers have assembled thousands of eukaryotic genomes using Illumina reads, but traditional mate-pair libraries cannot span all repetitive elements, resulting in highly fragmented assemblies. Chromosome conformation capture techniques, such as Hi-C and Dovetail Genomics Chicago libraries, and long-read sequencing, such as Pacific Biosciences and Oxford Nanopore, both help span and resolve repetitive regions and therefore improve genome assemblies. One important livestock species of arid regions that lacks a high-quality, contiguous reference genome is the dromedary (Camelus dromedarius). Draft genomes exist but are highly fragmented, and a high-quality reference genome is needed to understand adaptation to desert environments and artificial selection during domestication. Dromedaries are among the last livestock species to have been domesticated, and together with wild and domestic Bactrian camels they are the only representatives of the Camelini tribe, which highlights their evolutionary significance. Here we describe our efforts to improve the North African dromedary genome. We used Chicago and Hi-C sequencing libraries from Dovetail Genomics to resolve the order of previously assembled contigs, producing near-chromosome-level scaffolds. Remaining gaps were filled with Pacific Biosciences long reads, and the scaffolds were then comparatively mapped to chromosomes. Long reads added 99.32 Mbp to the total length of the new assembly. Dovetail Chicago and Hi-C libraries increased the longest scaffold over 12-fold, from 9.71 Mbp to 124.99 Mbp, and the scaffold N50 over 50-fold, from 1.48 Mbp to 75.02 Mbp. We demonstrate that Illumina de novo assemblies can be substantially upgraded by combining chromosome conformation capture and long-read sequencing.
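The description reports scaffold N50 and fold improvements without defining how they are computed. The following is a minimal Python sketch, not the authors' pipeline, that uses the common definition of N50: the length L such that scaffolds of length L or longer together hold at least half of the total assembly length. The function names `n50` and `fold_change` and the toy scaffold lengths are assumptions made for illustration. Only the four Mbp values in the last two lines are taken from the description.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold (or contig) lengths.

    Common definition: sort lengths from longest to shortest and return the
    length at which the running sum first reaches half of the total.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for "at least half"
            return length


def fold_change(before, after):
    """Ratio of a metric after improvement to its value before."""
    return after / before


# Toy scaffold lengths (hypothetical, NOT from the dromedary assembly).
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print("toy N50:", n50(toy_lengths))

# Values in Mbp as reported in the description above.
print("longest scaffold fold change:", round(fold_change(9.71, 124.99), 2))
print("scaffold N50 fold change:", round(fold_change(1.48, 75.02), 2))
```

With the reported values the ratios come out to roughly 12.87 and 50.69, consistent with the "over 12-fold" and "over 50-fold" statements. To apply `n50` to a real assembly, the scaffold lengths would first have to be read from the assembly FASTA file, which this sketch does not cover.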

Created:

2019-03-27