Hybrid Enterobacteriaceae assemblies using PacBio+Illumina or ONT+Illumina sequencing

Name: Hybrid Enterobacteriaceae assemblies using PacBio+Illumina or ONT+Illumina sequencing
Creator: figshare
Published: 2020-08-27 19:40:50
License: No description available

Source: DataCite Commons. Updated: 2020-08-27. Indexed: 2024-07-27.

Download link:

https://figshare.com/articles/Hybrid_Enterobacteriaceae_assemblies_using_PacBio_Illumina_or_ONT_Illumina_sequencing/7649051/3

Description:

Data associated with: De Maio, Shaw, et al. on behalf of the REHAB consortium (2019), Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes. bioRxiv 530824.

Illumina sequencing allows rapid, cheap and accurate whole-genome bacterial analyses, but short reads (<300 bp) do not usually enable complete genome assembly. Long-read sequencing greatly assists with resolving complex bacterial genomes, particularly when combined with short-read Illumina data (hybrid assembly). However, it is not clear how different long-read sequencing methods affect assembly accuracy. In this study, we compared hybrid assemblies for 20 bacterial isolates, including two reference strains, using Illumina sequencing and long reads from either the Oxford Nanopore Technologies (ONT) or the Pacific Biosciences (PacBio) SMRT sequencing platform.

This set of files includes all hybrid assemblies produced using Unicycler with different sequencing approaches and strategies. Each isolate has 8 hybrid assemblies = 4 x ONT-Illumina + 4 x PacBio-Illumina. There are a total of 158 hybrid assemblies from the full data, as two assemblies did not finish (8 x 20 - 2 = 160 - 2 = 158). Assemblies were produced from different long-read preparation strategies.

Hybrid assemblies with Unicycler (n1 = 158):
• Basic: no filtering or correction of reads (i.e. all available long reads were used for assembly).
• Corrected: long reads were error-corrected and subsampled (preferentially selecting the longest reads) to 30-40x coverage using Canu (v1.5, https://github.com/marbl/canu) with default options.
• Filtered: long reads were filtered using Filtlong (v0.1.1, https://github.com/rrwick/Filtlong), using Illumina reads as an external reference for read quality and either removing the worst 10% of reads or retaining 500 Mbp in total, whichever resulted in fewer reads. We also removed reads shorter than 1 kb and used the --trim and --split 250 options.
• Subsampled: we randomly subsampled long reads to leave approximately 600 Mbp (corresponding to a long-read coverage of around 100x).

Long-read only assemblies (n2 = 20 x 2 x 2 = 80):
• Flye: we ran Flye (https://github.com/fenderglass/Flye) with the options --plasmids --meta, which have been shown to improve the assembly of plasmids in bacterial genomes (see: https://github.com/rrwick/Long-read-assembler-comparison).
• Pilon: the Flye assemblies were then polished with Illumina short reads using Pilon (https://github.com/broadinstitute/pilon).

Assembly file names have the following format: ${sample-name}_${preparation-strategy}_${long-read-sequencing}.fasta
For example, for sample CFT073 the filtered PacBio-Illumina assembly is: CFT073_filtered_pacbio.fasta

Also included are assemblies produced after subsampling long-read data to ~10X genome coverage for the following strategies: "basic" (hybrid) and long-read only ("flye" and "pilon"). There are n3 = 20 x 3 x 2 = 120 of these assemblies. Their file names have a '10X' preceding the preparation strategy.

The total number of assemblies is n1 + n2 + n3 = 158 + 80 + 120 = 358.

Also included are a PDF of supplementary figures and an Excel spreadsheet of supplementary tables. See the associated preprint for more details: https://doi.org/10.1101/530824 and the published article in Microbial Genomics (currently in press).
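
The file-naming scheme and the assembly counts above can be checked with a short script. The following is a minimal Python sketch, not part of the dataset: the function and class names are hypothetical, and several labels are assumptions. Only "filtered" and "pacbio" appear in the example file name, and "basic", "flye" and "pilon" are quoted in the description; the lowercase labels "corrected", "subsampled" and "ont", and the exact way '10X' is joined to the strategy, are guesses.

```python
import re
from typing import NamedTuple, Optional

# Strategy and platform labels. "filtered" and "pacbio" come from the example
# file name; "basic", "flye" and "pilon" are quoted in the description.
# "corrected", "subsampled" and "ont" are ASSUMED spellings.
HYBRID_STRATEGIES = ("basic", "corrected", "filtered", "subsampled")
LONG_READ_STRATEGIES = ("flye", "pilon")
PLATFORMS = ("ont", "pacbio")


class AssemblyName(NamedTuple):
    sample: str
    low_coverage: bool  # True when the name carries the '10X' marker
    strategy: str
    platform: str


# ASSUMPTION: '10X' sits directly before the strategy, with or without an
# underscore in between (e.g. "10Xbasic" or "10X_basic").
_NAME_RE = re.compile(
    r"^(?P<sample>.+?)_"
    r"(?:(?P<tenx>10X)_?)?"
    r"(?P<strategy>" + "|".join(HYBRID_STRATEGIES + LONG_READ_STRATEGIES) + r")_"
    r"(?P<platform>" + "|".join(PLATFORMS) + r")\.fasta$"
)


def parse_assembly_name(filename: str) -> Optional[AssemblyName]:
    """Split an assembly file name into its parts, or return None if it does not fit."""
    match = _NAME_RE.match(filename)
    if match is None:
        return None
    return AssemblyName(
        sample=match["sample"],
        low_coverage=match["tenx"] is not None,
        strategy=match["strategy"],
        platform=match["platform"],
    )


def expected_counts(isolates: int = 20, failed_hybrid: int = 2) -> dict:
    """Reproduce the n1, n2, n3 arithmetic given in the description."""
    n1 = isolates * len(HYBRID_STRATEGIES) * len(PLATFORMS) - failed_hybrid
    n2 = isolates * len(LONG_READ_STRATEGIES) * len(PLATFORMS)
    n3 = isolates * 3 * len(PLATFORMS)  # basic, flye and pilon at ~10X coverage
    return {"hybrid": n1, "long_read_only": n2, "10X": n3, "total": n1 + n2 + n3}


if __name__ == "__main__":
    print(parse_assembly_name("CFT073_filtered_pacbio.fasta"))
    print(expected_counts())
```

Comparing the number of parsed file names in a downloaded copy against `expected_counts()["total"]` is one way to confirm that no assemblies are missing; any file that returns `None` would point to a label spelling that differs from the assumptions above.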

Provider:

figshare

Created:

2019-01-30
