Data from polishCLR: Example input genome assemblies

Name: Data from polishCLR: Example input genome assemblies
Creator: Ag Data Commons
Published: 2024-02-13 00:00:00
License: No description available

Source: agdatacommons.nal.usda.gov; updated 2024-02-13; indexed 2025-03-23

Download link:

https://agdatacommons.nal.usda.gov/articles/dataset/Data_from_polishCLR_Example_input_genome_assemblies/24667776/1


Resource description:

[NOTE: Data files added 2022-11-01:
Test long reads: test.1.filtered.bam_.gz
Test short reads R1: testpolish_R1.fastq
Test short reads R2: testpolish_R2.fastq
Chromosome 30 of H. zea: GCF_022581195.2_ilHelZeax1.1_chr30.fasta]

To produce the best possible de novo, chromosome-scale genome assembly from error-prone Pacific Biosciences (PacBio) continuous long reads (CLR), we developed polishCLR, a publicly available, flexible, and reproducible workflow that is containerized so it can be run on any conventional HPC. This dataset provides example input primary contig assemblies to test and reproduce the demonstrated utility of our workflow.

The polishCLR workflow can be initiated from three input cases:
Case 1: An unresolved primary assembly with associated contigs, the output of FALCON 2-asm: p_ctg.fasta and a_ctg.fasta
Case 2: A haplotype-resolved but unpolished set, the output of FALCON-Unzip 3-unzip: all_p_ctg.fasta and all_h_ctg.fasta
Case 3: A haplotype-resolved, CLR long-read, Arrow-polished set of primary and alternate contigs, the output of FALCON-Unzip 4-polish: cns_p_ctg.fasta and cns_h_ctg.fasta

These example data are the input contig assemblies for the pest Helicoverpa zea. The contigs were built from 49.89 Gb of raw PacBio CLR data generated from a single male of the H. zea strain HzStark_Cry1AcR. Adult H. zea were collected near the USDA-ARS Genetics and Sustainability Agricultural Research Unit, Starkville, MS, USA in 2011, and transported to and maintained in a colony at the USDA Southern Insect Management Research Unit (SIMRU), Stoneville, MS, USA as described previously. Larvae were selected on a diagnostic dose of 2.0 μg ml⁻¹ purified Cry1Ac, and survivors were used to create the strain HzStark_Cry1AcR. HzStark_Cry1AcR was back-crossed every 5 generations to a susceptible line maintained at USDA-ARS SIMRU.

A single male pupa (homogametic, ZZ sex chromosomes) from HzStark_Cry1AcR was dissected laterally into eight ~20 μg sections, and high molecular weight DNA was extracted. PacBio libraries were generated from unsheared DNA using a SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA), and 20-hour movies were generated on a single SMRT Cell 1M v3 using the Sequel I system (Pacific Biosciences). The raw CLR subread BAM files were converted to FASTQ format using bamtools v. 2.5.1 (Barnett et al. 2011), then used as input for the FALCON assembler (Chin et al. 2016) using the pb-assembly conda environment v. 0.0.8.1 (Pacific Biosciences; default parameters). FALCON-Unzip created primary and alternate contigs with one round of haplotype-aware polishing by Arrow (Pacific Biosciences).

Resources in this dataset:
Resource Title: Associated assembly contigs output from FALCON/2-asm-falcon. File Name: a_ctg_all.fasta
Resource Title: Primary assembly contigs output from FALCON/2-asm-falcon. File Name: p_ctg.fasta
Resource Title: Alternate haplotype assembly contigs output from FALCON Unzip 3-unzip. File Name: all_h_ctg.fasta
Resource Title: Primary assembly contigs output from FALCON Unzip 3-unzip. File Name: all_p_ctg.fasta
Resource Title: Alternate assembly contigs output from FALCON Unzip 4-polish. File Name: cns_h_ctg.fasta
Resource Title: Primary assembly contigs output from FALCON Unzip 4-polish. File Name: cns_pctg.fasta
Resource Title: Test long reads. File Name: test.1.filtered.bam.gz. Resource Description: For testing the pipeline, long reads that map to H. zea chromosome 30
Resource Title: Test short reads R1. File Name: testpolish_R1.fastq. Resource Description: Short reads aligned to Chromosome 30 of H. zea
Resource Title: Test short reads R2. File Name: testpolish_R2.fastq. Resource Description: Reverse pair (R2) short reads aligned to Chromosome 30 of H. zea
Resource Title: Chromosome 30 of H. zea. File Name: GCF_022581195.2_ilHelZeax1.1_chr30.fasta
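
The three input cases can be told apart by which pair of contig files is present. The following is a minimal sketch, not part of polishCLR itself: the helper name detect_input_cases and the lookup table are hypothetical, and the file names are those given in the case descriptions (the resource list spells two of them differently, as a_ctg_all.fasta and cns_pctg.fasta, so downloaded files may need renaming or the table adjusting).

```python
from pathlib import Path

# File-name pairs copied from the three input cases in the description.
# This table is an illustrative assumption, not polishCLR's own logic.
INPUT_CASES = {
    1: ("p_ctg.fasta", "a_ctg.fasta"),          # FALCON 2-asm
    2: ("all_p_ctg.fasta", "all_h_ctg.fasta"),  # FALCON-Unzip 3-unzip
    3: ("cns_p_ctg.fasta", "cns_h_ctg.fasta"),  # FALCON-Unzip 4-polish
}


def detect_input_cases(directory):
    """Return the sorted case numbers whose two files both exist in `directory`."""
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    return sorted(
        case
        for case, files in INPUT_CASES.items()
        if all(name in present for name in files)
    )
```

Calling detect_input_cases on a download directory returns an empty list when no complete pair is found, which is a quick check before launching the workflow.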


Provider:

Ag Data Commons
