Novel Megaptera novaeangliae (Humpback whale) haplotype reference genome

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.dv41ns271

Resource description:

This dataset presents the first chromosome-level reference genome for Megaptera novaeangliae (Humpback whale), sequenced from a kidney sample (KW2013002) taken from a stranded calf. The calf, a 457 cm and 2,500 lbs male, was found stranded in Hawai’i Kai, HI, in 2013 and was marked as abandoned/orphaned. In 2023, 1 g of kidney was processed for PacBio long-read DNA sequencing, chromatin conformation capture (Hi-C), RNA sequencing, and mitochondrial sequencing to characterize the genome and transcriptome of M. novaeangliae. The reference genome was compared to the preexisting M. novaeangliae scaffold-level assembly to determine assembly improvements.

Data validation includes a synteny analysis, mitochondrial annotation, and a comparison of BUSCO scores (scaffold v. reference genome, and Balaenoptera musculus (Blue whale) v. M. novaeangliae). BUSCO analysis was run on the preexisting M. novaeangliae scaffold-level assembly and on the reference genome to assess genomic completeness: the scaffold-level assembly scored 91.2%, versus 95.4% for the reference genome (Table I). Synteny analysis was performed against the B. musculus genome to determine chromosome-level coverage and structure. Further, a time-based phylogenetic tree was constructed using the sequenced data and publicly available genomes.

This dataset also contains the results of de novo repeat identification and gene annotation for the Humpback whale (Megaptera novaeangliae) genome. Repeat families were identified and classified using RepeatModeler, and gene prediction was conducted using AUGUSTUS and SNAP, incorporating coding sequences from related cetaceans. The resulting gene models were further refined using the MAKER pipeline, with protein evidence from Swiss-Prot and related species. tRNA genes were identified with tRNAscan-SE.
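As a small worked example of the BUSCO comparison above, the sketch below expresses the difference between the two quoted scores in percentage points and as a relative reduction in the share of BUSCOs not recovered as complete. The function and variable names are illustrative and do not come from the dataset; only the two percentages are taken from the description.

```python
# BUSCO completeness scores quoted in the description (percent complete).
# The dictionary keys are illustrative labels, not names used in the dataset.
busco_complete = {
    "scaffold_assembly": 91.2,  # preexisting scaffold-level assembly
    "reference_genome": 95.4,   # chromosome-level reference genome
}


def completeness_gain(before: float, after: float) -> float:
    """Gain in BUSCO completeness, in percentage points."""
    return round(after - before, 1)


def incomplete_reduction(before: float, after: float) -> float:
    """Relative drop (percent) in the share of BUSCOs not scored as complete."""
    not_complete_before = 100.0 - before
    not_complete_after = 100.0 - after
    return round(100.0 * (not_complete_before - not_complete_after) / not_complete_before, 1)


gain = completeness_gain(busco_complete["scaffold_assembly"], busco_complete["reference_genome"])
drop = incomplete_reduction(busco_complete["scaffold_assembly"], busco_complete["reference_genome"])
print(f"Completeness gain: {gain} percentage points")   # 4.2
print(f"Reduction in non-complete BUSCOs: {drop}%")     # 47.7
```

Read this way, the 4.2-point gain corresponds to roughly halving the fraction of benchmark genes that were fragmented or missing.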
The dataset includes the transcript sequences (GIU3625_Humpback_whale.transcript.fasta.gz), the annotation file (GIU3625_Humpback_whale.annotation.gff.gz), and a methods file (methods.txt) detailing the bioinformatic processes.

Methods

Sample Information

A kidney sample (KW2013002) was collected from a M. novaeangliae calf on January 15, 2013, in Hawai’i Kai, HI, and deposited at the National Institute of Standards and Technology (NIST). The sample was not collected by the authors, so information regarding collection is limited to that presented herein. The calf, a 457 cm and 2,500 lbs male at the time of necropsy, was first observed on January 14, 2013, in shallow water and died between January 14 and January 15, 2013, as a result of stranding. The calf was marked as abandoned/orphaned. In 2023, 1 g of KW2013002 was sampled for sequencing by Cantata Bio.

PacBio long-read DNA sequencing

DNA samples were quantified using the Qubit 2.0 Fluorometer. The PacBio SMRTbell library, with a target insert size of approximately 20 kb, was constructed using the SMRTbell Express Template Prep Kit 2.0 following the manufacturer's recommended protocol and default settings. The library was then bound to polymerase using the Sequel II Binding Kit 2.0 (PacBio) and loaded onto the PacBio Sequel II system. Sequencing was performed on PacBio Sequel II 8M SMRT cells.

Quality control of the extracted DNA was performed using NanoDrop and gel electrophoresis. Quality control of the Omni-C library was done against the Hifiasm draft assembly and showed a high proportion of long-range linkage reads. The Omni-C sequencing data were also checked for the percentage of bases at Q30, and the quality scores met the Illumina standard. The scaffolding algorithm HiRise also has a built-in quality filter that uses only reads with a mapping quality score above 40.
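The mapping-quality filter mentioned above can be illustrated with a minimal sketch that keeps only SAM records whose MAPQ field (column 5 in the SAM format) exceeds a threshold. This is not HiRise's internal code; the strict "greater than 40" comparison, the function name, and the example records are assumptions made for illustration.

```python
from typing import Iterable, Iterator

MIN_MAPQ = 40  # threshold quoted in the methods; strict ">" is an assumption


def filter_by_mapq(sam_lines: Iterable[str], min_mapq: int = MIN_MAPQ) -> Iterator[str]:
    """Yield SAM header lines unchanged and alignment lines with MAPQ > min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines carry no MAPQ and are passed through
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record (SAM has 11 mandatory fields)
        mapq = int(fields[4])  # column 5 of a SAM record is MAPQ
        if mapq > min_mapq:
            yield line


# Hypothetical records, for illustration only.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t99\tscaffold_1\t100\t60\t10M\t=\t5000\t4910\tACGTACGTAC\tIIIIIIIIII",
    "read2\t99\tscaffold_1\t200\t12\t10M\t=\t900\t710\tACGTACGTAC\tIIIIIIIIII",
    "read3\t99\tscaffold_2\t300\t40\t10M\t=\t7000\t6710\tACGTACGTAC\tIIIIIIIIII",
]
kept = list(filter_by_mapq(example))
print(len(kept))  # header plus read1; read2 (MAPQ 12) and read3 (MAPQ 40) are dropped
```

A high threshold of this kind discards reads that map ambiguously, for example within repeats, which matters for scaffolding because a misplaced read pair can suggest a false long-range link.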
For the Omni-C library, chromatin was fixed in situ within the nucleus using formaldehyde, followed by digestion with DNase I. The processed chromatin had its ends repaired and was then ligated to a biotinylated bridge adapter, facilitating proximity ligation of adapter-containing ends. After proximity ligation, the crosslinks were reversed and the DNA was purified. The purified DNA was then treated to remove any biotin that was not internal to ligated fragments. The sequencing libraries were prepared using NEBNext Ultra enzymes and Illumina-compatible adapters, with biotin-containing fragments isolated using streptavidin beads before PCR enrichment. Sequencing was performed on an Illumina HiSeqX platform to achieve approximately 30x coverage.

Contig assembly and scaffolding

The de novo assembly used PacBio CCS reads and Omni-C reads as input for HiC-Hifiasm with default parameters. This produced a separate de novo assembly for each haplotype. For scaffolding, the de novo assembly was combined with the Dovetail Omni-C library reads in HiRise, a software pipeline designed for scaffolding genome assemblies using proximity ligation data. Omni-C library sequences were aligned to the draft assembly using bwa, and the mapped read pairs were analyzed by HiRise to construct a likelihood model for genomic distance (see Figure S1). This model, along with additional information from the synteny analysis (see below), was used to identify and correct misjoins, to score potential joins, and to make joins exceeding a defined confidence threshold.

Synteny analysis

The newly assembled M. novaeangliae scaffolds were mapped to the B. musculus whole genome (GenBank GCA_009873245.3) to assess synteny between the two species [9,10]. The synteny analysis was performed using JupiterPlot 1.0 [11], a software tool that uses Circos-based consistency plots to map a given set of scaffolds against a reference genome.

RNA sequencing

Total RNA was extracted using the QIAGEN RNeasy Plus Kit, following the manufacturer's instructions. RNA was quantified using the Qubit RNA Assay and the TapeStation 4200 system. Before library preparation, DNase treatment was applied, followed by AMPure bead cleanup and rRNA depletion using QIAGEN FastSelect -HMR. The NEBNext Ultra II RNA Library Prep Kit was used for library preparation per the manufacturer's protocols. The prepared libraries were sequenced on the NovaSeq 6000 platform in a 2 x 150 bp configuration.

Repeat Analysis

This dataset was derived from a Humpback whale (Megaptera novaeangliae) genome assembly. The repeat families found in the genome were identified de novo using RepeatModeler (v2.0.1), which relies on RECON (v1.08) and RepeatScout (v1.0.6). The custom repeat library generated by RepeatModeler was then used to discover, identify, and mask the repeats in the assembly using RepeatMasker (v4.1.0).

Gene prediction was performed using the AUGUSTUS software (v2.5.5) with six rounds of optimization. Coding sequences from related cetacean species, including Balaenoptera acutorostrata, Balaenoptera musculus, Balaenoptera ricei, Megaptera novaeangliae, and Orcinus orca, were used to train the ab initio models for gene prediction. Additionally, the SNAP software (v2006-07-28) was trained using the same coding sequences to build a separate gene prediction model. RNA-seq reads were mapped to the genome using the STAR aligner (v2.7), and intron hints were generated using the bam2hints tool within AUGUSTUS.
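To make the idea of intron hints concrete, the sketch below derives intron intervals from spliced alignments by reading the `N` (skipped region) operations in a SAM CIGAR string and counting how many reads support each interval. It is a simplified illustration of the principle, not the bam2hints implementation; the function names and example alignments are hypothetical, and real hint files carry additional fields that are not modelled here.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance the reference position


def introns_from_alignment(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) intervals for each N operation."""
    introns = []
    ref_pos = pos  # 1-based leftmost reference position of the alignment
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "N":
            introns.append((ref_pos, ref_pos + length - 1))
        if op in REF_CONSUMING:
            ref_pos += length
    return introns


def count_intron_support(alignments: Iterable[Tuple[str, int, str]]) -> Counter:
    """Count supporting alignments per (sequence, start, end) intron."""
    support = Counter()
    for seq_name, pos, cigar in alignments:
        for start, end in introns_from_alignment(pos, cigar):
            support[(seq_name, start, end)] += 1
    return support


# Hypothetical alignments: (reference sequence, POS, CIGAR).
alignments = [
    ("scaffold_1", 100, "50M200N50M"),
    ("scaffold_1", 120, "30M200N70M"),
    ("scaffold_1", 1000, "5S20M100N30M"),
    ("scaffold_2", 10, "100M"),
]
support = count_intron_support(alignments)
for (seq_name, start, end), n_reads in sorted(support.items()):
    print(seq_name, start, end, n_reads)
```

Introns supported by several independent reads are stronger evidence for a splice site than those seen once, which is why the support count is tracked alongside the coordinates.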
MAKER was then used to integrate the predictions from AUGUSTUS and SNAP, combining this information with peptide evidence from the UniProt database and protein sequences from related cetacean species. Only gene models predicted by both AUGUSTUS and SNAP were retained in the final dataset. Annotation Edit Distance (AED) scores were generated for each predicted gene as part of the MAKER pipeline to assess the accuracy of the predictions. Finally, tRNA genes were identified using the tRNAscan-SE software (v2.05).

Acknowledgments

The specimens used in this study were collected by Kristi West, University of Hawaii, and provided by the National Marine Mammal Tissue Bank (NMMTB), which is maintained by the National Institute of Standards and Technology (NIST) at the NIST Biorepository, Hollings Marine Laboratory, Charleston, SC. The NMMTB is operated under the direction of the National Oceanic and Atmospheric Administration/National Marine Fisheries Service (NOAA Fisheries) with the collaboration of the U.S. Geological Survey, U.S. Fish and Wildlife Service, the (former) Minerals Management Service, and NIST, through the Marine Mammal Health and Stranding Response Program.
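Returning to the annotation file listed in the dataset description (GIU3625_Humpback_whale.annotation.gff.gz), the sketch below shows one way the AED scores described in the methods might be summarized. It assumes the file follows the usual MAKER GFF3 convention of storing the score in an `_AED` attribute on `mRNA` features, which has not been verified for this dataset. The function names, the example records, and the 0.5 cutoff are illustrative assumptions. AED ranges from 0 (model fully consistent with the evidence) to 1 (no supporting evidence).

```python
import gzip
from typing import Iterable, Iterator, List, Tuple


def iter_mrna_aed(gff_lines: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Yield (feature ID, AED) for mRNA features that carry an _AED attribute."""
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "mRNA":
            continue  # GFF3 records have nine tab-separated columns
        attrs = dict(item.split("=", 1) for item in fields[8].split(";") if "=" in item)
        if "_AED" in attrs:  # assumed MAKER-style attribute name
            yield attrs.get("ID", ""), float(attrs["_AED"])


def fraction_at_or_below(scores: List[float], cutoff: float = 0.5) -> float:
    """Fraction of scores at or below the cutoff (0.0 for an empty list)."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s <= cutoff) / len(scores)


def read_aed_scores(path: str) -> List[float]:
    """Read AED scores from a gzip-compressed GFF3 file."""
    with gzip.open(path, "rt") as handle:
        return [aed for _, aed in iter_mrna_aed(handle)]


# Hypothetical records, for illustration only.
example = [
    "##gff-version 3",
    "scaffold_1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "scaffold_1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene1-mRNA-1;Parent=gene1;_AED=0.10",
    "scaffold_1\tmaker\tmRNA\t2000\t3500\t.\t-\t.\tID=gene2-mRNA-1;Parent=gene2;_AED=0.75",
]
scores = [aed for _, aed in iter_mrna_aed(example)]
print(scores, fraction_at_or_below(scores))
```

With the real file, `read_aed_scores("GIU3625_Humpback_whale.annotation.gff.gz")` would return the list of scores, provided the attribute naming assumption holds.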

Created:

2025-03-03
