Simulated Nanopore metagenomic reads

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://zenodo.org/record/8339788

Resource description:

We simulated metagenomic Nanopore sequencing reads at a mixture ratio that approximates the one found in patient sputa, albeit with a slightly higher mycobacterial component. In total, 4.5 gigabases were generated in the following proportions: 46% each for bacteria and human, 6% Mycobacterium tuberculosis complex (MTBC), and 1% each for virus and non-tuberculous mycobacteria (NTM). The reference genomes from which reads were simulated for these groups were gathered as follows. The references for the virus group were obtained using Kraken's (v2.1.2) --download-library functionality; the viral library was downloaded on June 15, 2023. The human genome from which the reads were simulated was KOREF_S1v2.1 (GenBank assembly accession GCA_020497085.1), with contigs shorter than 10 kbp removed. The bacterial references were obtained by first downloading the bacteria library through Kraken and then subsampling it because of the size (166Gb) of the resulting FASTA file. We subsampled the file by first removing sequences shorter than 50 kbp. We then extracted each sequence into its own FASTA file under a directory named for its genus, excluding the genus Mycobacterium. Genera were randomly subsampled to contain a maximum of 1000 assemblies. Each genus was then reduced to a representative subset using Assembly Dereplicator (commit 2dfcb14; https://github.com/rrwick/Assembly-Dereplicator), keeping only 10% of the assemblies for each genus (-f 0.1). The NTM references selected were M. abscessus (accession GCF_017190695.1), M. avium (GCF_020735285.1), M. kansasii (GCA_014701265.1), M. ulcerans (GCF_000013925.1), M. intracellulare (GCF_016756075.1), M. terrae (GCF_010727125.1), and M. fortuitum (GCF_001307545.1). The MTBC reference is a lineage 1 assembly (GCF_932530395.1). We used Badread (v0.4.0) to produce the simulated Nanopore reads for each group, specifying the number of bases according to the proportions given above.
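As a rough illustration of how the stated proportions translate into a per-group base budget, the following sketch splits the 4.5 gigabase total by the percentages given above. The function and variable names are hypothetical and are not taken from the authors' pipeline; only the total and the percentages come from the description.

```python
# Total yield and per-group percentages as stated in the description.
TOTAL_BASES = 4_500_000_000  # 4.5 gigabases

PROPORTIONS_PCT = {
    "bacteria": 46,
    "human": 46,
    "MTBC": 6,
    "virus": 1,
    "NTM": 1,
}


def bases_per_group(total_bases, proportions_pct):
    """Return the number of bases to simulate for each group.

    Hypothetical helper: uses integer arithmetic so the per-group
    targets are exact whole numbers of bases.
    """
    if sum(proportions_pct.values()) != 100:
        raise ValueError("proportions must sum to 100%")
    return {
        group: total_bases * pct // 100
        for group, pct in proportions_pct.items()
    }


targets = bases_per_group(TOTAL_BASES, PROPORTIONS_PCT)
for group, n_bases in targets.items():
    print(f"{group}: {n_bases:,} bases")
```

Under these numbers, bacteria and human each receive 2.07 gigabases, MTBC receives 270 megabases, and virus and NTM receive 45 megabases each.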
For all groups we specified no junk or random reads and 0.5% chimeric reads. In addition, for the MTBC, virus, and NTM groups we used a non-default length option, --length 4000,3000, to produce reads with a mean length of 4000bp and a standard deviation of 3000bp. Defaults were used for all other options (the default error model is trained on real R10.4.1 Nanopore reads from 2023). Finally, we filtered the simulated Nanopore reads to remove any read shorter than 500bp or containing an ambiguous (non-ACGT) nucleotide.
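The final filtering step can be sketched as below. This is a minimal illustration, not the authors' actual filter: it assumes uncompressed four-line FASTQ records with uppercase bases, and the helper names (`keep_read`, `read_fastq`, `filter_fastq`) are hypothetical.

```python
import io

MIN_LENGTH = 500            # reads shorter than this are dropped
VALID_BASES = set("ACGT")   # anything else counts as ambiguous


def keep_read(seq, min_length=MIN_LENGTH):
    """True if the read is long enough and contains only A, C, G, T."""
    return len(seq) >= min_length and set(seq) <= VALID_BASES


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def filter_fastq(handle):
    """Yield only the records that pass keep_read."""
    for header, seq, qual in read_fastq(handle):
        if keep_read(seq):
            yield header, seq, qual


# Small in-memory example: one passing read, one too short, one with an N.
good = "ACGT" * 125             # 500 bp, unambiguous
short = "ACGT" * 10             # 40 bp
ambiguous = "ACGT" * 125 + "N"  # 501 bp, contains N
example = "".join(
    f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n"
    for name, seq in [("good", good), ("short", short), ("ambig", ambiguous)]
)
kept = [header for header, _, _ in filter_fastq(io.StringIO(example))]
print(kept)
```

Note that the length threshold is applied as "at least 500bp", matching the description's removal of reads with a length <500bp.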

Created:

2024-02-15
