Simulated Illumina metagenomic reads
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/8339790
Description:
We simulated metagenomic Illumina sequencing reads at a mixture ratio that approximates the one found in patient sputa, albeit with a slightly higher mycobacterial component. In total, 0.9 gigabases were generated, in the following proportions: 46% each for bacteria and human, 6% Mycobacterium tuberculosis complex (MTBC), and 1% each for virus and non-tuberculous mycobacteria (NTM).
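As a rough illustration of what these proportions mean in absolute terms, the sketch below converts the stated percentages into per-group base targets. It assumes that 0.9 gigabases means exactly 900,000,000 bases and that the targets apply before any read filtering; the function names are illustrative and not part of the original workflow.

```python
# Percentages as stated in the description.
# ASSUMPTION: "0.9 gigabases" is taken to mean exactly 900,000,000 bases.
TOTAL_BASES = 900_000_000
PROPORTIONS_PCT = {
    "bacteria": 46,
    "human": 46,
    "MTBC": 6,
    "virus": 1,
    "NTM": 1,
}


def bases_per_group(total_bases, proportions_pct):
    """Split a total base count across groups according to integer percentages."""
    if sum(proportions_pct.values()) != 100:
        raise ValueError("proportions must sum to 100%")
    return {group: total_bases * pct // 100 for group, pct in proportions_pct.items()}


def read_pairs_for(bases, read_length=150):
    """Number of read pairs needed to cover `bases` with paired reads of `read_length`."""
    return bases // (2 * read_length)


targets = bases_per_group(TOTAL_BASES, PROPORTIONS_PCT)
for group, bases in targets.items():
    print(f"{group}: {bases} bases, about {read_pairs_for(bases)} read pairs")
```

The read-pair estimate uses the 150 bp paired reads described for this dataset and ignores any reads discarded after simulation.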
The reference genomes from which reads were simulated were gathered as follows for each group. The viral references were obtained using Kraken's (v2.1.2) --download-library functionality; the viral library was downloaded on 15 June 2023. The human reads were simulated from KOREF_S1v2.1 (GenBank assembly accession GCA_020497085.1), with contigs shorter than 10 kbp removed. The bacterial references were obtained by first downloading the bacteria library through Kraken and then subsampling it because of the size (166Gb) of the resulting FASTA file. We subsampled the file by first removing sequences shorter than 50 kbp. We then extracted each sequence into its own FASTA file under a directory named for the genus of the sequence, excluding the genus Mycobacterium. Genera were randomly subsampled to contain at most 1000 assemblies. Each genus was then reduced to a representative subset using Assembly Dereplicator (commit 2dfcb14; https://github.com/rrwick/Assembly-Dereplicator), keeping only 10% of the assemblies for each genus (-f 0.1). The NTM references selected were M. abscessus (accession GCF_017190695.1), M. avium (GCF_020735285.1), M. kansasii (GCA_014701265.1), M. ulcerans (GCF_000013925.1), M. intracellulare (GCF_016756075.1), M. terrae (GCF_010727125.1), and M. fortuitum (GCF_001307545.1). The MTBC reference is a lineage 1 assembly (GCF_932530395.1).
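The length filter, genus grouping, and random subsampling of the bacterial library can be sketched as follows. This is a minimal illustration and not the authors' script: it assumes the genus is the second whitespace-separated token of each FASTA header, keeps records in memory instead of writing one file per sequence, and leaves out the Assembly Dereplicator step. All function names are hypothetical.

```python
import random
from collections import defaultdict

MIN_LENGTH = 50_000                  # sequences shorter than 50 kbp are dropped
MAX_PER_GENUS = 1000                 # cap on sequences kept per genus
EXCLUDED_GENERA = {"Mycobacterium"}  # handled separately as MTBC / NTM


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def genus_of(header):
    """ASSUMPTION: headers look like '<id> <Genus> <species> ...'."""
    tokens = header.split()
    return tokens[1] if len(tokens) > 1 else "unknown"


def group_by_genus(records, min_length=MIN_LENGTH):
    """Drop short sequences and excluded genera, then group the rest by genus."""
    by_genus = defaultdict(list)
    for header, seq in records:
        if len(seq) < min_length:
            continue
        genus = genus_of(header)
        if genus in EXCLUDED_GENERA:
            continue
        by_genus[genus].append((header, seq))
    return dict(by_genus)


def subsample(by_genus, max_per_genus=MAX_PER_GENUS, seed=0):
    """Randomly cap each genus at `max_per_genus` records."""
    rng = random.Random(seed)
    return {
        genus: rng.sample(recs, max_per_genus) if len(recs) > max_per_genus else list(recs)
        for genus, recs in by_genus.items()
    }
```

A fixed seed is used here only to make the sketch reproducible; the description does not state how the random subsampling was seeded.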
Illumina reads were simulated with ART (v2016.06.05). We simulated paired-end reads using the MiSeq v3 profile (-ss MSv3) with a read length of 150 bp, a mean fragment length of 250 bp, and a fragment length standard deviation of 10 bp (-l 150 -m 250 -s 10).
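One way to assemble an ART invocation with these settings is sketched below. Only -ss, -l, -m, and -s come from the description; the paired-end switch (-p), fold coverage (-f), input (-i), and output prefix (-o) are standard art_illumina options assumed here, and the file names and coverage value are placeholders.

```python
import shlex


def fold_coverage(target_bases, reference_length):
    """Fold coverage needed to generate roughly `target_bases` from one reference."""
    return target_bases / reference_length


def art_command(reference_fasta, out_prefix, coverage):
    """Build an art_illumina argument list for paired-end MiSeq v3 reads."""
    return [
        "art_illumina",
        "-ss", "MSv3",          # sequencing system profile (from the description)
        "-p",                   # ASSUMPTION: paired-end mode switch
        "-l", "150",            # read length (from the description)
        "-m", "250",            # mean fragment length (from the description)
        "-s", "10",             # fragment length standard deviation (from the description)
        "-f", str(coverage),    # ASSUMPTION: read amount set via fold coverage
        "-i", reference_fasta,  # ASSUMPTION: hypothetical input path
        "-o", out_prefix,       # ASSUMPTION: hypothetical output prefix
    ]


# Placeholder values for illustration only.
cmd = art_command("reference.fa", "simulated_", fold_coverage(1_000_000, 100_000))
print(shlex.join(cmd))
```

The list form can be passed to subprocess.run directly; the description does not say how the amount of reads per group was specified, so the fold coverage calculation is only one plausible approach.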
We removed any simulated Illumina read containing an ambiguous base.
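A filter of this kind could look like the sketch below. It assumes that an ambiguous base is any character other than A, C, G, or T, and that a read pair is dropped when either mate contains one; the description does not specify either point, and the function names are illustrative.

```python
UNAMBIGUOUS = set("ACGT")


def has_ambiguous_base(seq):
    """True if the sequence contains anything other than A, C, G, or T."""
    return any(base not in UNAMBIGUOUS for base in seq.upper())


def read_fastq(lines):
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    block = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == 4:
            yield block[0], block[1], block[3]
            block = []


def filter_pairs(r1_records, r2_records):
    """Yield only the read pairs in which neither mate has an ambiguous base.

    ASSUMPTION: both mates are discarded together so the files stay in sync.
    """
    for rec1, rec2 in zip(r1_records, r2_records):
        if has_ambiguous_base(rec1[1]) or has_ambiguous_base(rec2[1]):
            continue
        yield rec1, rec2
```

Discarding pairs together keeps the forward and reverse files aligned record by record, which most downstream tools expect.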
Created:
2024-02-15



