
Simulated Illumina metagenomic reads

NIAID Data Ecosystem, indexed 2026-05-01
Download link:
https://zenodo.org/record/8339790
Description:
We simulated metagenomic Illumina sequencing reads at a mixture ratio approximating that found in patient sputa, albeit with a slightly higher mycobacterial component. In total, 0.9 gigabases were generated, in the following proportions: 46% each for bacteria and human, 6% for Mycobacterium tuberculosis complex (MTBC), and 1% each for virus and non-tuberculous mycobacteria (NTM). The reference genomes from which the reads were simulated were gathered as follows.

The references for the virus group were obtained using the --download-library functionality of Kraken (v2.1.2); the viral library was downloaded on June 15, 2023. The human reads were simulated from KOREF_S1v2.1 (GenBank assembly accession GCA_020497085.1), with contigs shorter than 10 kbp removed.

The bacterial references were obtained by first downloading the bacteria library through Kraken and then subsampling it because of the size (166Gb) of the resulting FASTA file. We first removed sequences shorter than 50 kbp. We then extracted each remaining sequence into its own FASTA file under a directory named for the genus of the sequence, excluding the genus Mycobacterium. Genera were randomly subsampled to contain a maximum of 1000 assemblies. Each genus was then reduced to a representative subset using Assembly Dereplicator (commit 2dfcb14; https://github.com/rrwick/Assembly-Dereplicator), keeping only 10% of the assemblies for each genus (-f 0.1).

The NTM references selected were M. abscessus (GCF_017190695.1), M. avium (GCF_020735285.1), M. kansasii (GCA_014701265.1), M. ulcerans (GCF_000013925.1), M. intracellulare (GCF_016756075.1), M. terrae (GCF_010727125.1), and M. fortuitum (GCF_001307545.1). The MTBC reference is a lineage 1 assembly (GCF_932530395.1).

Illumina reads were simulated with ART (v2016.06.05).
We simulated paired reads from a MiSeq v3 system (-ss MSv3) with a read length of 150 bp, a mean fragment length of 250 bp, and a fragment length standard deviation of 10 bp (-l 150 -m 250 -s 10). We removed simulated Illumina reads containing any ambiguous base.
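As a quick check of the stated mixture, the per-group base budget can be worked out from the 0.9 gigabase total and the listed percentages. This is only an arithmetic sketch: the function and variable names are illustrative, and the description does not say whether the proportions apply before or after the ambiguous-base filtering.

```python
# Total simulated bases and per-group percentages, as stated in the description.
TOTAL_BASES = 900_000_000  # 0.9 gigabases

PERCENTAGES = {
    "bacteria": 46,
    "human": 46,
    "MTBC": 6,
    "virus": 1,
    "NTM": 1,
}


def bases_per_group(total_bases, percentages):
    """Split a total base count across groups using integer percentages."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {group: total_bases * pct // 100 for group, pct in percentages.items()}


budget = bases_per_group(TOTAL_BASES, PERCENTAGES)
for group, bases in budget.items():
    print(f"{group}: {bases:,} bases")
```

The percentages sum to 100, so the five budgets add back up to the full 0.9 gigabases.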
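The first stages of the bacterial subsampling (length filter, exclusion of Mycobacterium, and the cap of 1000 assemblies per genus) might look roughly like the sketch below. It is not the authors' code: the record layout `(seq_id, genus, length)`, the function name, and the fixed random seed are all assumptions made for illustration, and the later dereplication with Assembly Dereplicator is not reproduced here.

```python
import random
from collections import defaultdict

MIN_LENGTH = 50_000       # sequences shorter than 50 kbp are removed
MAX_PER_GENUS = 1000      # each genus keeps at most 1000 assemblies
EXCLUDED_GENUS = "Mycobacterium"


def subsample_by_genus(records, min_length=MIN_LENGTH,
                       max_per_genus=MAX_PER_GENUS, seed=0):
    """Group sequence IDs by genus after filtering, then cap each genus.

    records: iterable of (seq_id, genus, length) tuples (hypothetical layout).
    """
    by_genus = defaultdict(list)
    for seq_id, genus, length in records:
        if length < min_length or genus == EXCLUDED_GENUS:
            continue
        by_genus[genus].append(seq_id)

    rng = random.Random(seed)  # seed is an assumption, used for repeatability
    capped = {}
    for genus, ids in by_genus.items():
        if len(ids) > max_per_genus:
            capped[genus] = rng.sample(ids, max_per_genus)
        else:
            capped[genus] = list(ids)
    return capped


example = [
    ("s1", "Escherichia", 60_000),
    ("s2", "Escherichia", 40_000),      # dropped: shorter than 50 kbp
    ("s3", "Mycobacterium", 4_000_000), # dropped: excluded genus
    ("s4", "Bacillus", 50_000),
]
selected = subsample_by_genus(example)
```

In this toy input only `s1` and `s4` survive; how the genus of each sequence was actually determined is not stated in the description.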
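For readers who want to assemble a comparable ART invocation, the sketch below builds an `art_illumina` argument list from the parameters quoted above. The `-p` (paired-end), `-i` (input), `-o` (output prefix), and `-f` (fold coverage) flags come from ART's standard usage rather than from the description, and the file names and coverage value are hypothetical placeholders, since the description does not state how read depth per reference was set.

```python
def art_command(reference_fasta, out_prefix, fold_coverage):
    """Build an art_illumina argument list for paired MiSeq v3 reads.

    reference_fasta, out_prefix and fold_coverage are placeholders; only
    -ss MSv3 -l 150 -m 250 -s 10 are taken from the dataset description.
    """
    return [
        "art_illumina",
        "-ss", "MSv3",             # MiSeq v3 error profile
        "-p",                      # paired-end simulation
        "-l", "150",               # read length
        "-m", "250",               # mean fragment length
        "-s", "10",                # fragment length standard deviation
        "-f", str(fold_coverage),  # assumed: depth is not given in the source
        "-i", reference_fasta,
        "-o", out_prefix,
    ]


cmd = art_command("reference.fasta", "simulated_", 10)
print(" ".join(cmd))
```

The list could be passed to `subprocess.run(cmd, check=True)` on a machine where ART is installed.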
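The final filtering step can be illustrated with a small predicate over read sequences. This sketch treats any character other than A, C, G, or T as ambiguous and drops a pair when either mate fails the check; the pair-level rule is an assumption, as the description only says that reads with any ambiguous base were removed.

```python
UNAMBIGUOUS = frozenset("ACGT")


def has_ambiguous_base(sequence):
    """Return True if the read contains any character other than A, C, G, T."""
    return not set(sequence.upper()) <= UNAMBIGUOUS


def drop_ambiguous_pairs(pairs):
    """Keep only read pairs in which neither mate has an ambiguous base.

    pairs: iterable of (mate1, mate2) sequence strings.
    Dropping the whole pair is an assumption, not stated in the source.
    """
    return [
        (mate1, mate2)
        for mate1, mate2 in pairs
        if not (has_ambiguous_base(mate1) or has_ambiguous_base(mate2))
    ]


pairs = [
    ("ACGT", "TTGA"),  # kept
    ("ACNT", "TTGA"),  # dropped: N in mate 1
    ("acgt", "GGRC"),  # dropped: R in mate 2
]
clean = drop_ambiguous_pairs(pairs)
```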
Created:
2024-02-15