Campylobacter dataset with simulated inter and intra genus contaminations

Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/4601405
Description:
This dataset supports the detection of inter- and intra-genus contaminations in Campylobacter. Its design follows the concepts presented in https://doi.org/10.1186/s13059-019-1914-x: Illumina reads from complete Campylobacter genomes were simulated and artificially mixed at different concentrations and genomic distances. The mixed reads were assembled using shovill.

This dataset contains the following files:
simulated_reads.tar: The simulated reads for 248 Campylobacter samples
assemblies.tar.gz: Shovill assemblies of all read data
genome_info_Ca.tsv: Description of the original complete Campylobacter genomes
metadata_Ca.tsv: Mixing information for all provided samples

Details: We downloaded all complete Campylobacter genomes from NCBI RefSeq. Next, we computed the MLST sequence type (ST) using mlst (https://github.com/tseemann/mlst) and excluded all samples without an ST, resulting in a final dataset of 218 samples. We then determined the genetic similarity between these samples by computing pairwise MLST allele distances (AD) using https://github.com/tseemann/cgmlst-dists. For each sample, we attempted to find a close, an intermediate and a distant matching sample, following the definition proposed in https://doi.org/10.1186/s13059-019-1914-x: close (same ST, 0 AD), intermediate (2-6 AD), distant (7 AD). We selected two C. coli and six C. jejuni samples with at least one close, intermediate and distant matching sample. For each species we selected genomes with maximal overall genomic diversity and simulated reads from the selected genomes using ART_Illumina v2.5.8 (see FDA for details).

Next, to create simulated contaminated datasets, we combined reads from the eight samples and their respective matching samples using the script select_reads.pl (http://github.com/apightling/contamination). Additionally, we created inter-genus contaminations by mixing reads of the eight Campylobacter spp. samples with reads from any of the other three genera (Listeria, Salmonella, Escherichia) of the analogous FDA dataset (https://doi.org/10.6084/m9.figshare.c.4282706.v1).
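The close / intermediate / distant grouping described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline (they used mlst and cgmlst-dists). The function names and the allele profiles used with them are hypothetical, and the allele distance is assumed here to be the number of loci at which two MLST profiles differ. The thresholds are taken literally from the description, so distances that it does not name (for example 1 AD) are left unclassified.

```python
from typing import Optional, Sequence


def allele_distance(profile_a: Sequence[int], profile_b: Sequence[int]) -> int:
    """Count the loci at which two MLST allele profiles differ (assumed AD definition)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def distance_category(ad: int, same_st: bool) -> Optional[str]:
    """Map an allele distance to the categories named in the dataset description."""
    if ad == 0 and same_st:
        return "close"          # same ST, 0 AD
    if 2 <= ad <= 6:
        return "intermediate"   # 2-6 AD
    if ad == 7:
        return "distant"        # 7 AD, as stated in the description
    return None                 # not covered by the stated definition


# Hypothetical seven-locus profiles, for illustration only.
reference = [2, 1, 1, 3, 2, 1, 5]
candidate = [2, 1, 4, 3, 7, 1, 6]

ad = allele_distance(reference, candidate)
print(ad, distance_category(ad, same_st=False))
```

With these made-up profiles, three loci differ, so the pair falls in the intermediate group.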
Created:
2021-03-13