Simulated NGS read datasets for novel human virus prediction

Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/3630803
Description:
This repository contains simulated Illumina read datasets for novel human virus prediction and associated metadata extracted from the Virus Host Database (https://www.genome.jp/virushostdb/). The reads are 250bp long and were simulated with Mason (https://www.seqan.de/apps/mason/) from genomes downloaded from NCBI. The training-validation-test split was done on whole viral sequences to ensure "novelty" of validation and test viruses. The training sets contain 10 million reads per class, the validation sets 1.25 million reads per class, and the test sets 1.25 million paired reads per class.

The negative class sets contain reads simulated from chordate-infecting ("cho"), metazoan-infecting ("met"), eukaryote-infecting ("euk") and all-nonhuman viruses. The positive class contains human-infecting viruses. The stratified dataset ("strat") contains an equal number of reads from "cho", "met but not cho", "euk but not met" and "all but not euk".

Species-level datasets ("humspec", "allspec" and "chospec", with the corresponding fasta and *_species.rds files) are constructed analogously, but ensure that all viruses of a given species are assigned to exactly one of the training, validation or test sets. This is a stricter setting that models a "novel viral species" scenario while reflecting within-species phenotype diversity.

blast_hits.gz contains BLAST hits of human virome reads from Moustafa et al., 2017 (https://doi.org/10.1371/journal.ppat.1006292) searched against our training database (see paper for details). The second column holds the matched label and the accession number of the matched reference. blast_labels_complete.gz contains extracted labels for all virome reads, including those without any matches.

Note: one of the read headers (>3c8ac47039d32b11c8fe23f588e444e9) from Moustafa et al. is slightly corrupted with null characters. You can remove them with sed 's/\x0//g' or equivalent.
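As one possible "equivalent" to the sed command mentioned in the note, the sketch below strips NUL bytes from a file in Python. It is a minimal illustration, not part of the dataset's tooling: the function names `strip_nulls` and `clean_file` are hypothetical, and it assumes the input is either plain text or gzip-compressed (chosen by the `.gz` suffix).

```python
import gzip


def strip_nulls(chunk: bytes) -> bytes:
    """Remove NUL (0x00) bytes from a block of data, like sed 's/\\x0//g'."""
    return chunk.replace(b"\x00", b"")


def _open(path, mode):
    # Assumption: a ".gz" suffix means gzip; anything else is read/written as-is.
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def clean_file(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, dropping every NUL byte along the way."""
    with _open(src_path, "rb") as src, _open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # Deleting single bytes is safe per chunk: no pattern spans a boundary.
            dst.write(strip_nulls(chunk))
```

A call such as `clean_file("reads.fasta.gz", "reads.clean.fasta.gz")` would write a cleaned copy; both file names here are placeholders rather than files known to exist in the repository.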

Created:
2020-12-09