Simulated NGS read datasets for novel human virus prediction

Source: NIAID Data Ecosystem, indexed 2026-03-12

Download link:

https://zenodo.org/record/3630803

Description:

This repository contains simulated Illumina read datasets for novel human virus prediction, together with associated metadata extracted from the Virus-Host Database (https://www.genome.jp/virushostdb/). The reads are 250 bp long and were simulated with Mason (https://www.seqan.de/apps/mason/) from genomes downloaded from NCBI. The training-validation-test split was done on whole viral sequences to ensure the "novelty" of the validation and test viruses. The training sets contain 10 million reads per class, the validation sets 1.25 million reads per class, and the test sets 1.25 million paired reads per class.

The positive class contains human-infecting viruses. The negative class sets contain reads simulated from chordate-infecting ("cho"), metazoan-infecting ("met"), eukaryote-infecting ("euk") and all non-human viruses. The stratified dataset ("strat") contains an equal number of reads from "cho", "met but not cho", "euk but not met" and "all but not euk".

Species-level datasets ("humspec", "allspec" and "chospec", with the corresponding fasta and *_species.rds files) are constructed analogously, but ensure that all viruses of a given species are assigned to exactly one of the training, validation or test sets. This is a stricter setting that models a "novel viral species" scenario while reflecting within-species phenotype diversity.

blast_hits.gz contains BLAST hits of human virome reads from Moustafa et al., 2017 (https://doi.org/10.1371/journal.ppat.1006292) searched against our training database (see paper for details). The second column gives the matched label and the accession number of the matched reference. blast_labels_complete.gz contains the extracted labels for all virome reads, including those without any matches.

Note: one of the read headers (>3c8ac47039d32b11c8fe23f588e444e9) from Moustafa et al. is slightly corrupted with null characters. You can remove them with sed 's/\x0//g' or equivalent.
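For readers who prefer not to use sed, the null-character cleanup mentioned in the note can be sketched in Python. This is only an illustrative equivalent, not part of the dataset's tooling: the function name `strip_null_bytes`, the helper `_open`, and the file names in the usage comment are assumptions made for this example. It streams the file in chunks, so it should also cope with large read files, and it handles gzip-compressed input or output based on the `.gz` suffix.

```python
import gzip


def _open(path, mode):
    """Open a plain or gzip-compressed file in binary mode, chosen by suffix."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def strip_null_bytes(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, dropping every NUL (0x00) byte.

    Hypothetical helper: mirrors the effect of `sed 's/\\x0//g'`.
    Returns the number of NUL bytes that were removed.
    """
    removed = 0
    with _open(src_path, "rb") as src, _open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            cleaned = chunk.replace(b"\x00", b"")
            removed += len(chunk) - len(cleaned)
            dst.write(cleaned)
    return removed


# Example call (file names are placeholders, not files from the repository):
# n = strip_null_bytes("virome_reads.fasta", "virome_reads.clean.fasta")
# print(f"removed {n} null bytes")
```

Writing to a separate output file rather than editing in place keeps the original download intact, which makes it easy to confirm that only the corrupted header changed.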


Created:

2020-12-09
