H5N1 Wastewater Detection Demo Dataset
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14638948
Description:
# H5N1 Wastewater Detection Demo Dataset

## Overview
This dataset combines simulated H5N1 influenza reads with real wastewater metagenome data to create a benchmark for viral detection methods. It simulates a scenario where a novel H5N1 strain is present in urban wastewater at detectable levels.

## Dataset Composition
- Total reads: 707,830
  - H5N1 reads: 1,120 (0.16%)
  - Wastewater reads: 706,710 (99.84%)
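
The percentages in the composition list follow directly from the read counts. A minimal sketch of that arithmetic, using only the counts reported above (the variable and function names are illustrative, not taken from the dataset's scripts):

```python
# Read counts as reported in the composition list.
h5n1_reads = 1_120
wastewater_reads = 706_710
total_reads = h5n1_reads + wastewater_reads


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


h5n1_pct = percent(h5n1_reads, total_reads)
wastewater_pct = percent(wastewater_reads, total_reads)
print(total_reads, h5n1_pct, wastewater_pct)
```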

### Viral Content Breakdown
Total viruses (0.34% of all reads):
- H5N1 (0.16%)
- Caudovirales (0.18%)
  - Siphoviridae (0.12%)
  - Podoviridae (0.06%)
- Microviridae (0.08%)
- Other viruses (0.02%)

## Data Sources

### H5N1 Component
- Source: Influenza A virus (A/chicken/Egypt/N19604C/2021(H9N2))
- NCBI Accessions:
  - PB2: ON374267.1
  - PB1: ON374268.1
  - PA: ON374269.1
  - HA: ON374270.1 (Modified with mutations)
  - NP: ON374271.1
  - NA: ON374272.1
  - M: ON374273.1
  - NS: ON374266.1

#### Modifications
- Mutation rate: 0.1% (introduced using wgsim)
- Error rate: 0.1%
- Coverage: 10x
- Read length: 150 bp
- Sequencing profile: HiSeq 2500
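
Coverage, read length, and read count are tied together by the usual relation reads ≈ coverage × template length / read length. The sketch below applies it to the reported figures (10x, 150 bp, 1,120 H5N1 reads) as a rough consistency check. The function names are illustrative, and the read simulator's own per-sequence rounding may give slightly different numbers.

```python
def reads_for_coverage(template_length: int, coverage: float, read_length: int) -> int:
    """Approximate number of single-end reads needed for a given fold coverage."""
    return round(coverage * template_length / read_length)


def implied_template_length(n_reads: int, coverage: float, read_length: int) -> float:
    """Total template length (bp) implied by a read count at a given coverage."""
    return n_reads * read_length / coverage


# Figures reported in this README: 1,120 reads, 10x coverage, 150 bp reads.
implied_bp = implied_template_length(1_120, 10, 150)
print(implied_bp)

# Round trip: the implied length gives back the reported read count.
print(reads_for_coverage(int(implied_bp), 10, 150))
```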

### Wastewater Component
- Source: Global Urban Virome Project
- Accession: ERR2734409
- Original composition preserved
- Represents typical urban wastewater viral diversity

## Directory Structure
- `combined_data/`
  - `input/`
    - `fasta/`
      - `h5n1.fasta`
      - `wastewater.fasta`
    - `h5n1/`
      - `final_reads.fq`
      - `h5n1_complete.fasta`
    - `uncompressed/`
      - `ERR2734409.fastq`
    - `wastewater/`
      - `ERR2734409.fastq.gz`
  - `scripts/`
    - `combine_segments.py`
    - `convert_fastq.py`
    - `workflow.sh`
  - `README.md`

## Dataset Creation Method

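### 1. H5N1 Sequence Preparation
```bash
# Introduce mutations in the HA segment (ON374270.1 in the accession list).
# -N: number of read pairs, -e: base error rate, -r: mutation rate, -R: indel fraction.
# wgsim needs two output files; the second filename here is an assumption.
wgsim -N 100000 -e 0.001 -r 0.001 -R 0.0 ON374270.1.fasta HA_mutated_1.fq HA_mutated_2.fq

# Convert mutated sequence to FASTA
python3 convert_fastq.py

# Combine all segments
python3 combine_segments.py
```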
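
The two helper scripts are not reproduced in this README, so their exact behavior is unknown. As a rough idea of what a FASTQ-to-FASTA conversion and a segment-combining step can look like, here is a minimal standard-library sketch. All function names and file handling below are assumptions for illustration, not the contents of `convert_fastq.py` or `combine_segments.py`.

```python
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def read_fastq(path) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) pairs from a 4-line-per-record FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:  # end of file (or a blank line) stops iteration
                break
            sequence = handle.readline().rstrip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality line, not needed for FASTA
            yield header[1:], sequence  # drop the leading '@'


def fastq_to_fasta(fastq_path, fasta_path) -> int:
    """Write every FASTQ record as a FASTA record; return the record count."""
    count = 0
    with open(fasta_path, "w") as out:
        for read_id, sequence in read_fastq(fastq_path):
            out.write(f">{read_id}\n{sequence}\n")
            count += 1
    return count


def combine_fastas(fasta_paths: Iterable, combined_path) -> None:
    """Concatenate FASTA files in the given order into one file."""
    with open(combined_path, "w") as out:
        for path in fasta_paths:
            text = Path(path).read_text()
            # Make sure each file ends with a newline so records do not merge.
            out.write(text if text.endswith("\n") else text + "\n")
```

For example, `combine_fastas(["PB2.fasta", "PB1.fasta"], "h5n1_complete.fasta")` would write the two segments back to back; the per-segment filenames in that call are hypothetical.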

### 2. Read Simulation
```bash
# Generate Illumina reads using ART
# -ss HS25: HiSeq 2500 profile, -l 150: read length, -f 10: fold coverage
art_illumina -i h5n1_complete.fasta -l 150 -ss HS25 -f 10 -sam -nf 0 -o final_reads
```

### 3. Dataset Combination
```bash
# Combine H5N1 and wastewater reads
cat final_reads.fq ERR2734409.fastq > combined_reads.fastq
```

## Validation Results
```
Dataset Statistics:
==================
Total reads: 707,830
H5N1: 1,120 reads (0.16%)
Wastewater: 706,710 reads (99.84%)

Read Lengths:
Mean: 150.0
Std dev: 0.0

GC Content:
Mean: 47.2%
Std dev: 8.4%
```
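
Statistics of this kind can be recomputed from the combined FASTQ with a few lines of standard-library Python. The sketch below is one possible way to do it, not the validation code used for the numbers above. It assumes four lines per record, assumes wastewater read IDs start with the run accession `ERR2734409`, and uses the population standard deviation; any of these may differ from the original analysis.

```python
import statistics


def fastq_stats(path, wastewater_prefix="ERR2734409"):
    """Summarize read count, origin, length, and GC content of a FASTQ file."""
    lengths, gc_percents = [], []
    wastewater = 0
    with open(path) as handle:
        for line_no, line in enumerate(handle):
            field = line_no % 4  # 0 = header, 1 = sequence, 2 = '+', 3 = quality
            text = line.strip()
            if field == 0 and text.startswith("@" + wastewater_prefix):
                wastewater += 1  # assumed ID convention for wastewater reads
            elif field == 1 and text:
                seq = text.upper()
                lengths.append(len(seq))
                gc = seq.count("G") + seq.count("C")
                gc_percents.append(100 * gc / len(seq))
    # statistics.mean raises StatisticsError if the file contains no reads.
    return {
        "total_reads": len(lengths),
        "wastewater_reads": wastewater,
        "other_reads": len(lengths) - wastewater,
        "mean_length": statistics.mean(lengths),
        "sd_length": statistics.pstdev(lengths),
        "mean_gc": statistics.mean(gc_percents),
        "sd_gc": statistics.pstdev(gc_percents),
    }
```

Calling `fastq_stats("combined_reads.fastq")` returns a dictionary whose fields can be compared against the reported values.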

## Citations
- Nieuwenhuijse, D.F., Oude Munnink, B.B., Phan, M.V.T. et al. Setting a baseline for global urban virome surveillance in sewage. Sci Rep 10, 13748 (2020). https://doi.org/10.1038/s41598-020-69869-0
- Li, H. wgsim: Read simulator for next generation sequencing. (2011).
- Huang, W., Li, L., Myers, J.R., Marth, G.T. ART: a next-generation sequencing read simulator. Bioinformatics 28(4), 593-594 (2012).

Created: 2025-01-13



