K-FluDB
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14261913
Description:
K-FluDB: A Novel K-Mer Based Database for Enhanced Genomic Surveillance of Influenza A Viruses
K-FluDB is a compressed database composed of distinct sub-sequences specific to 50 influenza A subtypes. It includes unique sequences for all 18 hemagglutinin (HA) and 11 neuraminidase (NA) subtypes. The original influenza sequences were obtained from the NCBI database on May 8, 2022, and comprise 895,900 influenza A sequences in total.
To generate this database, the sequences were first subsampled by genomic segment and variant group, yielding 81,262 sequences. These served as input for the PanGen-InfluenzaA tool (available on GitHub), which constructs pangenomes by identifying both unique subtype-specific sequences and sequences shared across multiple subtypes.
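The distinction between subtype-specific and shared sequences can be illustrated with a toy k-mer example. This is a minimal sketch, not the PanGen-InfluenzaA implementation: the function names, the fixed k, the subtype labels, and the two short input sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def split_unique_shared(seqs_by_subtype, k):
    """Partition k-mers into subtype-specific and shared sets.

    seqs_by_subtype maps a subtype label to a list of sequences.
    Returns (unique, shared): unique maps each label to the k-mers
    seen only in that label; shared holds k-mers seen in two or more.
    """
    owners = defaultdict(set)  # k-mer -> labels in which it occurs
    for subtype, seqs in seqs_by_subtype.items():
        for seq in seqs:
            for km in kmers(seq, k):
                owners[km].add(subtype)

    unique = defaultdict(set)
    shared = set()
    for km, subtypes in owners.items():
        if len(subtypes) == 1:
            unique[next(iter(subtypes))].add(km)
        else:
            shared.add(km)
    return dict(unique), shared


# Toy input (made-up sequences, not taken from K-FluDB).
toy = {"H1": ["ACGTAC"], "H3": ["CGTACG"]}
unique, shared = split_unique_shared(toy, k=4)
```

Here `unique` holds one 4-mer per label and `shared` holds the two 4-mers common to both toy sequences. The real tool works on tens of thousands of sequences and may use a different k or variable-length sub-sequences.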
Repository Contents
This repository contains the following files:
pangenome.fasta: The complete influenza A pangenome, including both unique (subtype-specific) and non-unique sequences.
pangenome_unique_4.fasta: Unique sequences from segment 4, which can be used to classify Hx subtypes.
pangenome_unique_6.fasta: Unique sequences from segment 6, suitable for identifying Nx subtypes.
pangenome_unique_46.fasta: A combined dataset containing unique sequences from both segment 4 and segment 6.
Additionally, within the segment_4 and segment_6 directories, there are individual files corresponding to each influenza A subtype. These files contain subtype-specific unique sequences and are named according to their respective NCBI taxonomic identifiers.
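One way the per-subtype files might be used is sketched below: load every file in a directory, label its sequences by the file name, and assign a query sequence to the label whose marker sequences occur in it most often. This is a hedged illustration only. The `.fasta` extension for the per-subtype files, the exact-substring matching rule, and the helper names `read_fasta`, `load_subtype_dir`, and `classify` are assumptions, not part of the repository.

```python
from pathlib import Path


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def load_subtype_dir(directory, pattern="*.fasta"):
    """Map each file stem (assumed to be a taxonomic ID) to its sequences."""
    markers = {}
    for path in sorted(Path(directory).glob(pattern)):
        with open(path) as fh:
            markers[path.stem] = [seq for _, seq in read_fasta(fh)]
    return markers


def classify(query, markers):
    """Return the label with the most marker sequences found in query, or None."""
    query = query.upper()
    hits = {
        label: sum(bool(seq) and seq in query for seq in seqs)
        for label, seqs in markers.items()
    }
    best = max(hits, key=hits.get, default=None)
    return best if best is not None and hits[best] > 0 else None


# In-memory demo with hypothetical labels and made-up marker sequences.
demo_markers = {"label_a": ["ACGTAC"], "label_b": ["TTTGGG"]}
demo_result = classify("ggACGTACtt", demo_markers)
```

With the real data, a call such as `classify(read, load_subtype_dir("segment_4"))` would return a file stem, which could then be mapped to an Hx subtype through the NCBI taxonomy. Exact substring matching is the simplest possible rule; a production classifier would likely tolerate mismatches and weigh hits by sequence length.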
Compression Efficiency and Classification Accuracy
K-FluDB achieves a relative compression index of 96.54% when using the complete pangenome and 99.64% when considering only subtype-specific sequences. The average precision for correctly classifying Hx and Nx subtypes using the unique sequences is 99.2% and 99.71%, respectively.
This database provides a highly efficient and accurate resource for influenza A subtype classification while significantly reducing the storage and computational requirements associated with full-genome analyses.
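The description does not define how the two reported metrics are computed. The sketch below shows one plausible reading, under stated assumptions: the relative compression index is taken to be the fraction of the original size removed by compression, and precision is the standard ratio of true positives to all positive calls. The function names and the example numbers are illustrative and are not taken from K-FluDB.

```python
def relative_compression(original_size, compressed_size):
    """Fraction of the original size saved (assumed definition)."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 1.0 - compressed_size / original_size


def precision(true_positives, false_positives):
    """Standard precision: TP / (TP + FP)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no positive calls to evaluate")
    return true_positives / total


# Illustrative numbers only.
example_compression = relative_compression(1000, 250)  # 0.75
example_precision = precision(9, 1)                     # 0.9
```

Under this reading, an index of 96.54% would mean the pangenome occupies about 3.46% of the size of its input sequences.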
Acknowledgements
This research was partially supported by grant PAPIIT-DGAPA-IN230523, awarded to BT. The first author gratefully acknowledges the scholarship provided by CONAHCYT. We also thank the National Autonomous University of Mexico (UNAM) for granting access to the MIZTLI supercomputer, supported by the General Directorate of Computing and Information and Communication Technologies (DGTIC) through project LANCAD-UNAM-DGTIC-350.
Created:
2025-03-21



