Data for: Ensemble Learning: Predicting Human Pathogenicity of Hematophagous Arthropod Vector-Borne Viruses
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/hvk9f2by6k
Description:
Overview: This dataset supports the study on predicting the human pathogenicity of viruses carried by blood-feeding arthropods (mosquitoes, ticks, etc.) using ensemble learning. It integrates large-scale epidemiological data with genomic functional annotations to assess zoonotic spillover risks.
Dataset Components:
Epidemiological Characteristics Dataset: Covers 294 viruses and 37 distinct features categorized into:
Virus Properties: Baltimore classification and taxonomy.
Vector Host Features: Family and genus of vectors (e.g., Culicidae, Ixodidae).
Non-vector Host Diversity: Distribution across 15 groups, emphasizing the impact of Perissodactyla and Carnivora orders on pathogenicity.
Viral Sequence Pathogenic Function Dataset: Includes functional annotations for 71,623 viral sequences. Using SeqScreen, 10 key Functional Signatures of Concern (FunSoCs) were identified, such as:
Viral Adhesion: Found in 62% of sequences, crucial for host cell entry.
Host Xenophagy & Viral Counter Signaling: Key features for immune evasion.
Viral Invasion: Associated with non-pathogenic traits in this specific context.
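A prevalence figure such as the 62% quoted for Viral Adhesion is the share of annotated sequences that carry a given FunSoC. The sketch below shows one way to compute it in plain Python. The input layout (a mapping from sequence ID to a set of labels), the function name, and the label strings are hypothetical and do not reflect SeqScreen's actual output format or the dataset's file structure.

```python
from collections import Counter


def funsoc_prevalence(annotations):
    """Return the fraction of sequences carrying each FunSoC label.

    `annotations` maps a sequence ID to the collection of FunSoC labels
    assigned to that sequence (hypothetical layout, for illustration).
    """
    n_sequences = len(annotations)
    if n_sequences == 0:
        return {}
    # Count each label at most once per sequence.
    counts = Counter(
        label for labels in annotations.values() for label in set(labels)
    )
    return {label: count / n_sequences for label, count in counts.items()}


# Toy annotations (made up; not taken from the dataset).
toy_annotations = {
    "seq1": {"viral_adhesion", "viral_invasion"},
    "seq2": {"viral_adhesion"},
    "seq3": {"host_xenophagy"},
    "seq4": set(),
}
prevalence = funsoc_prevalence(toy_annotations)
```

Sequences with no annotated FunSoC (like `seq4` here) still count toward the denominator, so each value reads as "fraction of all sequences".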
Technical Application: The data were utilized to develop and validate XGBoost-based models:
Regression Model: Achieved an R² of 90.6%, correlating host diversity with pathogenicity.
Classification Model: Achieved an F1 score of 96.79% for identifying pathogenic potential at the sequence level.
External Validation: Includes predictions for 228 sequences, highlighting potential risks from Palma and Zaliv Terpeniya viruses.
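For readers unfamiliar with the two metrics reported above, the following minimal sketch computes R² and the F1 score in plain Python. It is not the authors' evaluation code: the function names and toy inputs are illustrative assumptions, and the toy values are unrelated to the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy values for illustration only (hypothetical, not from the dataset).
toy_r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
toy_f1 = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

An R² reported as 90.6% corresponds to 0.906 on the scale returned by `r_squared`, and an F1 of 96.79% corresponds to 0.9679 from `f1_score`.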
Research Value: This resource supports strain-level prediction of pathogenicity from metagenomic data, providing a basis for early warning systems for emerging zoonotic threats.
Created:
2026-01-26



