Thesis on Global Patterns of Sampling Bias in Molecular Sequences of Vertebrate Viruses: Supplementary Data and Code
Source: DataCite Commons. Updated: 2026-05-05. Indexed: 2026-05-07.
Download link:
https://zenodo.org/doi/10.5281/zenodo.20039252
Description:
Supplementary Data and Code from my thesis on "Global Patterns of Sampling Bias in Molecular Sequences of Vertebrate Viruses".
Virus family data were obtained from NCBI Virus.
GDP and population size data were obtained from the World Bank.
The total number of animal species per country (used as a biodiversity proxy) was obtained from the IUCN Red List.
Per-family NCBI Virus data are provided as .csv files named in the format "NCBI_VirusFamily_09042026.csv".
Data for Coronaviridae, Orthomyxoviridae and Retroviridae are provided as .fst files, which can be read into R. These files contain the same information as the files for the other virus families, stored in a different format for more efficient processing; the code shows how they were processed.
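The .csv naming format described above can be used to index the per-family files programmatically. The following is a minimal Python sketch, independent of the author's code. It assumes that every file name follows the pattern "NCBI_<family>_<eight digits>.csv" exactly; the helper names `parse_family_file` and `index_family_files` are hypothetical, and the eight digits are kept as an opaque string because the description does not state what they encode.

```python
import re
from pathlib import Path

# Assumed pattern: "NCBI_<family>_<8 digits>.csv". The meaning of the digits
# is not stated in the description, so they are kept as an opaque string.
_NAME = re.compile(r"^NCBI_(?P<family>[A-Za-z]+)_(?P<stamp>\d{8})\.csv$")


def parse_family_file(name):
    """Return (family, stamp) for a matching file name, else None."""
    match = _NAME.match(Path(name).name)
    if match is None:
        return None
    return match.group("family"), match.group("stamp")


def index_family_files(names):
    """Map each virus family to its file name, skipping non-matching names."""
    index = {}
    for name in names:
        parsed = parse_family_file(name)
        if parsed is not None:
            index[parsed[0]] = name
    return index


if __name__ == "__main__":
    # The first name is the placeholder from the description; the second is
    # a hypothetical non-matching file used to show that it is skipped.
    files = ["NCBI_VirusFamily_09042026.csv", "README.txt"]
    print(index_family_files(files))
```

The .fst files are not covered by this sketch, since they are intended to be read in R.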
Code files:
loading_NCBI_Virus_datasets (must be run before any of the other code files)
explorations_of_geographic_bias (code for maps, country-level data, log-log models, k-means clustering)
explorations_of_taxonomic_bias (code for virus family bar chart, distinct vertebrate host species information, phylogenetic heatmap)
explorations_of_temporal_bias (code for cumulative discovery curves, discovery rates, Kruskal-Wallis tests)
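As an illustration of the log-log models mentioned for the geographic analysis, the following is a minimal Python sketch of an ordinary least squares fit on log-transformed values. It is not the author's implementation. The choice of variables (sequence counts against GDP), the function name `fit_log_log`, and the handling of non-positive values are all assumptions, and the example values are synthetic rather than taken from the dataset.

```python
import math


def fit_log_log(x, y):
    """Ordinary least squares fit of log10(y) = intercept + slope * log10(x).

    Pairs with a non-positive value are dropped, since their log is undefined.
    Returns (slope, intercept).
    """
    pairs = [
        (math.log10(a), math.log10(b))
        for a, b in zip(x, y)
        if a > 0 and b > 0
    ]
    if len(pairs) < 2:
        raise ValueError("need at least two positive (x, y) pairs")
    n = len(pairs)
    mean_x = sum(p[0] for p in pairs) / n
    mean_y = sum(p[1] for p in pairs) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in pairs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Synthetic values that follow y = 2 * x**0.5 exactly; they are NOT
    # taken from the dataset and only exercise the function.
    gdp = [1e9, 1e10, 1e11, 1e12]
    n_sequences = [2 * g ** 0.5 for g in gdp]
    slope, intercept = fit_log_log(gdp, n_sequences)
    print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

In a fit of this kind the slope is the exponent of the power-law relationship: a slope below 1 means the response grows less than proportionally with the predictor.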
The pipeline used to obtain taxonomy information via TaxonKit is documented in "Taxonkit Taxonomic Information.txt", which contains the lines of code used and a short description of the code.
GDP, population and biodiversity (IUCN_species_info) data are saved as .csv files.
Provider:
Zenodo
Created:
2026-05-05



