
Supporting data for "The complexity landscape of viral genomes"

Source: DataCite Commons. Updated: 2025-05-26. Indexed: 2025-04-15.
Download link:
http://gigadb.org/dataset/102241
Resource description:
Viruses are among the shortest yet most abundant species, harboring the minimal instructions needed to infect cells, adapt, multiply, and exist. Despite the substantial number of viral genome sequences now available, the scientific literature lacks a complexity landscape that automatically sheds light on the organization, relations, and fundamental characteristics of viral genomes.

This work provides a comprehensive landscape of viral genome complexity (or quantity of information), identifying the most redundant and the most complex groups with respect to their genome sequence, and describing their distribution and characteristics at both large and local scales. For this purpose, we measure the sequence complexity of each available viral genome using data compression, demonstrating that adequate data compressors can efficiently quantify the complexity of viral genome sequences, including sub-sequences better represented by algorithmic sources (e.g., inverted repeats). Using a state-of-the-art genomic compressor on an extensive database of viral genomes, we show that dsDNA viruses are, on average, the most redundant viruses, while ssDNA viruses are the least redundant. In contrast, dsRNA viruses show lower redundancy than ssRNA viruses. We extend the ability of data compressors to quantify local complexity (or information content) in viral genomes using complexity profiles, providing for the first time a direct complexity analysis of human herpesviruses. We also devise a feature-based classification methodology that can accurately distinguish viral genomes at different taxonomic levels without direct comparisons between sequences. This methodology combines data compression with simple measures, such as GC-content percentage and sequence length, followed by machine learning classifiers.

This manuscript presents methodologies and findings that are highly relevant for understanding the patterns of similarity and singularity between viral groups, opening new frontiers for studying viral genome organization while depicting the complexity trends and classification components of these genomes at different taxonomic levels. The whole study is supported by an extensive website (https://asilab.github.io/canvas/) for exploring the viral genome characterization through dynamic and interactive approaches.
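As a rough illustration of the feature-based idea described above, the following minimal sketch computes three per-genome features: sequence length, GC content, and a compression-based complexity estimate. It is not the authors' pipeline. The general-purpose `lzma` compressor from the Python standard library stands in for the specialized genomic compressor used in the study, and the function names (`gc_content`, `normalized_compression`, `genome_features`) as well as the normalization by 2 bits per base are assumptions made for this example.

```python
import lzma


def gc_content(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T symbols of the sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def normalized_compression(seq: str) -> float:
    """Compressed size in bits divided by 2 bits per base.

    Assumption: lzma is only a stand-in for a genomic compressor, so the
    values are illustrative. Lower values indicate higher redundancy.
    Container overhead inflates the value for very short sequences.
    """
    data = seq.upper().encode("ascii")
    if not data:
        return 0.0
    compressed_bits = 8 * len(lzma.compress(data, preset=9))
    return compressed_bits / (2 * len(data))


def genome_features(seq: str) -> dict:
    """Hypothetical feature vector: length, GC content, complexity."""
    return {
        "length": len(seq),
        "gc": gc_content(seq),
        "nc": normalized_compression(seq),
    }
```

Feature vectors of this kind could then be passed to any standard machine learning classifier; which classifiers and compressor settings the study actually used is not stated in this description.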
Provider:
GigaScience Database
Created:
2022-07-06