Supporting data for "MBGC2: Boosting compression via efficient encoding of approximate matches in genome collections"

Name: Supporting data for "MBGC2: Boosting compression via efficient encoding of approximate matches in genome collections"
Creator: GigaScience Database
Published: 2026-01-19 08:00:01
License: No description available

Source: DataCite Commons (updated 2026-01-19, indexed 2026-05-03)

Download link:

https://gigadb.org/dataset/102796/

Description:

FASTA is the primary format for representing DNA, RNA and protein sequences. While progress has been made in specialized FASTA collection compressors, they still struggle with practical limitations and inconsistent performance across different datasets, hindering effective storage and transfer of large genomic datasets.

We present an enhanced version of the Multiple Bacteria Genome Compressor (MBGC), a high-throughput, in-memory algorithm for compressing genome collections. It relies on information about maximum exact matches in the compressed set to identify possibly long approximate matches. It encodes them even when they partially overlap, boosting the compression ratio by an average of 14% across bacterial datasets, while the reengineered multi-threaded decoding speeds up decompression compared to its predecessor by around 40%. The compression ratio improvement is even more pronounced on other collections, for H. sapiens reaching 18%, and up to 55% for S. paradoxus.

MBGC2 performs consistently across diverse datasets and introduces practical features to ease data management such as archive appending, repacking, fast content listing and flexible decompression options. Benchmark tests covering nucleotide-based bacterial, viral, and human genome collections show that MBGC2 combines compression efficiency and processing speed. The tool supports working with single genomes or amino acid collections, but does not guarantee such high performance in these cases.

Provider:

GigaScience Database

Created:

2026-01-19
