
Supplemental tables for 'Integrating terminal-free sequence modeling and explainability to resolve 'dark matter' in algal genomics'

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14029554
Description:
AI language models (LMs) show promise for microbial sequence classification. We re-engineered open-source LMs (e.g., GPT-2, BLOOM, DistilRoBERTa, ELECTRA, and Mamba, ranging from 70 million (M) to 12 billion (B) parameters) to classify genomic ‘dark-matter’ using translated ORFeomes from 166 species in ten phyla. The training data comprised ~77 million sequences in total. The models achieved F1 scores up to 95 and ran 16,580x faster than BLASTP at 2.9x its recall. They effectively classified uncharacterized sequences from non-model algal species (on average ~65% of total translated ORFs) and were validated on new data, including a new, complete Hi-C/PacBio Chlamydomonas genome. Larger (> 1B parameters) LA4SR models reached high accuracy (F1 > 86) when trained on less than 2% of available data, rapidly achieving strong generalization capacity. High accuracy was achieved whether the training data had intact or scrambled terminal information, demonstrating robust generalization to incomplete sequences. Finally, we provide custom AI explainability software tools for attributing amino acid patterns to AI generative processes and interpret their outputs in evolutionary and biophysical contexts.

Table S1 | External spreadsheet. This spreadsheet contains LA4SR performance metrics, technical performance estimations, and preliminary BLAST screening results of genomes comprising the algal training data.
Table S2 | External spreadsheet. Captum attributions for 100 sequences each of algal and bacterial origin, obtained using the LayerIntegratedGradients function.
Table S3 | External spreadsheet. Influential motifs found with the DeepMotifMinerPro software introduced in this work (see Data S3).
Table S4 | External spreadsheet. LA4SR and BLAST results for new, real-world sequencing data (uploaded to NCBI SRA accession SUB14799921, related to Fig. S3).
Created:
2025-02-14