LA4SR - complete Hi-C/PacBio Chlamydomonas reinhardtii CC-18831 genome
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1184480
Description:
AI language models (LMs) show promise for biological sequence analysis. We re-engineered open-source LMs (e.g., GPT-2, BLOOM, DistilRoBERTa, ELECTRA, and Mamba, ranging from 70 million (M) to 12 billion (B) parameters) for microbial sequence classification. The models achieved F1 scores up to 95 and ran 16,580x faster than BLASTP at 2.9x its recall. They effectively classified the algal 'dark proteome' (uncharacterized proteins comprising ~65% of total proteins) and were validated on new data, including a new, complete Hi-C/PacBio Chlamydomonas genome. Larger (> 1B) LA4SR models reached high accuracy (F1 > 86) when trained on less than 2% of available data, rapidly achieving strong generalization capacity. High accuracy was achieved whether the training data had intact or scrambled terminal information, demonstrating robust generalization to incomplete sequences. Finally, we provide custom AI explainability software tools for attributing amino acid patterns to AI generative processes and interpreting their outputs in evolutionary and biophysical contexts.
Created:
2024-11-11



