
Genomic language models improve cross-species gene expression prediction and accurately capture regulatory variant effects in Brachypodium mutant lines

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://www.ncbi.nlm.nih.gov/sra/SRP682234
Description:
Predicting gene expression from cis-regulatory DNA sequences is a central challenge in plant genomics. Here, we developed deep learning sequence-to-expression (S2E) models that predict gene expression across 17 plant species using high-dimensional representations from auxiliary foundation models (the genomic language model PlantCaduceus and the chromatin accessibility model a2z) instead of one-hot encoded sequences. We first evaluated the models on unseen gene families via cross-validation: across all species, their prediction accuracy exceeded that of PhytoExpr, a state-of-the-art (SOTA) S2E model trained on the same dataset (Pearson R=0.82 vs. R=0.74). We then validated variant effect predictions using an experimental dataset of 796 Brachypodium lines (769 mutant, 27 control), specifically designed to test predictions at single-base resolution. Our models outperformed the SOTA models in predicting between-gene expression differences (regression coefficient β=0.78 vs. β=0.57). Remarkably, they also accurately predicted the effects of single-nucleotide mutations on within-gene expression, whereas SOTA models showed only weak associations (regression coefficient β=0.38 vs. β=0.08). Our results demonstrate the value of context-aware DNA sequence embeddings for predicting regulatory variant effects in plants. They also reveal a persistent accuracy gap in S2E models when moving from between-gene to allelic variation, a challenge that future studies need to address.
Overall design: RNA-seq profiling of independent Brachypodium distachyon lines derived from a single genotype (Bd21-3): 769 mutant lines and 27 control lines. Mutant lines were generated by chemical mutagenesis (seed treatment with 7 mM sodium azide). Control lines received a neutral treatment (seed treatment with 0 mM sodium azide). RNA-seq measurements were taken at the M6 generation (six generations after mutagenesis) under controlled conditions in growth chambers.
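The description distinguishes between-gene from within-gene (allelic) prediction accuracy but does not say how the regression coefficients were computed. The sketch below is one plausible reading, not the authors' procedure: it regresses per-gene mean observed expression on per-gene mean predicted expression (between-gene slope), then regresses each line's deviation from its gene mean on the corresponding predicted deviation (within-gene slope). The function names `ols_slope` and `between_within_slopes`, the input layout, and the toy values are all hypothetical.

```python
import numpy as np


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    return float(np.dot(xc, y - y.mean()) / np.dot(xc, xc))


def between_within_slopes(gene_ids, predicted, observed):
    """Return (between-gene slope, within-gene slope).

    Assumed layout: one entry per (gene, line) pair, with the gene
    identifier, the model prediction, and the measured expression.
    """
    gene_ids = np.asarray(gene_ids)
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)

    # Map every observation to an integer index of its gene.
    _, inverse = np.unique(gene_ids, return_inverse=True)
    counts = np.bincount(inverse)

    # Per-gene means of predictions and observations.
    pred_mean = np.bincount(inverse, weights=predicted) / counts
    obs_mean = np.bincount(inverse, weights=observed) / counts

    # Between-gene: do predicted gene means track observed gene means?
    between = ols_slope(pred_mean, obs_mean)

    # Within-gene: do predicted deviations from the gene mean track
    # observed deviations across lines carrying different alleles?
    within = ols_slope(predicted - pred_mean[inverse],
                       observed - obs_mean[inverse])
    return between, within


# Toy example with two genes and two lines each (made-up numbers).
between, within = between_within_slopes(
    ["g1", "g1", "g2", "g2"],
    [1.0, 2.0, 5.0, 6.0],
    [2.0, 2.5, 10.0, 10.5],
)
print(between, within)
```

Separating the two slopes matters because a model can rank genes well by average expression while still missing the much smaller expression shifts caused by individual mutations within a gene.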
Created:
2026-03-09