Data Sheet 1_Deep learning-based investigation of chloroplast translation regulatory sequences.csv

NIAID Data Ecosystem, indexed 2026-05-10

Download link:

https://figshare.com/articles/dataset/Data_Sheet_1_Deep_learning-based_investigation_of_chloroplast_translation_regulatory_sequences_csv/30857222

Description:

Understanding the architecture of translational regulatory sequences in diverse chloroplasts is critical for advancing synthetic biology and genetic engineering. In this study, a hybrid deep learning model combining convolutional neural network (CNN), long short-term memory (LSTM), Attention, and Residual architectures was developed to classify and analyse two datasets: 5′ untranslated region sequences from plants and algae, and the sequences with and without Shine-Dalgarno (SD) motifs from both groups. Using 300-nucleotide leader sequences upstream of the start codon as input, the model achieved strong prediction performance for both taxonomic origin and the presence or absence of SD motifs. However, a small subset of plant and algal sequences exhibited algal-like and plant-like patterns, respectively—an encouraging finding for identifying functional heterologous sequences from one group for use in the other group’s genome. The results further revealed significant differences in the plastid leader sequences between the datasets (Plants vs. Algae and SDs vs. without SDs), emphasising distinct features in the first 30 bp upstream of the start codon. This study proposes two potential strategies for introducing heterologous leader sequences in algal plastome engineering: (1) employing plant-derived leader sequences with algal-like patterns tailored to specific algal strains, and (2) constructing hybrid leader sequences harbouring SD motifs by fusing algae-specific ~30 bp upstream regions with their respective plant-derived distal regions. As the first deep learning model to analyse chloroplast translational regulatory sequences, the findings offer valuable guidance for identifying and predicting heterologous leader sequences in plants and algae.
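As a rough illustration of the input described above, the following sketch one-hot encodes a 300-nucleotide leader and checks the ~30 nt nearest the start codon for an SD-like motif. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the zero-padding scheme, and the use of the canonical `GGAGG` core as the SD criterion are all hypothetical, since the description does not say how sequences were encoded or how SD motifs were called.

```python
from typing import List

ALPHABET = "ACGT"
LEADER_LEN = 300    # from the description: 300 nt upstream of the start codon
PROXIMAL_LEN = 30   # from the description: region highlighted near the start codon
SD_CORE = "GGAGG"   # assumption: canonical SD core; the dataset's own criterion is not stated


def one_hot(seq: str, length: int = LEADER_LEN) -> List[List[int]]:
    """Encode a leader as a length x 4 matrix (columns A, C, G, T).

    Longer sequences keep only the bases closest to the start codon;
    shorter ones are left-padded with all-zero rows (an assumed convention).
    Ambiguous bases such as N become all-zero rows.
    """
    seq = seq.upper().replace("U", "T")[-length:]
    rows = [[0, 0, 0, 0] for _ in range(length - len(seq))]
    for base in seq:
        row = [0, 0, 0, 0]
        idx = ALPHABET.find(base)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows


def has_sd_like_motif(seq: str, window: int = PROXIMAL_LEN) -> bool:
    """Return True if the assumed SD core occurs in the last `window` bases."""
    seq = seq.upper().replace("U", "T")
    return SD_CORE in seq[-window:]


# Hypothetical leader: 280 A's followed by a 20 nt tail containing GGAGG.
example_leader = "A" * 280 + "TTGGAGGTTTATTTAAATTT"
example_matrix = one_hot(example_leader)
example_has_sd = has_sd_like_motif(example_leader)
```

A matrix of this shape is the usual input to a 1D convolutional layer; the model architecture itself (CNN, LSTM, Attention, Residual) is not reproduced here.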

Created:

2025-12-11
