
Optimal feature size for all experiments.

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://figshare.com/articles/dataset/Optimal_feature_size_for_all_experiments_/27090652
Resource description:
DNA splice junction classification is a key task in computational biology. The challenge is to predict the junction type (IE, EI, or N) from a given DNA sequence. Predicting the junction type is important for understanding gene expression patterns, disease causes, splicing regulation, and gene structure. Locating the regions where exons are joined and introns are removed during RNA splicing is very difficult because no universal rule governs this process. This study presents a two-layer hybrid approach, inspired by ensemble learning, to address this challenge. The first layer applies the grey wolf optimizer (GWO) for feature selection. GWO's exploration ability allows it to search a vast feature space efficiently, while its exploitation ability refines promising areas, leading to a more reliable feature selection. The selected features are then fed into the second layer, a classification model trained on those features. Using cross-validation, the proposed method divides the DNA splice junction dataset into training and test sets so that the classifier's generalization ability can be examined thoroughly. The ensemble model is trained on various partitions of the training set and tested on the remaining held-out fold. This process is repeated for each fold, giving a comprehensive evaluation of the classifier's performance. The method was tested on the StatLog DNA dataset. Compared with various machine learning models for DNA splice junction prediction, the proposed GWO+SVM ensemble method achieved an accuracy of 96%. This finding suggests that the proposed ensemble hybrid approach is promising for DNA splice junction classification. The implementation code for the proposed approach is available at https://github.com/EFHamouda/DNA-splice-junction-prediction.
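To make the two-layer idea concrete, the following is a minimal sketch of GWO-driven feature selection scored by k-fold cross-validation. It is not the authors' implementation (that is in the repository linked above). Several things here are assumptions made for illustration: the function names (`nearest_centroid_accuracy`, `cv_fitness`, `gwo_select`), the binarization rule (a feature is kept when its wolf coordinate exceeds 0.5), the small penalty on the number of selected features, and above all the classifier, which is a nearest-centroid stand-in used only to keep the example dependency-free. The study itself uses an SVM in the second layer.

```python
import random

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test, mask):
    """Stand-in classifier (NOT the study's SVM): predict the nearest class centroid."""
    cols = [j for j, keep in enumerate(mask) if keep]
    sums, counts = {}, {}
    for row, label in zip(X_train, y_train):
        acc = sums.setdefault(label, [0.0] * len(cols))
        for k, j in enumerate(cols):
            acc[k] += row[j]
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def sq_dist(row, centroid):
        return sum((row[j] - centroid[k]) ** 2 for k, j in enumerate(cols))

    correct = 0
    for row, label in zip(X_test, y_test):
        pred = min(centroids, key=lambda c: sq_dist(row, centroids[c]))
        correct += pred == label
    return correct / len(y_test)

def cv_fitness(mask, X, y, k=5, penalty=0.01):
    """Mean k-fold accuracy on the masked features, minus a small size penalty."""
    if not any(mask):
        return 0.0  # an empty feature subset cannot classify anything
    n = len(X)
    scores = []
    for fold in range(k):
        test = list(range(fold, n, k))       # every k-th sample is held out
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        scores.append(nearest_centroid_accuracy(
            [X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test], mask))
    return sum(scores) / k - penalty * sum(mask) / len(mask)

def gwo_select(X, y, n_wolves=8, n_iter=20, seed=0):
    """Grey wolf optimizer over [0, 1]^dim; coordinate > 0.5 means 'keep feature'."""
    rng = random.Random(seed)
    dim = len(X[0])

    def to_mask(position):
        return [p > 0.5 for p in position]

    wolves = [[rng.random() for _ in range(dim)] for _ in range(n_wolves)]
    best_mask, best_fit = None, -1.0
    for t in range(n_iter):
        ranked = sorted(wolves, key=lambda w: cv_fitness(to_mask(w), X, y),
                        reverse=True)
        # Copy the three leaders so in-place updates below do not alter them.
        alpha, beta, delta = (list(w) for w in ranked[:3])
        fit = cv_fitness(to_mask(alpha), X, y)
        if fit > best_fit:
            best_fit, best_mask = fit, to_mask(alpha)
        a = 2.0 * (1 - t / n_iter)  # shrinks over time: exploration -> exploitation
        for w in wolves:
            for d in range(dim):
                total = 0.0
                for leader in (alpha, beta, delta):
                    A = 2 * a * rng.random() - a
                    C = 2 * rng.random()
                    total += leader[d] - A * abs(C * leader[d] - w[d])
                w[d] = min(1.0, max(0.0, total / 3))  # average, clamp to [0, 1]
    return best_mask, best_fit
```

Calling `gwo_select(X, y)` with `X` as a list of numeric feature rows and `y` as the matching class labels returns a boolean mask over the features and the best cross-validated fitness found. The coefficient `a` is what realizes the exploration and exploitation behaviour described above: while it is large, wolves can jump far from the three leaders, and as it shrinks they converge around them. Swapping the stand-in classifier for an SVM would only require replacing `nearest_centroid_accuracy`.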

Created:
2024-09-23