
"Eukaryotic Genomic Windows for Swarm-Optimized Exon Detection"

Source: DataCite Commons. Updated: 2026-02-05. Indexed: 2026-05-03.
Download link:
https://ieee-dataport.org/documents/eukaryotic-genomic-windows-swarm-optimized-exon-detection
Description:
"This dataset is derived from the benchmark GENSCAN training set, comprising 380 human genes (238 multi-exon and 142 single-exon sequences). To facilitate deep learning for exon detection, the raw genomic sequences over the alphabet {A, C, G, T} are one-hot encoded into four-channel numerical matrices (one channel per nucleotide). The data are structured using a sliding-window approach, in which fixed-length windows are labeled according to the annotation of the central nucleotide (exon vs. intron). To address the extreme biological class imbalance, in which non-coding regions vastly outnumber coding exons, the Synthetic Minority Over-sampling Technique (SMOTE) is integrated into the pipeline. This technique generates synthetic minority-class samples so that the model learns robust features for exon identification rather than biasing toward the majority class. The final dataset is organized into a 10-fold cross-validation structure to support rigorous evaluation and prevent data leakage."
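The preprocessing steps described above can be illustrated with a minimal sketch. It is not the dataset authors' pipeline: the function names (`one_hot`, `labeled_windows`, `smote_like`), the 4 x L matrix orientation, the window width, and the 0/1 exon mask are all assumptions made for illustration, and `smote_like` is a simplified stand-in for SMOTE (interpolation between a minority sample and one of its nearest minority neighbors), not the implementation the authors used.

```python
import random

NUCLEOTIDES = "ACGT"  # assumed row order of the one-hot matrix


def one_hot(seq):
    """Encode a DNA string as a 4 x len(seq) matrix (rows A, C, G, T).

    Any other symbol (e.g. N) yields an all-zero column; this is an assumption.
    """
    return [[1 if base == n else 0 for base in seq] for n in NUCLEOTIDES]


def labeled_windows(seq, exon_mask, width):
    """Yield (window_matrix, label) pairs for every full window.

    exon_mask[i] is 1 if position i is annotated as exon, else 0 (hypothetical
    format). Each window takes the label of its central nucleotide.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a central nucleotide")
    half = width // 2
    for center in range(half, len(seq) - half):
        window = seq[center - half:center + half + 1]
        yield one_hot(window), exon_mask[center]


def smote_like(minority, n_new, k=2, rng=None):
    """Create n_new synthetic vectors from a list of minority-class vectors.

    Simplified SMOTE-style sketch: pick a minority sample, pick one of its k
    nearest minority neighbors, and interpolate at a random point between them.
    Requires at least two minority samples.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [m for m in minority if m is not x]
        # Rank the remaining minority samples by squared Euclidean distance to x.
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbor = rng.choice(others[:k])
        u = rng.random()
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbor)])
    return synthetic
```

For a 10-base sequence and a window width of 5, `labeled_windows` yields six windows, one per position that has two neighbors on each side. When oversampling is combined with cross-validation, a common way to avoid leakage is to apply it to the training portion of each fold only; the description does not say how the dataset handles this.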

Provider:
IEEE DataPort
Created:
2026-02-05