
Dataset for training SENMAP, an automatic tool to curate LTR-retrotransposons using convolutional neural networks

NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://zenodo.org/record/11243248
Resource description:
Transposable elements (TEs) are genomic sequences that can move from one location in the genome to another. As a result, they can cause mutations or other changes that may be harmful, such as the appearance of diseases, or beneficial, such as contributing to genome evolution and genetic diversity. Long terminal repeat retrotransposons (LTR-RTs) are the most abundant TEs in plant species, which makes them particularly important to study. Over time, these elements can undergo changes called nested insertions, which can inactivate the element or modify how it functions; such elements are no longer considered intact and cannot be used for identification and classification studies. We created a dataset containing 56,442 LTR-RTs tagged as "non-intact" elements and 49,215 considered "intact". We formatted the sequence IDs to keep relevant information such as the superfamily and the lineage, as well as the category (Negative for "non-intact" and Positive for "intact" elements). This dataset (the npy files obtained from the FASTA file) was used to train SENMAP, a convolutional neural network architecture for obtaining intact LTR-RT sequences in plant genomes. The network is composed of four convolutional layers, with LeakyReLU as the activation function and BinaryFocalLoss as the loss function. SENMAP achieved an F1-score of 91.37% on test data and identifies low-quality sequences rapidly and efficiently, helping to curate libraries of LTR retrotransposons from the plant genomes published by large-scale sequencing projects in the post-genomic era.
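For readers unfamiliar with the two metrics named in the description, the following is a minimal pure-Python sketch of a binary focal loss and an F1-score for an intact (1) versus non-intact (0) classifier. It is not the SENMAP implementation: the function names, the `gamma=2.0` and `alpha=0.25` defaults, and the toy labels are assumptions chosen for illustration, since the description does not state the hyperparameters that were used.

```python
import math


def binary_focal_loss(y_true, y_prob, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss for labels in {0, 1} and predicted P(label == 1).

    gamma and alpha are illustrative defaults (assumptions), not values
    taken from the dataset description.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        p_t = p if y == 1 else 1.0 - p           # probability given to the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # Easy examples (p_t close to 1) are down-weighted by (1 - p_t) ** gamma.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


def f1_score(y_true, y_pred):
    """F1-score of the positive class (1 = "intact") from hard predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, yh in pairs if y == 1 and yh == 1)
    fp = sum(1 for y, yh in pairs if y == 0 and yh == 1)
    fn = sum(1 for y, yh in pairs if y == 1 and yh == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    labels = [1, 1, 0, 0]                 # toy labels, hypothetical
    probs = [0.9, 0.4, 0.2, 0.7]          # toy predicted probabilities
    preds = [1 if p >= 0.5 else 0 for p in probs]
    print("focal loss:", binary_focal_loss(labels, probs))
    print("F1:", f1_score(labels, preds))
```

With `gamma=0` and `alpha=0.5` the focal loss reduces to half of the ordinary binary cross-entropy, which is a convenient sanity check when comparing against a library implementation.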
Created:
2024-05-22