Diverse database and machine learning model to narrow the generalization gap in RNA structure prediction

Name: Diverse database and machine learning model to narrow the generalization gap in RNA structure prediction
Creator: Dryad
Published: 2026-03-05 23:17:34
License: No description available

Source: DataCite Commons. Updated 2026-03-05; indexed 2026-04-25.

Download link:

https://datadryad.org/dataset/doi:10.5061/dryad.79cnp5j95

Description:

This dataset contains the RNA secondary structure data used to train and test eFold, a deep learning model for RNA secondary structure prediction. It has three main components: (1) experimentally determined secondary structure models for 1,098 pri-miRNAs and 1,456 human mRNA regions, derived from DMS-MaP-seq chemical probing experiments, which are the original contribution of this work; (2) a curated pre-training dataset that combines subsets of the bpRNA (base-pair RNA database) and RNAStralign databases, filtered to remove redundant sequences and ArchiveII sequences as described in the associated publication; and (3) benchmark test sets for evaluating model performance on long and diverse RNA structures.

The dataset includes sequence files in FASTA format and the corresponding secondary structure annotations in dot-bracket notation. The structure models represent experimentally validated folding patterns and come with reactivity data from the chemical probing assays. The pri-miRNA structures range from 200 nucleotides in length and include precursor hairpins with flanking regions, while the mRNA structures range from 200 nucleotides to 1 kb and focus on functionally important regions, including 3' untranslated regions.

The dataset enables researchers to: (1) train and benchmark machine learning models for RNA structure prediction, particularly for long and complex RNAs that have traditionally been difficult to predict; (2) investigate RNA structural features in pri-miRNAs and mRNA regulatory regions; (3) compare the performance of computational methods against experimentally determined structures; and (4) develop improved algorithms that incorporate diverse RNA families beyond the short non-coding RNAs that dominate existing training sets.

All data are freely available without restrictions. No human subjects data or personally identifiable information is included. RNA sequences are derived from publicly available reference genomes and databases.
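Because the structures are annotated in dot-bracket notation and one intended use is comparing predictions against experimentally determined structures, a small parsing and scoring helper may be useful. The following is a minimal sketch, not part of the dataset or of eFold: the function names `dot_bracket_to_pairs` and `pair_f1` are hypothetical, and the code assumes annotations use only `(`, `)` and `.` (no pseudoknot brackets such as `[` or `]`). It makes no assumption about how the files in the archive are laid out.

```python
def dot_bracket_to_pairs(structure: str) -> list[tuple[int, int]]:
    """Return sorted 0-based (i, j) base pairs encoded by a dot-bracket string.

    Assumption: only '(', ')' and '.' appear; pseudoknot brackets are rejected.
    """
    stack: list[int] = []            # positions of '(' that are still open
    pairs: list[tuple[int, int]] = []
    for idx, char in enumerate(structure):
        if char == "(":
            stack.append(idx)
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {idx}")
            pairs.append((stack.pop(), idx))
        elif char != ".":
            raise ValueError(f"unsupported character {char!r} at position {idx}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def pair_f1(predicted: str, reference: str) -> float:
    """F1 score over base pairs between two structures of equal length."""
    if len(predicted) != len(reference):
        raise ValueError("structures must have the same length")
    pred = set(dot_bracket_to_pairs(predicted))
    ref = set(dot_bracket_to_pairs(reference))
    if not pred and not ref:
        return 1.0                   # both fully unpaired: treat as perfect agreement
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy hairpin, not taken from the dataset.
    print(dot_bracket_to_pairs("((..))"))
    print(pair_f1("(....)", "((..))"))
```

Scoring exact pair matches is one common convention; some benchmarks also count pairs shifted by one position as correct, which this sketch does not do.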

Provider:

Dryad

Created:

2026-01-29
