Diverse database and machine learning model to narrow the generalization gap in RNA structure prediction
DataCite Commons. Updated 2026-03-05; indexed 2026-04-25
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.79cnp5j95
Description:
This dataset contains RNA secondary structure data used for training and
testing eFold, a deep learning model for RNA secondary structure
prediction. The dataset comprises three main components: (1)
experimentally determined secondary structure models for 1,098 pri-miRNAs
and 1,456 human mRNA regions derived from DMS-MaP-seq chemical probing
experiments, representing the original contribution of this work; (2) a
curated pre-training dataset combining subsets of bpRNA (base-pair RNA
database) and RNAstralign databases, filtered to remove redundant
sequences and ArchiveII sequences as described in the associated
publication; and (3) benchmark test sets for evaluating model performance
on long and diverse RNA structures. The dataset includes sequence files in
FASTA format and corresponding secondary structure annotations in
dot-bracket notation. Structure models represent experimentally validated
folding patterns with reactivity data from chemical probing assays. The
pri-miRNA structures range from 200 nucleotides in length and include
precursor hairpins with flanking regions, while mRNA structures range from
200 nucleotides to 1 kb and focus on functionally important regions including
3' untranslated regions. This dataset enables researchers to: (1)
train and benchmark machine learning models for RNA structure prediction,
particularly for long and complex RNAs that have been traditionally
difficult to predict; (2) investigate RNA structural features in
pri-miRNAs and mRNA regulatory regions; (3) compare performance of
computational methods against experimentally determined structures; and
(4) develop improved algorithms that incorporate diverse RNA families
beyond the short non-coding RNAs that dominate existing training sets. All
data are freely available without restrictions. No human subjects data or
personally identifiable information is included. RNA sequences are derived
from publicly available reference genomes and databases.
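
Since the structure annotations are given in dot-bracket notation, comparing a predicted structure against an experimentally determined one comes down to comparing sets of base pairs. The following is a minimal sketch of that comparison, not the evaluation code used by the dataset authors. The function names `dot_bracket_to_pairs` and `pair_f1` are hypothetical. The sketch assumes only `(`, `)` and `.` carry meaning, so any other symbol (for example pseudoknot brackets) is treated as unpaired, and it scores exact pair matches with an F1 measure.

```python
def dot_bracket_to_pairs(structure: str) -> set[tuple[int, int]]:
    """Return the 0-based (i, j) base pairs encoded by a dot-bracket string."""
    stack: list[int] = []
    pairs: set[tuple[int, int]] = set()
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.add((stack.pop(), pos))
        # Assumption: every other character (e.g. '.') means "unpaired".
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def pair_f1(predicted: str, reference: str) -> float:
    """F1 score over exactly matching base pairs of two equal-length structures."""
    if len(predicted) != len(reference):
        raise ValueError("structures must have the same length")
    pred = dot_bracket_to_pairs(predicted)
    ref = dot_bracket_to_pairs(reference)
    if not pred and not ref:
        return 1.0  # both fully unpaired: treated as perfect agreement
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `pair_f1("(....)", "((..))")` scores a prediction that recovers one of the two reference pairs. Benchmarks sometimes also count pairs shifted by one position as correct; this sketch does not.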
Provider:
Dryad
Created:
2026-01-29



