bpRNA-NF-15.0: an RNA secondary structure dataset for family-wise evaluation

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://figshare.com/articles/dataset/bpRNA-NF-15_0_an_RNA_secondary_structure_dataset_for_family-wise_evaluation/28468946


Description:

bpRNA-NF-15.0 is an RNA sequence and secondary structure dataset that contains only families absent from bpRNA-1m. Using bpRNA-1m as the training dataset and bpRNA-NF-15.0 as the test dataset makes it possible to evaluate generalization to unseen families. To our knowledge, bpRNA-new is the only other current dataset designed for this purpose. bpRNA-NF-15.0 is based on the latest Rfam version (15.0) and contains twice as many new families as bpRNA-new. It also includes longer RNA sequences, up to 951 nt, whereas bpRNA-new only contains RNA sequences shorter than 500 nt.

We also provide here the Train, Validation and Test datasets described in our study. All three are built from bpRNA-1m. The Test dataset is a sequence-wise split with respect to the Train dataset: the tool CDHIT-EST was used to ensure that sequence similarity between the two datasets does not exceed 80%, but the two may still have RNA families in common.

Each dataset contains 3 variables:
- rna_name: the name of the sequence, as taken from the source dataset (Rfam for bpRNA-NF-15.0, or bpRNA-1m for Train / Validation / Test).
- seq: the RNA sequence.
- struct: its secondary structure in dot-bracket notation.

The bpRNA-NF-15.0 dataset was extracted from Rfam 15.0, following a procedure similar to the one used to build bpRNA-new. First, RNA sequences were selected from Rfam 15.0, but only from families that are not included in Rfam 12.2. Because bpRNA-1m was built from Rfam 12.2, this ensures that bpRNA-NF-15.0 shares no families with bpRNA-1m. Utility functions were then applied to clean up potential discrepancies, for example by converting sequence characters to capital letters and ensuring an efficient bracket representation. Non-canonical base pairs were removed. Finally, the CDHIT-EST software was applied at an 80% similarity threshold to remove redundancies in the dataset.

To cite this dataset, please use: Omnes L., Angel E., Bartet P., Tahi F. A divide-and-conquer approach based on deep learning for long RNA secondary structure prediction: focus on pseudoknots.
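The cleaning steps mentioned in the description (upper-casing sequences, removing non-canonical base pairs) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `parse_pairs` and `clean_record` are hypothetical, and two points are assumptions rather than facts from the description: that "canonical" means Watson-Crick pairs plus the G-U wobble pair, and that pseudoknots may be encoded with additional bracket types such as `[]`, `{}` and `<>`.

```python
# Assumption: the dot-bracket strings may use several bracket types
# (extended notation) so that pseudoknots can be represented.
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSERS = {close: open_ for open_, close in BRACKETS.items()}

# Assumption: "canonical" means Watson-Crick pairs plus the G-U wobble pair.
CANONICAL = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def parse_pairs(struct):
    """Return (i, j) index pairs encoded by a dot-bracket string.

    Each bracket type gets its own stack, so crossing pairs written with
    different bracket types (pseudoknots) are matched correctly.
    """
    stacks = {open_: [] for open_ in BRACKETS}
    pairs = []
    for j, ch in enumerate(struct):
        if ch in BRACKETS:
            stacks[ch].append(j)
        elif ch in CLOSERS:
            stack = stacks[CLOSERS[ch]]
            if not stack:
                raise ValueError(f"unmatched {ch!r} at position {j}")
            pairs.append((stack.pop(), j))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {j}")
    for open_, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {open_!r} at position {stack[-1]}")
    return pairs


def clean_record(seq, struct):
    """Upper-case the sequence and unpair every non-canonical base pair."""
    seq = seq.upper()
    if len(seq) != len(struct):
        raise ValueError("seq and struct must have the same length")
    chars = list(struct)
    for i, j in parse_pairs(struct):
        if (seq[i], seq[j]) not in CANONICAL:
            # Replace both brackets of the pair with unpaired dots.
            chars[i] = chars[j] = "."
    return seq, "".join(chars)
```

Under these assumptions, `clean_record("gagaaacac", "(((...)))")` keeps the two G-C pairs and turns the A-A pair into dots. The sketch assumes an RNA alphabet with `U`, so any pair involving `T` or an ambiguity code would be treated as non-canonical.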

Created:

2025-02-24
