bpRNA-NF-15.0: an RNA secondary structure dataset for family-wise evaluation

NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://figshare.com/articles/dataset/bpRNA-NF-15_0_an_RNA_secondary_structure_dataset_for_family-wise_evaluation/28468946
Description:
bpRNA-NF-15.0 is an RNA sequence and secondary structure dataset that contains only RNA families absent from bpRNA-1m. Training on bpRNA-1m and testing on bpRNA-NF-15.0 therefore makes it possible to check how well a model generalizes to unseen families. To our knowledge, bpRNA-new is the only other existing dataset designed for this purpose. bpRNA-NF-15.0 is based on the latest Rfam version (15.0) and contains twice as many new families as bpRNA-new. It also includes longer RNA sequences, up to 951 nt, whereas bpRNA-new only contains RNA sequences shorter than 500 nt.

We also provide here the Train, Validation and Test datasets described in our study. All three are built from bpRNA-1m. The Test dataset is sequence-wise with respect to the Train dataset: the tool CD-HIT-EST was used to ensure that sequence similarity between the two datasets cannot exceed 80%, but they may still share RNA families.

Each dataset contains 3 variables:
- rna_name: the name of this sequence, as taken from the source dataset (Rfam for bpRNA-NF-15.0, or bpRNA-1m for Train / Validation / Test).
- seq: the RNA sequence.
- struct: its secondary structure in dot-bracket notation.

The bpRNA-NF-15.0 dataset was extracted from Rfam 15.0, following a procedure similar to the one used to build bpRNA-new. First, RNA sequences were selected from Rfam 15.0, but only from families that are not included in Rfam 12.2. Because bpRNA-1m was built from Rfam 12.2, this ensures that bpRNA-NF-15.0 shares no families with bpRNA-1m. Utility functions were then applied to clean up potential discrepancies, for example by converting sequence characters to capital letters or by ensuring an efficient bracket representation. Non-canonical base pairs were removed. Finally, the CD-HIT-EST software was applied at an 80% similarity threshold to remove redundancies in the dataset.

To cite this dataset, please use: Omnes L., Angel E., Bartet P., Tahi F. A divide-and-conquer approach based on deep learning for long RNA secondary structure prediction: focus on pseudoknots.
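
The cleaning steps mentioned in the description (capitalizing the sequence, removing non-canonical base pairs) can be illustrated with a minimal Python sketch that works on one `seq` / `struct` record. This is not the authors' utility code. The names `BRACKETS`, `CANONICAL`, `parse_pairs` and `clean_record` are hypothetical, the sketch assumes that pseudoknots are written with the bracket families `()`, `[]`, `{}` and `<>`, and it assumes that "canonical" means Watson-Crick pairs plus G-U wobble pairs, which the description does not specify. The "efficient bracket representation" step is not covered.

```python
from typing import Dict, List, Tuple

# Assumption: pseudoknot levels are encoded with these bracket families only.
BRACKETS: Dict[str, str] = {"(": ")", "[": "]", "{": "}", "<": ">"}

# Assumption: Watson-Crick pairs plus G-U wobble count as canonical.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def parse_pairs(struct: str) -> List[Tuple[int, int, str]]:
    """Return sorted (i, j, opening_bracket) tuples, 0-indexed with i < j."""
    closers = {close: open_ for open_, close in BRACKETS.items()}
    stacks: Dict[str, List[int]] = {open_: [] for open_ in BRACKETS}
    pairs: List[Tuple[int, int, str]] = []
    for j, ch in enumerate(struct):
        if ch in BRACKETS:
            # Opening bracket: remember its position on the stack of its family.
            stacks[ch].append(j)
        elif ch in closers:
            opener = closers[ch]
            if not stacks[opener]:
                raise ValueError(f"unmatched {ch!r} at position {j}")
            pairs.append((stacks[opener].pop(), j, opener))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {j}")
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return sorted(pairs)


def clean_record(seq: str, struct: str) -> Tuple[str, str]:
    """Uppercase the sequence and replace non-canonical pairs with dots."""
    seq = seq.upper()
    if len(seq) != len(struct):
        raise ValueError("seq and struct must have the same length")
    chars = list(struct)
    for i, j, _ in parse_pairs(struct):
        if (seq[i], seq[j]) not in CANONICAL:
            # Unpair both positions of a non-canonical base pair.
            chars[i] = chars[j] = "."
    return seq, "".join(chars)


if __name__ == "__main__":
    # The A-C pair at positions 1 and 7 is not canonical and gets removed.
    print(clean_record("gagaaaccc", "(((...)))"))
```

Keeping one stack per bracket family lets crossing pairs (pseudoknots) be parsed independently of the nested `()` pairs, which a single shared stack would reject.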
Created:
2025-02-24