
multimolecule/archiveii.1024

Hugging Face | Updated 2024-11-15 | Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/multimolecule/archiveii.1024
Resource description:
---
language: rna
tags:
  - Biology
  - RNA
license:
  - agpl-3.0
size_categories:
  - 10K<n<100K
source_datasets:
  - multimolecule/bprna
  - multimolecule/pdb
task_categories:
  - text-generation
  - fill-mask
task_ids:
  - language-modeling
  - masked-language-modeling
pretty_name: ArchiveII
library_name: multimolecule
---

# ArchiveII

ArchiveII is a dataset of RNA sequences and their secondary structures, widely used in RNA secondary structure prediction benchmarks.

ArchiveII contains 2975 RNA samples across 10 RNA families, with sequence lengths ranging from 28 to 2968 nucleotides.
This dataset is frequently used to evaluate RNA secondary structure prediction methods, including those that handle both pseudoknotted and non-pseudoknotted structures.

It is considered complementary to the [RNAStrAlign](./rnastralign) dataset.

## Disclaimer

This is an UNOFFICIAL release of the ArchiveII by Mehdi Saman Booy, et al.

**The team releasing ArchiveII did not write a dataset card for this dataset, so this dataset card has been written by the MultiMolecule team.**

## Dataset Description

- **Homepage**: https://multimolecule.danling.org/datasets/archiveii
- **datasets**: https://huggingface.co/datasets/multimolecule/archiveii
- **Point of Contact**: [Mehdi Saman Booy](mailto:mehdi.samanbooy@aalto.fi)

## Example Entry

| id                  | sequence                            | secondary_structure                  | family   |
| ------------------- | ----------------------------------- | ------------------------------------ | -------- |
| 16S_rRNA-A.fulgidus | AUUCUGGUUGAUCCUGCCAGAGGCCGCUGCUA... | ...(((((...(((.))))).((((((((((....  | 16S_rRNA |

## Column Description

- **id**:
  A unique identifier for each RNA entry.
  This ID is derived from the family and the original `.sta` file name, and serves as a reference to the specific RNA structure within the dataset.

- **sequence**:
  The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:

  - **A**: Adenine
  - **C**: Cytosine
  - **G**: Guanine
  - **U**: Uracil

- **secondary_structure**:
  The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA's standard:

  - **Dots (`.`)**: Represent unpaired nucleotides.
  - **Parentheses (`(` and `)`)**: Represent base pairs in standard stems (page 1).

- **family**:
  The RNA family to which the sequence belongs, such as 16S rRNA, 5S rRNA, etc.

## Variations

This dataset is available in three variants:

- [archiveii](https://huggingface.co/datasets/multimolecule/archiveii): The main ArchiveII dataset.
- [archiveii.512](https://huggingface.co/datasets/multimolecule/archiveii.512): ArchiveII dataset with sequences no longer than 512 nucleotides.
- [archiveii.1024](https://huggingface.co/datasets/multimolecule/archiveii.1024): ArchiveII dataset with sequences no longer than 1024 nucleotides.

## Related Datasets

- [RNAStrAlign](https://huggingface.co/datasets/multimolecule/rnastralign): A database of RNA secondary structures with the same families as ArchiveII, usually used for training.
- [bpRNA-spot](https://huggingface.co/datasets/multimolecule/bprna-spot): Another commonly used database in RNA secondary structure prediction.

## License

This dataset is licensed under the [AGPL-3.0 License](https://www.gnu.org/licenses/agpl-3.0.html).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```

## Citation

```bibtex
@article{samanbooy2022rna,
  author    = {Saman Booy, Mehdi and Ilin, Alexander and Orponen, Pekka},
  journal   = {BMC Bioinformatics},
  keywords  = {Deep learning; Pseudoknotted structures; RNA structure prediction},
  month     = feb,
  number    = 1,
  pages     = {58},
  publisher = {Springer Science and Business Media LLC},
  title     = {{RNA} secondary structure prediction with convolutional neural networks},
  volume    = 23,
  year      = 2022
}
```
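
## Reading the `secondary_structure` Column (illustrative)

The dot-bracket notation described under Column Description can be turned into explicit base pairs with a simple stack. The following is a minimal sketch, not part of the MultiMolecule library: the function name `dot_bracket_to_pairs` and the toy structure string are hypothetical, and the parser handles only the dots and parentheses listed in the card (any other symbol raises an error).

```python
def dot_bracket_to_pairs(structure: str) -> list[tuple[int, int]]:
    """Return 0-based (i, j) index pairs for a dot-bracket string.

    Hypothetical helper: only '.', '(' and ')' are accepted.
    """
    open_positions = []  # stack of indices of unmatched '('
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(index)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {index}")
            # The most recent unmatched '(' pairs with this ')'.
            pairs.append((open_positions.pop(), index))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {index}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


# Toy hairpin (made up for illustration, not an ArchiveII entry).
print(dot_bracket_to_pairs("(((...)))"))  # [(0, 8), (1, 7), (2, 6)]
```

Each returned pair gives the positions of two nucleotides in the `sequence` column that are base-paired; positions that appear in no pair correspond to dots.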
Provider:
multimolecule