
multimolecule/archiveii.512

Source: Hugging Face. Updated 2024-11-15; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/multimolecule/archiveii.512
Description:
---
language: rna
tags:
  - Biology
  - RNA
license:
  - agpl-3.0
size_categories:
  - 10K<n<100K
source_datasets:
  - multimolecule/bprna
  - multimolecule/pdb
task_categories:
  - text-generation
  - fill-mask
task_ids:
  - language-modeling
  - masked-language-modeling
pretty_name: ArchiveII
library_name: multimolecule
---

# ArchiveII

ArchiveII is a dataset of RNA sequences and their secondary structures, widely used in RNA secondary structure prediction benchmarks.

ArchiveII contains 2975 RNA samples across 10 RNA families, with sequence lengths ranging from 28 to 2968 nucleotides.
This dataset is frequently used to evaluate RNA secondary structure prediction methods, including those that handle both pseudoknotted and non-pseudoknotted structures.

It is considered complementary to the [RNAStrAlign](./rnastralign) dataset.

## Disclaimer

This is an UNOFFICIAL release of ArchiveII by Mehdi Saman Booy et al.

**The team releasing ArchiveII did not write a dataset card for this dataset, so this dataset card was written by the MultiMolecule team.**

## Dataset Description

- **Homepage**: https://multimolecule.danling.org/datasets/archiveii
- **datasets**: https://huggingface.co/datasets/multimolecule/archiveii
- **Point of Contact**: [Mehdi Saman Booy](mailto:mehdi.samanbooy@aalto.fi)

## Example Entry

| id                  | sequence                            | secondary_structure                  | family   |
| ------------------- | ----------------------------------- | ------------------------------------ | -------- |
| 16S_rRNA-A.fulgidus | AUUCUGGUUGAUCCUGCCAGAGGCCGCUGCUA... | ...(((((...(((.))))).((((((((((....  | 16S_rRNA |

## Column Description

- **id**:
  A unique identifier for each RNA entry.
  This ID is derived from the family and the original `.sta` file name, and serves as a reference to the specific RNA structure within the dataset.

- **sequence**:
  The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:

  - **A**: Adenine
  - **C**: Cytosine
  - **G**: Guanine
  - **U**: Uracil

- **secondary_structure**:
  The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA's standard:

  - **Dots (`.`)**: Represent unpaired nucleotides.
  - **Parentheses (`(` and `)`)**: Represent base pairs in standard stems (page 1).

- **family**:
  The RNA family to which the sequence belongs, such as 16S rRNA, 5S rRNA, etc.

## Variations

This dataset is available in three variants:

- [archiveii](https://huggingface.co/datasets/multimolecule/archiveii): The main ArchiveII dataset.
- [archiveii.512](https://huggingface.co/datasets/multimolecule/archiveii.512): ArchiveII dataset with sequences no longer than 512 nucleotides.
- [archiveii.1024](https://huggingface.co/datasets/multimolecule/archiveii.1024): ArchiveII dataset with sequences no longer than 1024 nucleotides.

## Related Datasets

- [RNAStrAlign](https://huggingface.co/datasets/multimolecule/rnastralign): A database of RNA secondary structures with the same families as ArchiveII, usually used for training.
- [bpRNA-spot](https://huggingface.co/datasets/multimolecule/bprna-spot): Another database commonly used in RNA secondary structure prediction.

## License

This dataset is licensed under the [AGPL-3.0 License](https://www.gnu.org/licenses/agpl-3.0.html).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```

## Citation

```bibtex
@article{samanbooy2022rna,
  author    = {Saman Booy, Mehdi and Ilin, Alexander and Orponen, Pekka},
  journal   = {BMC Bioinformatics},
  keywords  = {Deep learning; Pseudoknotted structures; RNA structure prediction},
  month     = feb,
  number    = 1,
  pages     = {58},
  publisher = {Springer Science and Business Media LLC},
  title     = {{RNA} secondary structure prediction with convolutional neural networks},
  volume    = 23,
  year      = 2022
}
```
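## Working with the Columns (illustrative)

The `secondary_structure` column described above can be turned into explicit base pairs with a small stack-based parser. The following is a minimal sketch, not part of the MultiMolecule library: the function names `dot_bracket_to_pairs` and `fits_variant` are hypothetical, and the parser assumes the string contains only the symbols documented in this card (`.`, `(` and `)`), rejecting anything else. The length check mirrors the "no longer than 512 nucleotides" rule of the `archiveii.512` variant.

```python
from typing import List, Tuple


def dot_bracket_to_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return sorted 0-based (i, j) index pairs for matched parentheses.

    Assumption: only '.', '(' and ')' appear in the string; any other
    symbol raises ValueError instead of being silently ignored.
    """
    open_positions: List[int] = []  # stack of indices of unmatched '('
    pairs: List[Tuple[int, int]] = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(index)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {index}")
            # The most recent unmatched '(' pairs with this ')'.
            pairs.append((open_positions.pop(), index))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {index}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


def fits_variant(sequence: str, max_length: int = 512) -> bool:
    """Check the length rule of a size-limited variant (hypothetical helper)."""
    return len(sequence) <= max_length
```

For a toy structure (not taken from the dataset), `dot_bracket_to_pairs("((..))")` returns `[(0, 5), (1, 4)]`: the outer parentheses pair positions 0 and 5, the inner ones positions 1 and 4, and the two dots stay unpaired.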

Provider: multimolecule