afg1/rnacentral_subset
Source: Hugging Face | Updated: 2024-04-24 | Indexed: 2024-06-12
Download link:
https://hf-mirror.com/datasets/afg1/rnacentral_subset
Description:
---
license: cc0-1.0
size_categories:
- 1M<n<10M
---
This is a Parquet-formatted subset of the RNAcentral active FASTA file, available here: https://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/24.0/sequences/rnacentral_active.fasta.gz
I have preprocessed it lightly, keeping only sequences that are shorter than 8192 nt and contain no ambiguous nucleotides (i.e. no Ns or other non-standard characters).
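
The filter described above could be sketched roughly as follows. This is not the script used to build the dataset: the function names `keep_sequence` and `filter_records` are hypothetical, and the accepted alphabet (A, C, G, plus T or U, case-insensitive) is an assumption about what "standard" means here.

```python
import re
from typing import Iterable, Iterator, Tuple

MAX_LEN = 8192  # sequences must be strictly shorter than this many nucleotides

# Assumption: "standard" nucleotides are A, C, G and T or U, in either case.
STANDARD = re.compile(r"[ACGTU]+", re.IGNORECASE)


def keep_sequence(seq: str) -> bool:
    """Return True if seq is non-empty, short enough, and unambiguous."""
    return len(seq) < MAX_LEN and STANDARD.fullmatch(seq) is not None


def filter_records(
    records: Iterable[Tuple[str, str]]
) -> Iterator[Tuple[str, str]]:
    """Yield only the (identifier, sequence) pairs that pass the filter."""
    for record_id, seq in records:
        if keep_sequence(seq):
            yield record_id, seq
```

Any FASTA parser that yields (identifier, sequence) pairs can feed `filter_records`; a sequence containing `N`, or one of exactly 8192 nt, would be dropped.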
This dataset is about 10% of the full file and comprises 3,252,483 (roughly 3.25 million) sequences, or 2,642,703,990 (about 2.6 billion) bases.
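
As a quick sanity check on these figures, the mean sequence length follows directly from the two totals. The variable names below are illustrative only.

```python
# Totals quoted in the dataset description.
n_sequences = 3_252_483
n_bases = 2_642_703_990

# Mean length in nucleotides: total bases divided by number of sequences.
mean_length = n_bases / n_sequences
print(f"mean sequence length: {mean_length:.1f} nt")
```

The mean comes out at a little over 800 nt, well below the 8192 nt cutoff.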
The train/val/test split is 60/20/20.
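
How the split was produced is not stated. One common way to obtain a 60/20/20 partition is a seeded random shuffle of row indices, sketched below; `split_indices`, its `percents` parameter, and the seed are hypothetical and not taken from the dataset's own tooling.

```python
import random
from typing import List, Tuple


def split_indices(
    n: int, percents: Tuple[int, int, int] = (60, 20, 20), seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle range(n) and cut it into train/val/test index lists."""
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n * percents[0] // 100
    n_val = n * percents[1] // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to test
    return train, val, test
```

Because the test set takes whatever is left over, the three parts always cover every row exactly once, even when `n` is not divisible by 100.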
Provider:
afg1
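Summary of Original Information
Dataset Overview
Basic Information
- License: CC0-1.0
- Dataset size: 1M<n<10M
Data Source and Processing
- Source data: the RNAcentral active FASTA file, at https://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/24.0/sequences/rnacentral_active.fasta.gz
- Preprocessing: only sequences shorter than 8192 nt with no ambiguous nucleotides (no N or other non-standard characters) are included
Dataset Size
- Number of sequences: 3,252,483 (roughly 3.25 million)
- Total bases: 2,642,703,990 (about 2.6 billion)
Dataset Splits
- Train/validation/test ratio: 60/20/20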



