omicseye/prok_heavy
Source: Hugging Face. Updated: 2025-03-17. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/omicseye/prok_heavy
Description:
This is a genomic dataset for pretraining. It includes 19,551 representative genomes from the National Center for Biotechnology Information (NCBI) database as of February 23, 2024: 18,268 bacterial, 647 archaeal, 577 fungal, and 40 viral genomes, plus 1 human reference genome. Samples are created by slicing each genome into fragments of 3,200 nucleotides, with 100 bases of overlap between fragments. The dataset is split into training, validation, and test sets of 27,831,882, 3,478,985, and 3,478,986 samples respectively. Each sample records the genome assembly identifier, the sequence accession identifier, the line number in the file, the start position of the fragment within the accession, the DNA sequence itself, the length of the sequence sample, and the dataset split category.
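The slicing step described above can be illustrated with a minimal sketch. It is not the dataset authors' code: the function name `slice_genome`, the 0-based start positions, and the choice to keep a shorter final fragment are all assumptions made for illustration. Only the 3,200-nucleotide fragment length and the 100-base overlap come from the description.

```python
from typing import Iterator, Tuple


def slice_genome(
    sequence: str, window: int = 3200, overlap: int = 100
) -> Iterator[Tuple[int, str]]:
    """Yield (start, fragment) pairs covering `sequence`.

    Consecutive fragments share `overlap` bases, so each new fragment
    starts `window - overlap` positions after the previous one.
    Assumption: start positions are 0-based and the last fragment
    may be shorter than `window`.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap  # 3100 with the default values
    for start in range(0, len(sequence), step):
        yield start, sequence[start:start + window]
        # Stop once a fragment has reached the end of the sequence,
        # so no fragment consists purely of already-covered bases.
        if start + window >= len(sequence):
            break
```

With the default values, a 7,000-nucleotide sequence yields three fragments starting at positions 0, 3100, and 6200, with lengths 3200, 3200, and 800. How the real dataset treats a short trailing fragment is not stated in the description.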
Provider:
omicseye



