tattabio/OG
Source: Hugging Face. Updated 2024-08-19; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/tattabio/OG
Description:
The OG dataset is a subset of the Open MetaGenomic dataset (OMG). It contains high-quality prokaryotic and viral genomes with taxonomic information. The data is pre-processed into a mixed-modality format: protein coding sequences (CDS) are stored as translated amino acids, and intergenic sequences (IGS) are stored as nucleic acids. Each row represents one genomic scaffold as an ordered list of amino acid CDS elements and nucleotide IGS elements.
The dataset features are CDS_seqs (a list of amino acid CDS sequences), IGS_seqs (a list of nucleotide IGS sequences), CDS_position_ids (a list of integers giving the position of each CDS element in the scaffold), IGS_position_ids (a list of integers giving the position of each IGS element in the scaffold), CDS_ids (a list of string identifiers, one per CDS element), IGS_ids (a list of string identifiers, one per IGS element), and CDS_orientations (a list of booleans indicating the orientation of each CDS). The dataset is available as a train split; the dataset card gives its number of examples and size in bytes. The card also lists related datasets and provides a citation for the dataset.
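Because CDS and IGS elements are stored in separate lists, recovering the scaffold order means merging the two lists by their position ids. The following is a minimal sketch of that merge. It assumes that CDS_position_ids and IGS_position_ids index into one shared ordering of scaffold elements, which the description suggests but does not state outright. The helper name `interleave_scaffold` and the toy row are hypothetical and are not taken from the dataset.

```python
from typing import Any, Dict, List, Tuple


def interleave_scaffold(row: Dict[str, Any]) -> List[Tuple[int, str, str, str]]:
    """Merge the CDS and IGS elements of one row into scaffold order.

    Returns (position_id, kind, element_id, sequence) tuples sorted by
    position id. Assumption: CDS and IGS position ids share one index space.
    """
    cds = zip(row["CDS_position_ids"], row["CDS_ids"], row["CDS_seqs"])
    igs = zip(row["IGS_position_ids"], row["IGS_ids"], row["IGS_seqs"])
    elements = [(pos, "CDS", eid, seq) for pos, eid, seq in cds]
    elements += [(pos, "IGS", eid, seq) for pos, eid, seq in igs]
    return sorted(elements, key=lambda element: element[0])


# Toy row with made-up values that mimics the documented feature names.
toy_row = {
    "CDS_seqs": ["MKT", "MSL"],        # amino acid sequences
    "IGS_seqs": ["ATGC"],              # nucleotide sequence
    "CDS_position_ids": [0, 2],
    "IGS_position_ids": [1],
    "CDS_ids": ["cds_a", "cds_b"],
    "IGS_ids": ["igs_a"],
    "CDS_orientations": [True, False],  # carried along, not interpreted here
}

ordered = interleave_scaffold(toy_row)
print([kind for _, kind, _, _ in ordered])  # ['CDS', 'IGS', 'CDS']
```

With the Hugging Face `datasets` library installed, real rows can presumably be obtained with `load_dataset("tattabio/OG", split="train", streaming=True)` and passed to the same helper. The sketch leaves CDS_orientations untouched because the description does not say which boolean value denotes which strand.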
Provider:
tattabio



