
tattabio/OG

Hugging Face. Updated 2024-08-19. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/tattabio/OG
Description:
The OG dataset is a subset of the Open MetaGenomic dataset (OMG) containing high-quality prokaryotic and viral genomes with taxonomic information. It is pre-processed into a mixed-modality format: protein coding sequences are stored as translated amino acids, and intergenic sequences are stored as nucleic acids. Each row represents one genomic scaffold as an ordered list of amino acid coding sequences (CDS) and nucleotide intergenic sequences (IGS).

Each row has the following features: CDS_seqs (a list of amino acid CDS sequences), IGS_seqs (a list of nucleotide IGS sequences), CDS_position_ids (a list of integers giving the position of each CDS element in the scaffold), IGS_position_ids (a list of integers giving the position of each IGS element in the scaffold), CDS_ids (a list of string identifiers, one per CDS element), IGS_ids (a list of string identifiers, one per IGS element), and CDS_orientations (a list of booleans indicating the orientation of each CDS). The dataset is provided in a train split; the dataset card states its example count and size in bytes, and also lists related datasets and a citation.
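
As a rough illustration of how the per-row features fit together, the sketch below merges the CDS and IGS lists of one row back into scaffold order. It assumes that CDS_position_ids and IGS_position_ids index into one shared ordering of scaffold elements, which the description implies but does not state outright. The helper names `interleave_scaffold` and `load_first_row` and the toy row are hypothetical and are not part of the dataset. The meaning of a True or False value in CDS_orientations is not specified here, so the sketch leaves that field uninterpreted.

```python
from typing import Any, Dict, List, Tuple


def interleave_scaffold(row: Dict[str, Any]) -> List[Tuple[int, str, str, str]]:
    """Merge the CDS and IGS elements of one row into scaffold order.

    Returns (position, kind, identifier, sequence) tuples sorted by position.
    Assumption: CDS and IGS position ids share a single index space.
    """
    elements: List[Tuple[int, str, str, str]] = []
    for pos, ident, seq in zip(row["CDS_position_ids"], row["CDS_ids"], row["CDS_seqs"]):
        elements.append((pos, "CDS", ident, seq))
    for pos, ident, seq in zip(row["IGS_position_ids"], row["IGS_ids"], row["IGS_seqs"]):
        elements.append((pos, "IGS", ident, seq))
    return sorted(elements, key=lambda element: element[0])


def load_first_row() -> Dict[str, Any]:
    """Fetch one real row by streaming (needs the `datasets` package and network access)."""
    from datasets import load_dataset

    stream = load_dataset("tattabio/OG", split="train", streaming=True)
    return next(iter(stream))


# Toy row with made-up sequences and identifiers, only to show the expected shape.
toy_row = {
    "CDS_seqs": ["MKT", "MSL"],
    "IGS_seqs": ["ATGC"],
    "CDS_position_ids": [0, 2],
    "IGS_position_ids": [1],
    "CDS_ids": ["cds_a", "cds_b"],
    "IGS_ids": ["igs_a"],
    "CDS_orientations": [True, False],
}

ordered = interleave_scaffold(toy_row)
for position, kind, identifier, sequence in ordered:
    print(position, kind, identifier, sequence)
```

Calling `interleave_scaffold(load_first_row())` would apply the same merge to a real row. Streaming is used in that helper so that a single row can be inspected without downloading the whole train split.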
Provider:
tattabio