InstaDeepAI/plant-multi-species-genomes

Name: InstaDeepAI/plant-multi-species-genomes
Creator: InstaDeepAI
Published: 2024-04-08 21:14:51
License: no description available

Source: Hugging Face. Updated: 2024-04-08. Indexed: 2024-06-11.

Download link:

https://hf-mirror.com/datasets/InstaDeepAI/plant-multi-species-genomes

Overview:

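The Plant Multi Species dataset was built by parsing plant genomes available on NCBI and contains genome sequences from 48 different plant species. Each sequence is 6,200 base pairs long, and the dataset is used to pre-train the AgroNT model. It has three splits: training, validation, and test. Each data instance consists of a sequence string, a sequence description string, and the start and end positions of the sequence.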

Provided by:

InstaDeepAI

Summary of original information

Dataset overview

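Name: Plant Multi Species Genomes

Tags: DNA, Genomics, Nucleotide

Construction: built by parsing the plant genome data available on NCBI.

Number of species: 48 different plant species

Purpose: pre-training corpus for the AgroNT model.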

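Sequence properties: each sequence is 6,200 base pairs long. Consecutive sequences overlap, which lets the model start processing a sequence at a different position in each epoch and so cover the whole chromosome.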
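
The overlapping-window idea can be illustrated with a short Python sketch. The window length of 6,200 bp comes from the description above, but the overlap size is not stated in the source, so the `overlap` default below is purely an assumption. `chunk_with_overlap` is a hypothetical helper written for illustration, not part of the dataset's own tooling, and the 0-based, end-exclusive positions are likewise an assumed convention.

```python
from typing import Iterator, Tuple

WINDOW = 6200  # window length in base pairs, as stated for this dataset


def chunk_with_overlap(
    seq: str, window: int = WINDOW, overlap: int = 200
) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, subsequence) for overlapping windows over `seq`.

    `overlap` is an assumed value; the source does not give the real one.
    Positions are 0-based with an exclusive end (also an assumption).
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap  # distance between consecutive window starts
    start = 0
    while start < len(seq):
        end = min(start + window, len(seq))  # last window may be shorter
        yield start, end, seq[start:end]
        if end == len(seq):
            break  # the end of the sequence has been covered
        start += step
```

Calling `list(chunk_with_overlap(chromosome))` on a chromosome string returns windows whose neighbours share `overlap` bases, so a model that reads a shorter span from a different offset inside each window still sees every position over several epochs.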

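Data instance structure

sequence: string containing the DNA sequence
description: string containing the species information and NCBI ID
start_pos: integer index of the sequence's start position
end_pos: integer index of the sequence's end position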
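
As a rough illustration of the record layout above, the following Python sketch models one instance as a dataclass and performs basic sanity checks. The class name `GenomeRecord`, the `from_dict` helper, and the sample row are all hypothetical. The source does not say whether `end_pos` is inclusive or exclusive, so the sketch only assumes that positions are non-negative and ordered, and does not compare them with the sequence length.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class GenomeRecord:
    """One data instance with the four fields listed above."""

    sequence: str     # DNA sequence string
    description: str  # species information and NCBI ID
    start_pos: int    # integer index of the start position
    end_pos: int      # integer index of the end position

    @classmethod
    def from_dict(cls, row: Mapping[str, Any]) -> "GenomeRecord":
        """Build a record from a dict-like row and sanity-check positions."""
        rec = cls(
            sequence=str(row["sequence"]),
            description=str(row["description"]),
            start_pos=int(row["start_pos"]),
            end_pos=int(row["end_pos"]),
        )
        # Assumption: positions are non-negative and start <= end.
        if rec.start_pos < 0 or rec.end_pos < rec.start_pos:
            raise ValueError("start_pos/end_pos are out of order or negative")
        return rec


# Made-up row for illustration only; not taken from the dataset.
example = GenomeRecord.from_dict(
    {
        "sequence": "ACGTACGT",
        "description": "hypothetical species, hypothetical NCBI ID",
        "start_pos": 0,
        "end_pos": 8,
    }
)
```

A frozen dataclass keeps each record immutable, which is a design choice of this sketch rather than anything the dataset requires.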

Data splits

Training set
Validation set
Test set

Citation

@article{mendoza2023foundational,
  title={A Foundational Large Language Model for Edible Plant Genomes},
  author={Mendoza-Revilla, Javier and Trop, Evan and Gonzalez, Liam and Roller, Masa and Dalla-Torre, Hugo and de Almeida, Bernardo P and Richard, Guillaume and Caton, Jonathan and Lopez Carranza, Nicolas and Skwark, Marcin and others},
  journal={bioRxiv},
  pages={2023--10},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}