InstaDeepAI/multi_species_genomes

Name: InstaDeepAI/multi_species_genomes
Creator: InstaDeepAI
Published: 2024-07-19 11:45:24
License: not specified

Source: Hugging Face. Updated: 2024-07-19. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/InstaDeepAI/multi_species_genomes

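Resource summary:

The multi-species genome dataset was built by parsing the genomes available on NCBI, with only one species randomly selected from each genus. Plant and virus genomes are excluded because their regulatory elements differ from those of interest in the paper's tasks. The resulting collection was downsampled to 850 species, representing 174 billion nucleotides and roughly 29 billion tokens. The dataset was used to pre-train the Nucleotide Transformer models, with each sequence 6,200 or 12,200 base pairs long. It comprises three splits: training, validation, and test.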
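
As a quick sanity check on these figures, the average number of nucleotides per token can be computed directly. The sketch below uses only the two totals quoted in the description (174 billion nucleotides, roughly 29 billion tokens); the constant and function names are illustrative, and reading the ratio as a fixed group size per token is an assumption, not something this page states.

```python
# Totals quoted in the dataset description.
TOTAL_NUCLEOTIDES = 174e9  # 174 billion nucleotides
TOTAL_TOKENS = 29e9        # roughly 29 billion tokens


def nucleotides_per_token(nucleotides: float, tokens: float) -> float:
    """Average number of nucleotides represented by one token."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return nucleotides / tokens


ratio = nucleotides_per_token(TOTAL_NUCLEOTIDES, TOTAL_TOKENS)
print(f"{ratio:.1f} nucleotides per token")
```

The ratio comes out at about 6, which would be consistent with a tokenizer that groups about six bases into one token. Since the token count is only approximate, the result should be read as a rough figure.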

Provider:

InstaDeepAI

Summary of original information

Dataset overview

Dataset name

Name: multi_species_genomes
Alias: Multi-species genome

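Dataset description

Construction: built by parsing the genomes available on NCBI, arbitrarily selecting one species per genus, and excluding plant and virus genomes.
Number of species: 850
Total nucleotides: 174 billion
Total tokens: about 29 billion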

Dataset composition

Classes and counts (B = billions of nucleotides):

Class	Number of species	Number of nucleotides (B)
Bacteria	667	17.1
Fungi	46	2.3
Invertebrate	39	20.8
Protozoa	10	0.5
Mammalian Vertebrate	31	69.8
Other Vertebrate	57	63.4
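
The per-class figures can be checked against the stated totals of 850 species and about 174 billion nucleotides. The following is a minimal sketch that copies the six rows into a list (the name `COMPOSITION` is illustrative) and sums them.

```python
# (class, number of species, nucleotides in billions), copied from the table.
COMPOSITION = [
    ("Bacteria", 667, 17.1),
    ("Fungi", 46, 2.3),
    ("Invertebrate", 39, 20.8),
    ("Protozoa", 10, 0.5),
    ("Mammalian Vertebrate", 31, 69.8),
    ("Other Vertebrate", 57, 63.4),
]

total_species = sum(n for _, n, _ in COMPOSITION)
total_nucleotides_b = sum(b for _, _, b in COMPOSITION)

print(f"species: {total_species}")
print(f"nucleotides: {total_nucleotides_b:.1f} B")

# Share of nucleotides contributed by each class, largest first.
for name, _, b in sorted(COMPOSITION, key=lambda row: row[2], reverse=True):
    print(f"{name:<22}{b / total_nucleotides_b:6.1%}")
```

The rows sum to 850 species and 173.9 billion nucleotides, which matches the stated totals after rounding. Bacteria account for about 78% of the species but only about 10% of the nucleotides, while the two vertebrate classes together contribute roughly three quarters of the nucleotides.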

Dataset usage

Purpose: pre-training corpus for the Nucleotide Transformer models.
Sequence length: 6,200 or 12,200 base pairs.

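Dataset structure

Data instances:
- sequence: the DNA sequence
- description: description of the sequence, including the species and NCBI ID
- start_pos: start position of the sequence
- end_pos: end position of the sequence
- fasta_url: URL of the FASTA file from which the sequence can be downloaded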
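
To make the field layout concrete, here is a minimal sketch of one record and of how a longer sequence might be cut into fixed-length windows. `GenomeChunk` and `chunk_sequence` are hypothetical names, not part of the dataset's own code. The half-open `[start_pos, end_pos)` convention, the absence of overlap between windows, and the example description and URL are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class GenomeChunk:
    sequence: str      # DNA sequence
    description: str   # species name and NCBI ID
    start_pos: int     # start of the chunk in the source sequence
    end_pos: int       # end of the chunk (assumed exclusive in this sketch)
    fasta_url: str     # URL of the FASTA file the sequence came from


def chunk_sequence(sequence: str, description: str, fasta_url: str,
                   chunk_len: int = 6200) -> Iterator[GenomeChunk]:
    """Yield consecutive, non-overlapping windows of at most chunk_len bases."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    for start in range(0, len(sequence), chunk_len):
        end = min(start + chunk_len, len(sequence))
        yield GenomeChunk(sequence[start:end], description, start, end, fasta_url)


# Toy input: 15,000 bases, so two full 6,200-base windows plus a remainder.
toy = "ACGT" * 3750
chunks = list(chunk_sequence(toy, "example species (hypothetical ID)",
                             "https://example.org/genome.fasta"))
for c in chunks:
    print(c.start_pos, c.end_pos, len(c.sequence))
```

Because each window carries its own `start_pos` and `end_pos`, the chunks can be mapped back to coordinates in the source FASTA file. How the real dataset handles the final, shorter window is not stated here; this sketch simply keeps it.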

Dataset language

Language: DNA

Citation

@article{dalla2023nucleotide,
  title={The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics},
  author={Dalla-Torre, Hugo and Gonzalez, Liam and Mendoza Revilla, Javier and Lopez Carranza, Nicolas and Henryk Grywaczewski, Adam and Oteri, Francesco and Dallago, Christian and Trop, Evan and Sirelkhatim, Hassan and Richard, Guillaume and others},
  journal={bioRxiv},
  pages={2023--01},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
